
Learning using the Born Rule

Lior Wolf∗

Abstract

In Quantum Mechanics the transition from a deterministic description to a probabilistic one is done using a simple rule termed the Born rule. This rule states that the probability of an outcome ($a$) given a state ($\Psi$) is the square of their inner product ($(a^\top \Psi)^2$).

In this paper we explore the use of Born-rule-based probabilities for clustering, feature selection, classification, and for comparison between sets. We show how these probabilities lead to existing and new algebraic algorithms for which no other complete probabilistic justification is known.

Keywords: Spectral methods, unsupervised feature selection, set-to-set similarity measures.

1 Introduction

In this work we form a connection between spectral theory and probability theory. Spectral theory is a powerful tool for studying matrices and graphs and is often used in machine learning. However, its connections to the statistical tools employed in this field are ad hoc and domain specific. In this work we connect the two by examining the basic probability rule of Quantum Mechanics: the Born rule.

Consider an affinity matrix $A = [a_{ij}]$, where $a_{ij}$ is the similarity between points $i$ and $j$ in some dataset. One can obtain an effective clustering of the data points by simply thresholding

∗L. Wolf is with The Center for Biological and Computational Learning, Massachusetts Institute of Technology, Cambridge, MA 02138. Email: [email protected].

the first eigenvector of this affinity matrix. Most previously suggested justifications of spectral clustering fail to explain this: the view of spectral clustering as an approximation of the normalized minimal graph cut problem requires the use of the Laplacian; the view of spectral clustering as the infinite limit of a stochastic walk on a graph requires a doubly stochastic matrix; other approaches can only explain this when the affinity matrix is approximately block diagonal.

In this work we show that by modeling class membership using the Born rule one attains the above spectral clustering algorithm as the two-class clustering algorithm. Moreover, let $v_1 = [v_1(1), v_1(2), \ldots]^\top$ be the first eigenvector of $A$, with an eigenvalue of $\lambda_1$. We show that according to this model the probability of point $j$ belonging to the dominant cluster is given by $\lambda_1 v_1(j)^2$. This result is detailed in Section 4, together with a justification of one of the most popular multiple-class spectral-clustering algorithms. In that section, which is at the heart of our paper, we also justify a successful feature selection algorithm known as the Q − α algorithm, derive similarity functions between sets and related retrieval algorithms, and suggest a very simple plug-in classification algorithm.

In Quantum Mechanics the Born rule is usually taken as one of the axioms. However, this rule has well-established foundations. Gleason's theorem states that the Born rule is the only consistent probability rule for a Hilbert-space structure. Wootters was the first to note the intriguing observation that by using the Born rule as a probability rule, the natural Euclidean metric on a Hilbert space coincides with a natural notion of a statistical distance.

Recently, attempts were made to derive this rule with even fewer assumptions by showing it to be the only possible rule that allows certain invariants in the quantum model. We will briefly review these justifications of the Born rule in Section 3. Physics-based methods have been used to solve learning and optimization problems for a long time. Simulated annealing, the use of statistical mechanics models in neural networks, and heat-equation-based kernels are some examples. However, we would like to stress that everything we say in this paper has grounds that are independent of any physical model. From a statistician's point of view, our paper can be viewed as a description of learning methods that use the Born rule as a plug-in estimator.

2 The Quantum Probability Model

The quantum probability model takes place in a Hilbert space H of finite or infinite dimension¹. A state is represented by a positive definite linear mapping (a matrix $\rho$) from this space to itself, which has a trace of 1, i.e., $\forall \Psi \in H,\ \Psi^\top \rho \Psi \ge 0$ and $\mathrm{Tr}(\rho) = 1$. Such a mapping $\rho$ is self adjoint ($\rho^\top = \rho$, where throughout this paper $^\top$ denotes the complex conjugate transpose) and is called a density matrix.

Since $\rho$ is self adjoint its eigenvectors $\Phi_i$ are orthonormal ($\Phi_i^\top \Phi_j = \delta_{ij}$), and since it is positive definite its eigenvalues $p_i$ are real and positive, $p_i \ge 0$. The trace of a matrix is equal to the sum of its eigenvalues and so $\sum_i p_i = 1$.

The equality $\rho = \sum_i p_i \Phi_i \Phi_i^\top$ is interpreted as "the system is in state $\Phi_i$ with probability $p_i$".

The state $\rho$ is called pure if $\exists i$ s.t. $p_i = 1$. In this case, $\rho = \Psi\Psi^\top$ for some normalized state vector $\Psi$, and the system is said to be in state $\Psi$. Note that the representation of a mixed (not pure) state as a mixture of states with probabilities is not unique.

A measurement M with an outcome $x$ in some set $X$ is represented by a collection of positive definite matrices $\{m_x\}_{x\in X}$ such that $\sum_{x\in X} m_x = \mathbf{1}$ ($\mathbf{1}$ being the identity operator in H). Applying a measurement M to state $\rho$ produces outcome $x$ with probability

$p_x(\rho) = \mathrm{Tr}(\rho m_x).$   (1)

Eq. 1 is the Born rule. Most quantum models deal with a more restrictive type of measurement called the von Neumann measurement, which involves a set of projection operators $m_a = aa^\top$ for which $a^\top a' = \delta_{aa'}$. As before, $\sum_{a\in M} aa^\top = \mathbf{1}$. For this type of measurement the Born rule takes a simpler form: $p_a(\rho) = \mathrm{Tr}(\rho aa^\top) = \mathrm{Tr}(a^\top \rho a) = a^\top \rho a$. Assuming $\rho$ is a pure state $\rho = \Psi\Psi^\top$, this can be simplified further to:

$p_a(\rho) = (a^\top \Psi)^2.$   (2)

¹ The results described in this paper hold for Hilbert spaces (complex vector spaces) as well as for real vector spaces.

Since our algorithms will require the recovery of the parameters of unknown distributions, we will use the simpler form (Eq. 2). This keeps the number of parameters we fit minimal.
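To make Eq. 1 and Eq. 2 concrete, here is a minimal numerical sketch (our illustration, not part of the original text) that computes Born-rule probabilities for a von Neumann measurement given by an orthonormal basis; the variable names and the random data are ours.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4

# A pure state: a random norm-1 vector Psi, with density matrix rho = Psi Psi^T.
Psi = rng.standard_normal(d)
Psi /= np.linalg.norm(Psi)
rho = np.outer(Psi, Psi)

# A von Neumann measurement: an orthonormal basis {a_i} (a resolution of the identity).
A, _ = np.linalg.qr(rng.standard_normal((d, d)))  # columns are the measurement vectors

# Born rule, general form (Eq. 1): p_i = Tr(rho m_i) with m_i = a_i a_i^T.
p_general = np.array([np.trace(rho @ np.outer(A[:, i], A[:, i])) for i in range(d)])

# Born rule, pure-state form (Eq. 2): p_i = (a_i^T Psi)^2.
p_pure = (A.T @ Psi) ** 2

assert np.allclose(p_general, p_pure)
assert np.isclose(p_general.sum(), 1.0)  # the m_i sum to the identity, so probabilities sum to 1
print(p_pure)
```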

3 Whence the Born rule?

The Born rule has an extremely simple form that is convenient to handle, but why should it be considered justified as a probabilistic model for a broad spectrum of data? In quantum physics the Born rule is one of the axioms, and it is essential as a link between deterministic dynamics

of the states and the probabilistic outcomes. It turns out that this rule cannot be replaced by other rules, as any other probabilistic rule would not be consistent. Next we will describe three existing approaches for deriving the Born rule², and interpret their relation to learning problems.

Assigning probabilities when learning with vectors. Gleason's theorem [12] derives the Born rule by assuming the Hilbert-space structure of observables (what we are trying to measure). In this structure, each orthonormal basis of H corresponds to a mutually exclusive set of results of a given measurement.

Theorem 1 (Gleason's theorem) Let H be a Hilbert space of dimension greater than 2. Let $f$ be a function mapping the one-dimensional projections on H to the interval [0,1] such that for each orthonormal basis $\{\Psi_k\}$ of H it holds that $\sum_k f(\Psi_k\Psi_k^\top) = 1$. Then there exists a unique density matrix $\rho$ such that for each $\Psi \in H$, $f(\Psi\Psi^\top) = \Psi^\top\rho\Psi$.

By this theorem the Born rule is the only rule that can assign probabilities to all the measurements of a state (each measurement is given by a resolution of the identity matrix), such that these probabilities rely only on the density matrix of that specific state. To be more concrete: we are given examples (data points) as vectors, and we are asked to assign them to clusters/classes. Naturally, we choose to represent each class with a single vector (in the simplest case). We do not want the clusters/classes to overlap; otherwise, in almost any optimization framework we would get the most prominent cluster repeatedly. The simplest constraint to add is an orthogonality constraint on the unknown vector representations of the clusters. Having this simple model and simple constraint, the question arises: what probability rule can be used? Gleason's theorem suggests that there is only one possible rule under the constraint that all the probabilities must sum to one.

² This will be done briefly so as not to repeat already published work. The reader is referred to the more recent references for more details.

One may wonder why it is that our simple vector-based model above and the world of quantum mechanics result in the use of the same probability rule. There is nothing mystical about this. At the beginning of the last century physicists represented both states and observations as vectors. The only probability rule suitable for models that use vectors both as points and as operators is the Born rule.

Gleason's theorem is very simple and appealing, and it is very powerful as a justification for the use of the Born rule [2]. However, its assumptions are somewhat restrictive: it assumes that the algorithm that assigns probabilities to measurements on a state has to assign probabilities to all possible von Neumann measurements. Some quantum mechanical approaches, as well as our use of the Born rule in machine learning, do not require that probabilities be defined for all resolutions of the identity.

Axiomatic approaches. Recently a new line of inquiry has emerged that tries to justify the Born rule from the axioms of decision theory. The first work in this direction was done by Deutsch [9]. An attempt was made to replace the probabilistic axioms of quantum theory with the non-probabilistic part of classical decision theory. While Barnum et al. [2] identified an ambiguity in Deutsch's notation, Wallace [28] resolved that ambiguity and suggested several alternative derivations of the Born rule from decision theories. Lately, Saunders [21] derived the Born rule from "operational assumptions."

Saunders considers multiple-channel experiments. These are experiments that can be repeated such that at each repetition any sub-group of the channels can be blocked. In the special case where all but one channel are blocked, the outcome of the experiment is deterministic in the sense that if there is an outcome it is always the same one. It is assumed that for every possible outcome of the experiment there is a channel for which it is deterministic. The details can be found in appendix A (attached as a separate file).

Statistical approach. Another approach for justifying the Born rule, which is different from the axiomatic one above, is the approach taken by Wootters [33]. Wootters defines a natural notion of statistical distance between two states that is based on statistical distinguishability. Assume a measurement with N outcomes. For some state the probability of each outcome is given by $p_i$. By the central limit theorem, two states are distinguishable if for large $n$ the quantity $\frac{\sqrt{n}}{2}\sqrt{\sum_{i=1}^N (\delta p_i)^2/p_i}$ is larger than 1, where $\delta p_i$ is the difference in their probabilities for outcome $i$. If they are indistinguishable then the frequencies of outcomes we get from the two states by repeating the measurement are not larger than the typical variation for one of the states. The distance between states $\Psi$ and $\Psi'$ is simply the length of the shortest chain of states that connects the two such that every two consecutive states in this chain are indistinguishable. Wootters shows that if we were to demand from nature that the statistical distance between states be proportional to the angle between their vector representations, then we would get the Born rule as the resulting probability rule.

The implication of this result for machine learning is far reaching. Often, given two vectors in a real or complex Hilbert space, we use the distance between the vectors as a similarity measure. This is very natural since the angle is the only Riemannian metric, up to a constant factor, which is invariant to all unitary transformations. Sometimes we would like to derive relations which are more complex than just distances, for example, measures of distances between sets. Statistics and probabilities are a natural way to represent such relations, and the Born rule is a probabilistic rule which conforms with the natural metric between vectors.

4 Application of the Born Rule to Machine Learning Problems

We consider several applications. First, we consider clustering, and show that applying the Born rule to clustering leads naturally to a spectral clustering algorithm. Interestingly, Horn and Gottlieb [15] have developed a clustering method based on intuitions derived from Quantum Mechanics. Their method, however, is based on the Schrödinger potential equation and is different from ours.

Second, we extend the scoring function used for clustering, so that we can ask: what is a good kernel function? We use this score to justify an effective feature selection algorithm.

Third, we derive set-to-set similarities that provide a measure of how distinguishable two sets of norm-1 vectors are. We then derive algorithms that use these measures for the retrieval task. As the last application of the Born rule, we derive a simple two-class classification engine. To avoid confusion, we will mainly use machine learning concepts and not physics-based concepts. The probabilities will be derived from the Born rule directly, without giving them any quantum physics interpretation.

4.1 Clustering

Two class clustering. We are given a set of $n$ norm-1 input samples $\{\Phi_j\}_{j=1}^n$ that we would like to cluster into $k$ clusters. We model the probability of a point $\Phi$ to belong to cluster $i$ as $p(i|\Phi) = a_i^\top \Phi\Phi^\top a_i$, where $a_i^\top a_j = \delta_{ij}$, i.e., we use our input vectors as state vectors and a von Neumann measurement to determine cluster membership.

We would like to maximize the empirical expectation of our input points belonging to any of the clusters. Hence, we maximize $L(\{a_i\}_{i=1}^k) = \frac{1}{n}\sum_{j=1}^n\sum_{i=1}^k p(i|\Phi_j) = \sum_{i=1}^k a_i^\top\Gamma_\Phi a_i$, where $\Gamma_\Phi = \frac{1}{n}\sum_{j=1}^n \Phi_j\Phi_j^\top$. $L(\{a_i\}_{i=1}^k)$ is maximized when $\{a_i\}_{i=1}^k$ are taken to be the first $k$ eigenvectors of $\Gamma_\Phi$ [13]. This maximization is only defined up to multiplication by a $k \times k$ unitary matrix.

Consider the singular value decomposition of the matrix $\phi = [\Phi_1|\Phi_2|\ldots|\Phi_n] = USV^\top$, where the columns of $U$ are the eigenvectors of $\phi\phi^\top$, the matrix $S$ is a diagonal matrix constructed from the sorted square roots of the eigenvalues of $\phi\phi^\top$, and the matrix $V$ contains the eigenvectors of $\phi^\top\phi$. Let $u_i$ ($v_i$) denote the $i$th column of $U$ ($V$) and let $s_i$ denote the $i$th element along the diagonal of $S$. The following equality holds for all $i$ not exceeding the dimensions of $\phi$: $s_i v_i = \phi^\top u_i$.

If we are only interested in bipartite clustering we can use only the first eigenvector and get $p(1|\Phi_j) = (u_1^\top\Phi_j)^2 = (s_1 v_1(j))^2$, $v_i(j)$ being the $j$th element of the vector $v_i$. Hence, the probability of belonging to the first cluster is proportional to the square of the corresponding entry of the first eigenvector of the data's affinity matrix $\phi^\top\phi$. A bipartite clustering is attained by thresholding this value. This is in agreement with most spectral clustering methods for bipartite clustering.
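The following is a small numerical sketch (ours, not the paper's implementation) of the bipartite clustering just described; the threshold choice and the toy data are our own assumptions.

```python
import numpy as np

def born_bipartite_clustering(phi):
    """Sketch of the two-class clustering described above (our code, not the author's).

    phi : d x n array whose columns are norm-1 input samples Phi_j.
    Returns p(1|Phi_j) = (s_1 v_1(j))^2 and a binary labeling obtained by
    thresholding these probabilities (the threshold choice is ours).
    """
    # SVD of phi: columns of U are eigenvectors of phi phi^T, rows of Vt are
    # eigenvectors of the affinity matrix phi^T phi, s are singular values.
    U, s, Vt = np.linalg.svd(phi, full_matrices=False)
    v1 = Vt[0]
    p_cluster1 = (s[0] * v1) ** 2      # Born-rule probability of the dominant cluster
    labels = (p_cluster1 > np.median(p_cluster1)).astype(int)
    return p_cluster1, labels

# Toy usage with non-negative features, so the first eigenvector has a single sign.
rng = np.random.default_rng(1)
a = np.abs(rng.standard_normal((5, 1))) + 0.1 * np.abs(rng.standard_normal((5, 20)))
b = np.abs(rng.standard_normal((5, 1))) + 0.1 * np.abs(rng.standard_normal((5, 20)))
phi = np.concatenate([a, b], axis=1)
phi /= np.linalg.norm(phi, axis=0, keepdims=True)   # enforce norm-1 samples
probs, labels = born_bipartite_clustering(phi)
```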

Several points need to be taken into consideration:

1. We did not normalize the affinity matrix, e.g., use the Laplacian $L$ as some spectral clustering methods do instead of the original affinity matrix. We prefer to analyze this case, which is less understood but works well in practice. The case of the Laplacian is similar, with some modifications to take into account that the first eigenvector of the matrix $cI - L$ ($c \in \mathbb{R}^+$) is a multiple of the vector of all ones.

2. We assume that the norm of each input point is 1. Spectral clustering is usually done using Gaussian or heat kernels [18, 4]. For these kernels the norm of each input point in the feature space (the vector space in which the kernel is just a dot product [22]) is 1. A more complete treatment of this issue is given in Section 4.2.

3. Some spectral clustering methods [29, 20] use a threshold on the value of $v_1(j)$ and not on its square. However, by the Perron-Frobenius theorem, for kernels with non-negative values (e.g., Gaussian kernels) the first eigenvector of the affinity matrix is a same-sign vector.

4. A related remark is that in our model the cluster membership probability of the point $\Phi$ is the same as that of the point $-\Phi$. In fact, it might be more appropriate to consider rays of the form $c\Phi$, $c$ being some scalar. The norm-1 vector is just the representation of this ray. This distinction between rays and norm-1 vectors does not make much difference when considering kernels with non-negative values – all the vectors in the feature space

are aligned to have a positive dot product.

Note that the use of the Born rule to derive the clustering algorithm is more than just a justification for the same algorithm many are using. In addition to binary class memberships, we get a probability for each point to belong to each of the two clusters. However, as in many spectral clustering approaches, the situation is less appealing when considering more than two clusters. In that case, the fact that the solution of the maximization problem above is given only up to multiplication by a unitary matrix induces an inherent ambiguity.

Multiple class clustering. Having a probability model, the most natural way to perform clustering would be to apply model-based clustering. Following the spirit of [35], we base our model-based clustering on affinities between every two data points that are derived from our probability model. To do so, we will be interested not in the probability of the cluster membership given the data point, but in the probability of the data point being generated by a given cluster [35], i.e., two points are likely to belong to the same cluster if they have the same profile of cluster membership. This is given by the Bayes rule as $p(\Phi|i) = \frac{p(i|\Phi)p(\Phi)}{p(i)}$. We normalize the cluster membership probabilities such that the probability of generating each point by the $k$ clusters is one – otherwise the scale produced by the probability of generating each point would harm the clustering; for example, all points with small $p(\Phi)$ would end up in one cluster. Each data point is therefore represented by the vector $q$ of $k$ elements, where element $i$ is $q(i) = \frac{p(i|\Phi)/p(i)}{\sum_{j=1}^k p(j|\Phi)/p(j)}$.

The probability of cluster $i$ membership given the point $\Phi_j$ is given by $p(i|\Phi_j) = (u_i^\top\Phi_j)^2 = (s_i v_i(j))^2$. The prior probability of cluster $i$ is estimated by $p(i) \cong \int p(i,\Phi)\,d\sigma(\Phi) = \int p(i|\Phi)p(\Phi)\,d\sigma(\Phi) \cong \frac{1}{n}\sum_{j=1}^n p(i|\Phi_j) = u_i^\top\Gamma_\Phi u_i = s_i^2$.

Hence the elements of the vector $q_j$, which represent a normalized version of the probabilities of the point $\Phi_j$ being generated from the $k$ clusters, are estimated as:

$q_j(i) = \frac{n(s_i v_i(j))^2/s_i^2}{\sum_{l=1}^k n(s_l v_l(j))^2/s_l^2} = \frac{v_i(j)^2}{\sum_{l=1}^k v_l(j)^2}.$

Were it not for the problem of an unknown unitary transformation ("rotation"), we could compare two such vectors by using some sort of affinity between probabilities. For example, we may use the affinity related to the Hellinger distance: $\mathrm{affinity}(q_j, q_{j'}) = \sum_i \sqrt{q_j(i)}\,\sqrt{q_{j'}(i)}$. However, this is not invariant to a unitary transformation of the singular vectors of $\phi$.
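As a concrete illustration (our sketch, not the paper's code), the membership profiles $q_j$ and the Hellinger affinity above can be computed as follows; the function names and random data are our own.

```python
import numpy as np

def cluster_membership_profiles(phi, k):
    """Sketch (ours) of the normalized membership vectors q_j described above.

    phi : d x n array of norm-1 samples; k : number of clusters.
    Returns an n x k array whose j-th row is q_j(i) = v_i(j)^2 / sum_l v_l(j)^2.
    """
    _, _, Vt = np.linalg.svd(phi, full_matrices=False)
    V = Vt[:k].T                       # n x k: first k eigenvectors of phi^T phi
    q = V ** 2
    q /= q.sum(axis=1, keepdims=True)  # normalize so each row sums to one
    return q

def hellinger_affinity(q_a, q_b):
    """Affinity related to the Hellinger distance: sum_i sqrt(q_a(i)) sqrt(q_b(i))."""
    return np.sum(np.sqrt(q_a) * np.sqrt(q_b))

# Usage: affinity between the membership profiles of points 0 and 1.
rng = np.random.default_rng(2)
phi = rng.random((6, 30))
phi /= np.linalg.norm(phi, axis=0, keepdims=True)
Q = cluster_membership_profiles(phi, k=3)
print(hellinger_affinity(Q[0], Q[1]))
```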

The NJW algorithm [18] is one of the most popular spectral clustering algorithms. It is very successful in practice [27], and is proven to be optimal in some ideal cases. The NJW algorithm considers the vectors $r_j(i) = \frac{v_i(j)}{(\sum_{l=1}^k v_l(j)^2)^{1/2}}$. The difference from the point-wise square root of the vectors $q_j$ above is that the numerator can have both positive and negative values. The Hellinger affinity above would differ from the dot product $r_j^\top r_{j'}$ if for some coordinate $i$ the sign of $r_j(i)$ is different from the sign of $r_{j'}(i)$; in this case the dot product will always be lower than the Hellinger affinity.

The NJW algorithm clusters the $r_j$ representation of the data points using k-means clustering in $R^k$. NJW therefore finds clusters ($C_i$, $i = 1..k$) and cluster centers ($c_i$) so as to minimize the sum of squared distances $\sum_{i=1}^k\sum_{j\in C_i}\|c_i - r_j\|_2^2$. This clustering measure is invariant to the choice of basis of the space spanned by the first $k$ eigenvectors of $\phi^\top\phi$. To see this it is enough to consider distances of the form $\|(\sum_i\alpha_i r_i) - r_j\|_2^2$, because the $c_i$'s chosen by the k-means algorithm are linear combinations of the vectors $r_i$ in a specific cluster. Assume that the subspace spanned by the first $k$ eigenvectors of $\phi^\top\phi$ is rotated by some $k\times k$ unitary transformation $O$. Then each vector $r_j$ would be rotated by $O$ and become $Or_j$ – there is no need to renormalize since $O$ preserves norms. $O$ preserves distances as well, and the $\alpha_i$'s do not depend on $O$, so $\|(\sum_i\alpha_i r_i) - r_j\|_2^2 = \|(\sum_i\alpha_i O r_i) - O r_j\|_2^2$.

For simplicity, allow us to remove the cluster center, which serves as a proxy in the k-means algorithm, and look for the cluster assignments which minimize $\sum_{i=1}^k\sum_{j,j'\in C_i}\|r_j - r_{j'}\|_2^2$. The Hellinger distance $D^2(q_j, q_{j'}) = \sum_i(\sqrt{q_j(i)} - \sqrt{q_{j'}(i)})^2$ is always lower than the squared $L_2$ distance between $r_j$ and $r_{j'}$, and hence the above criterion bounds from above the natural clustering score $\sum_{i=1}^k\sum_{j,j'\in C_i} D^2(q_j, q_{j'})$. Note that in most cases this bound is tight, since points in the same cluster have similar vector representations. This is especially the case for kernels in which all the vectors in the feature space are aligned to have positive dot products.
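For reference, a minimal sketch (ours) of the NJW-style step discussed above, without the affinity normalization addressed in remark 3 below; the use of scikit-learn's KMeans is an assumption on our part.

```python
import numpy as np
from sklearn.cluster import KMeans  # assumption: scikit-learn is available

def njw_style_clustering(phi, k):
    """Sketch (ours): map each point j to r_j(i) = v_i(j) / sqrt(sum_l v_l(j)^2)
    and cluster the r_j vectors with k-means, as in the discussion above.

    phi : d x n array of norm-1 samples; k : number of clusters.
    """
    _, _, Vt = np.linalg.svd(phi, full_matrices=False)
    R = Vt[:k].T.copy()                            # n x k matrix of eigenvector entries
    R /= np.linalg.norm(R, axis=1, keepdims=True)  # row normalization
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(R)

# Usage on the toy phi from the previous sketch:
# labels = njw_style_clustering(phi, k=3)
```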

10 Finally, it is worth considering the following points regarding the proposed multiple class clustering algorithm:

1. To remove any doubt, our goal in this section is not to create a new clustering algorithm

but to give a more complete statistical explanation of the success of existing spectral clustering techniques. (This is true of the rest of this work as well: we prefer not to propose new algorithms, but to justify existing algorithms which were considered heuristic.)

2. The Kullback-Leibler (KL) divergence and the Hellinger distance are the most commonly used similarity measures between distributions. Although the KL divergence has been promoted in the machine learning community by many researchers because of its relation to entropy, the Hellinger distance plays a role just as important in the statistics community [10]. Our reasons for using this distance are our wish to derive the NJW algorithm (see the previous remark) and its other nice properties, such as its symmetry. It is also interesting to note that if two probability distributions are bounded from above, and from below by a constant larger than zero, then the Hellinger distance and the KL divergence are within a constant factor of each other [10].

3. In the original NJW algorithm [18], the affinity matrix is normalized such that instead of the kernel $A = \phi^\top\phi$, the normalized version $D^{-1/2}AD^{-1/2}$ is used, where $D$ is a diagonal matrix which holds the sums of the rows of $A$ in its diagonal. While from our practitioner's point of view this modification is a mere normalization of the affinity matrix (see Y. Weiss' tutorial on spectral clustering given at NIPS 2002), others might consider it more important. From the point of view of [4] this step is crucial because the normalized affinity matrix has the same eigenvectors as its Laplacian: $D^{-1/2}LD^{-1/2} = D^{-1/2}(D-A)D^{-1/2} = I - D^{-1/2}AD^{-1/2}$, and the last matrix has the same eigenvectors as $D^{-1/2}AD^{-1/2}$ (in reverse order of eigenvalues). This normalization can be inserted into our framework, and in fact gives another insight into our interpretation of the NJW algorithm. The use of $D^{-1/2}AD^{-1/2}$ is similar to transforming the points from $\phi$ to $\phi D^{-1/2}$. Consider kernels with non-negative values, and assume that we build a model from the single point $\Phi_i$. Let $d_i$ be the $i$-th element along the diagonal of the matrix $D$, $d_i = \sum_j \Phi_i^\top\Phi_j$. This is exactly the Hellinger affinity between the distribution of points $j = 1..n$ being in the cluster of that one point, $(\Phi_i^\top\Phi_j)^2$, and the uniform distribution (note that the first distribution should be normalized by $\max_i d_i^2$). Hence, normalizing point $\Phi_i$ by $1/\sqrt{d_i}$ is very similar to normalizing by the norm when using least squares methods. The difference is that here, instead of the norm (distance from zero), we normalize by the Hellinger affinity to the uniform distribution over the data points. Since the Hellinger affinity is the one we use in our algorithm, and the uniform distribution is a natural baseline, this seems very reasonable.

4.2 Feature selection

We next deal with the problem of feature selection. Our goal in this section is to give probabilistic grounds to the success of the Q − α algorithm [31]. Q − α is a very effective unsupervised variable weighting algorithm (see also [23, 32]) which was designed based upon intuitions from spectral approaches. Here we will show a constructive way of deriving the score function used in that algorithm. In order to do so, we will first generalize our likelihood model by incorporating priors into it.

Incorporating the class-based likelihood into the total likelihood. Remember that in the previous section we first defined the probability of a point $\Phi$ to belong to cluster $i$ as $p(i|\Phi) = a_i^\top\Phi\Phi^\top a_i$. We maximized the empirical expectation of our input points to belong to any of the clusters, which we gave as $L(\{a_i\}_{i=1}^k) = \frac{1}{n}\sum_{j=1}^n\sum_{i=1}^k p(i|\Phi_j) = \sum_{i=1}^k a_i^\top\Gamma_\Phi a_i$. In this simple model we did not use priors on the class membership.

For the multiple class case we estimated these priors as $p(i) = a_i^\top\Gamma_\Phi a_i$. In fact we could have used a slightly more complicated model for the two-class clustering as well, and inserted this prior into our likelihood. Thus, we can redefine the cluster membership probability: $\hat p(i|\Phi) = p(i)p(i|\Phi) = (a_i^\top\Gamma_\Phi a_i)(a_i^\top\Phi\Phi^\top a_i)$. The likelihood function now becomes: $\hat L(\{a_i\}_{i=1}^k) = \frac{1}{n}\sum_{j=1}^n\sum_{i=1}^k \hat p(i|\Phi_j) = \sum_{i=1}^k (a_i^\top\Gamma_\Phi a_i)(a_i^\top\Gamma_\Phi a_i) \le \sum_{i=1}^k a_i^\top\Gamma_\Phi\Gamma_\Phi a_i$.

Since the eigenvectors of a matrix (e.g., $\Gamma_\Phi$) and of its square ($\Gamma_\Phi\Gamma_\Phi$) are the same, and since the inequality becomes an equality when the $a_i$'s are the eigenvectors of $\Gamma_\Phi$, this does not change much of the result of the two-class case (for the multiple class case we already used the priors)³. The major difference is that the probability of a point $\Phi_j$ belonging to the first cluster becomes $p(1|\Phi_j) = (u_1^\top\Gamma_\Phi u_1)(u_1^\top\Phi_j)^2 = s_1^4 v_1(j)^2$ (here, as before, $u_1$, $v_1$ and $s_1$ are given by the singular value decomposition of $\phi = [\Phi_1|\Phi_2|\ldots|\Phi_n]$, and are the first eigenvectors of $\phi\phi^\top$ and $\phi^\top\phi$, and the square root of the eigenvalue associated with these eigenvectors, respectively).

Dealing with points of any norm. To make our methods more flexible, let us add another prior – a prior on the points. The main goal of this prior is to allow us to deal with points that do not have a norm of one. We suggest using the prior $\hat p(\Phi) = \Phi^\top\Phi = \|\Phi\|^2$. Using this prior, each point $\Phi$ is viewed as the point $\Phi/\|\Phi\|$ with an importance proportional to the square of the norm of the point.

The point-based prior allows us to apply the Born rule, as is, to points of any norm. This is because⁴ $\bar p(a_i|\Phi) = p(\Phi)\,p(a_i\,|\,\Phi/\|\Phi\|) = (\Phi^\top\Phi)\frac{(a_i^\top\Phi)^2}{(\Phi^\top\Phi)} = (a_i^\top\Phi)^2$. Thus the form of the probability rule for points of any norm is the same as the Born rule (which is originally limited to norm-1 vectors). This means that using this prior all our results above hold for vectors of any norm. Note that the point-based prior is self consistent in the following sense: the sum, over any resolution of the identity $\{a_i\}$, of the probabilities $\bar p(a_i|\Phi)$ is just the prior $p(\Phi)$:

$\sum_i \bar p(a_i|\Phi) = \sum_i (a_i^\top\Phi)^2 = \Phi^\top\Big(\sum_i a_i a_i^\top\Big)\Phi = \Phi^\top\Phi.$
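A tiny numeric check (ours) of this self-consistency property, under the assumption of a random point and a random orthonormal resolution of the identity:

```python
import numpy as np

rng = np.random.default_rng(3)
d = 5
Phi = rng.standard_normal(d) * 3.0                # a point of arbitrary norm
A, _ = np.linalg.qr(rng.standard_normal((d, d)))  # columns form a resolution of the identity

p_bar = (A.T @ Phi) ** 2                          # \bar p(a_i | Phi) = (a_i^T Phi)^2
assert np.isclose(p_bar.sum(), Phi @ Phi)         # sums to the prior p(Phi) = ||Phi||^2
```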

Feature selection. While in Section 4.1 we used the likelihood to optimize for the model of each class (simply put: the vectors $a_i$, $i = 1..k$), we can use the likelihood $\hat L$ as a scoring function of the quality of our data, i.e., if $\max_{\{a_i\}} \hat L(\{a_i\}_{i=1}^k) = \sum_{i=1}^k \sigma_i^4$ is large then we know that our data can be clustered well by the Born-rule-based spectral-clustering techniques.

³ If the reader believes that we should have started with a derivation that contains the per-class prior for the two-class clustering case, please write me a note. Our aim was to make the first result as simple as possible. In QM our use of priors over our vector model seems to match well with the construction of observables from simple measurements.
⁴ We abuse the notation a bit so that the relation between p(i|Φ) and $a_i$ is more clear.

This allows us to "improve" our data⁵. In the feature selection case we would like to select a subset of the variables such that the resulting data would have a large score, i.e., if we have $l$ variables, we would like to find a binary vector $\alpha$ of $l$ entries such that the sum of the fourth powers of the singular values of $\mathrm{diag}(\alpha)\phi$ is maximized. The diagonal matrix $\mathrm{diag}(\alpha)$ determines which variables (rows of $\phi$) are selected.

Optimizing that score over all possible assignments of $\alpha$ is a problem of exponential complexity. The Q − α algorithm circumvents this problem by assigning a positive weight $\sqrt{\alpha_i}$ to each one of the variables, where $0 \le \alpha_i \le 1$. The optimization problem then becomes:

$\max_\alpha \sum_{i=1}^k \sigma_i^4 = \max_\alpha \sum_{i=1}^k \big(v_i^{\alpha\top} \phi^\top D(\alpha)\phi\, v_i^\alpha\big)^2 = \max_{\alpha,Q} \sum_{i=1}^k \big(q_i^\top \phi^\top D(\alpha)\phi\, q_i\big)^2 = \max_{\alpha,Q} \mathrm{Tr}\big(Q^\top(\phi^\top D(\alpha)\phi)(\phi^\top D(\alpha)\phi)Q\big),$

where $V_\alpha = [v_1^\alpha|\ldots|v_k^\alpha]$ is the matrix of the eigenvectors of $\phi^\top D(\alpha)\phi$, and where $Q = [q_1|\ldots|q_k]$ is constrained to be an $n$ (the number of vectors $\Phi_j$) by $k$ (the number of clusters) matrix with orthonormal columns. The above score depends on the scale of $\alpha$, and so the norm of $\alpha$ is restricted to be one. Note that no restrictions are put on $\alpha$ to make it a vector of positive entries. Still, the vector $\alpha$ which maximizes the score is very likely to have all entries of the same sign. This property, along with arguments predicting sparsity of the entries of $\alpha$, can be found in [31].

The optimization of the score $\max_{\alpha,Q}\mathrm{Tr}\big(Q^\top(\phi^\top D(\alpha)\phi)(\phi^\top D(\alpha)\phi)Q\big)$ is carried out using an iterative algorithm which interweaves the computation of $\alpha$ and improvements of the matrix $Q$. Details can be found in [31].

⁵ It would be interesting to compare the simple score of data quality we derived here to the one that can be derived using Gaussian Processes [16]. If we consider a Gaussian Process with a kernel $C$ of size $n\times n$, then the most likely orthogonal set of $k$ data points would be the first $k$ eigenvectors of $C$. This can be used to give a partial explanation for the success of spectral clustering. The likelihood of these $k$ data points (given as columns of the matrix $t$) would be $L = -\frac{1}{2}\log\det C - \frac{1}{2}\mathrm{Tr}(t^\top C^{-1}t) - \frac{n}{2}\log 2\pi = -\log\prod_{i=1}^n\sigma_i - \frac{1}{2}\sum_{i=1}^k 1/\sigma_i^2 - \frac{n}{2}\log 2\pi$. Likelihoods of this sort are often optimized in the Gaussian Process literature, but only over a few parameters. This is done in the supervised case, where $t$ is the actual data, and the missing parameter is, for example, the width of the Gaussian. It is by far less tractable to optimize than our simple score.

In case $k$ is unknown, we may choose to assume it equals $n$. This is because in our score smaller singular values contribute much less than the larger ones. In the case where $k = n$ there is no need to optimize for the eigenvectors $Q$ and the score simply becomes $\mathrm{Tr}\big(\phi^\top D(\alpha)\phi\,\phi^\top D(\alpha)\phi\big) = \sum_{i,j}\alpha_i\alpha_j (f_i^\top f_j)^2$, where $f_i$ denotes the $i$-th row of $\phi$ (the $i$-th variable). This bilinear problem is optimized when $\alpha$ is the first eigenvector of the matrix whose $(i,j)$-th element is $(f_i^\top f_j)^2$.

The effect of the choice of taking $k = n$ is demonstrated on a synthetic dataset in Fig. 1. In this experiment, the first three variables were picked from four multivariate normal distributions with diagonal covariance matrices. The remaining 200 variables are selected from the same distributions but were each permuted independently. In this way, the remaining 200 features give no information as to the underlying cluster from which the data points stem. As can be seen, the assignment of weights by taking the first principal component of the data is uninformative for feature selection purposes. When $k$ is selected appropriately, the Q − α algorithm assigns very high weights to the relevant variables, and much lower weights to the rest. When $k$ is set to its maximum possible value, Q − α still successfully detects the relevant variables, but assigns somewhat higher weights to the irrelevant variables. We use a similar setting in Fig. 2, where we take only two clusters (to remove clutter) to demonstrate that, although the Q − α algorithm is "merely" a linear weighting algorithm, and although it has relations to the PCA algorithm, applying Q − α as a preprocessing step before PCA can prove to be very important⁶.
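The non-iterative $k = n$ case just described admits a very short implementation. The following is a minimal sketch (ours); the normalization of the feature rows follows the remark later in this section, and the function name and random data are our own assumptions.

```python
import numpy as np

def q_alpha_weights_k_eq_n(phi):
    """Sketch (ours) of the non-iterative k = n variant of Q-alpha described above.

    phi : l x n array (l variables/features as rows, n examples as columns).
    Returns one weight per variable: the leading eigenvector of the matrix
    whose (i, j) entry is (f_i^T f_j)^2, where f_i is the i-th row of phi.
    """
    F = phi / np.linalg.norm(phi, axis=1, keepdims=True)  # normalize each feature row
    M = (F @ F.T) ** 2                                    # (i, j) entry: (f_i^T f_j)^2
    eigvals, eigvecs = np.linalg.eigh(M)
    alpha = eigvecs[:, -1]                                # eigenvector of the largest eigenvalue
    if alpha.sum() < 0:                                   # fix the sign ambiguity of eigenvectors
        alpha = -alpha
    return alpha

# Usage: rank the variables of a random l x n data matrix by their Q-alpha weight.
rng = np.random.default_rng(4)
weights = q_alpha_weights_k_eq_n(rng.standard_normal((30, 100)))
ranking = np.argsort(-weights)
```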

An important remark about our score function: assume we would like to use the likelihood $L$ (the one that does not contain priors) in order to derive a feature selection scheme. Our score would become $\max_{\{a_i\}} L(\{a_i\}_{i=1}^k) = \sum_{i=1}^k \sigma_i^2 = \mathrm{Tr}(\phi^\top\phi)$, where the last equality is due to the equality between the trace of a matrix and the sum of its eigenvalues. However, kernels (e.g., $\phi^\top\phi$) with a large trace are poor in terms of generalization [7]. Generalization is crucial for unsupervised feature selection, just as it is in the supervised case, e.g., we would like to get a good clustering, but we would not like to get a kernel $A_\alpha$ which supports any random clustering. This is exactly why we need to add the per-cluster prior: without it (using a score based on $L$) a

⁶ We find that the role of feature selection is not well understood, even by some very experienced researchers. For example, many people believe that SVM will always pick a hyperplane which uses only the relevant features because "it chooses the best hyperplane". If the reader shares similar views, we would like to refer him/her to [26, 6, 1].


Figure 1: Comparison of PCA, Q−α with the correct number of clusters (k = 4), and Q−α with k = n on a synthetic dataset. The first three variables are relevant. Only the first 50 variables out of the 203 are shown. The case with k = 4 produces sharper results than the case where k = n, but both succeed in selecting the relevant variables. Contrary to popular belief, PCA does not detect the relevant variables.


Figure 2: This figure demonstrates the importance of feature selection in the unsupervised setting, and that, although Q − α and PCA use similar optimization functions, they are very different in their ability to deal with irrelevant variables: (a) Three relevant dimensions out of 203. The rest of the dimensions are similar but were permuted to remove any class membership information; (b) PCA of only the relevant dimensions; (c) PCA of the whole 203 dimensions; (d) The first two columns of the Q matrix, which are similar to applying PCA after weighting according to the α weights recovered by the Q − α algorithm.

clustering where every point is a cluster by itself would be considered a good clustering. Using the per-cluster prior, each such one-point cluster would have a very low prior, and the score based on the likelihood $\hat L$ would be much lower.

Note that in the Q − α algorithm, the trace of the kernel is controlled by the constraint on the norm of $\alpha$. Let $f_i$, $i = 1..l$, be the $l$ rows of the matrix $\phi = [\Phi_1|\ldots|\Phi_n]$. The kernel matrix can be written as a sum of rank-one matrices, $\phi^\top\mathrm{diag}(\alpha)\phi = \sum_{k=1}^l \alpha_k f_k f_k^\top$, each with trace $\mathrm{Tr}(f_k f_k^\top) = f_k^\top f_k$, and so the trace of the kernel matrix is $\mathrm{Tr}(\phi^\top\mathrm{diag}(\alpha)\phi) = \sum_{k=1}^l \alpha_k f_k^\top f_k$. Before applying the Q − α algorithm we usually normalize each $f_k$ to have a norm of one, so that the results are independent of the scale of each feature. Hence the trace of the kernel matrix is just $\sum_l \alpha_l$. Under the constraint that $\alpha$ has a norm of one, this trace is larger when the $\alpha$ vector is more uniform. However, the maximization of the score function based on the squares of the eigenvalues of the kernel matrix encourages a sparse solution [31].

4.3 Set to Set similarities

Another machine learning tool that we consider is the construction of a similarity function between two sets of norm-1 vectors. Given a matrix $\phi$ whose columns are the vectors in our set $\{\Phi_i\}_{i=1}^n$, we can view it as a set of pure states, or alternatively as the mixed state $\Gamma_\Phi = (1/n)\phi\phi^\top = (1/n)\sum_{i=1}^n \Phi_i\Phi_i^\top$.

Given a new set $\{\Psi_j\}_{j=1}^m$, we would like to find a similarity between the two sets. Let $\psi$ be a matrix whose columns are the vectors of the new set. We can use, for example, either one of these similarity measures:

(a) $\frac{1}{m}\sum_{j=1}^m \Psi_j^\top\Gamma_\Phi\Psi_j = \frac{1}{nm}\sum_{i=1}^n\sum_{j=1}^m(\Phi_i^\top\Psi_j)^2 = \frac{1}{nm}\|\phi^\top\psi\|_F^2$

(b) $\frac{1}{m}\,\frac{\sum_{j=1}^m \Psi_j^\top\Gamma_\Phi\Gamma_\Phi\Psi_j}{\mathrm{Tr}(\Gamma_\Phi\Gamma_\Phi)} = \frac{1}{m\,\mathrm{Tr}(\phi\phi^\top\phi\phi^\top)}\sum_{i,j=1}^n\sum_{k=1}^m(\Phi_i^\top\Phi_j)(\Phi_i^\top\Psi_k)(\Phi_j^\top\Psi_k) = \frac{1}{m}\,\frac{\|\phi\phi^\top\psi\|_F^2}{\|\phi\phi^\top\|_F^2}$

These similarities are based on applying the Born rule in its simplest form. However, they are unnatural in the sense that if we model set elements as states, then it seems that we apply the Born rule not between a state and a measurement but between two states. The underlying reason is quite simple: the measurement that maximizes the distinguishability of a state ($\Phi$) from other states has the same vector representation as $\Phi$ [33].

Method (a) therefore measures the expectation of observing, on the second set, the measurements that best identify the pure states which appear in the first set. However, this does not give the best global measure of identifying the set $\phi$ in the following sense: it does not maximize the expectation $\frac{1}{n}\sum_{i=1}^n\Phi_i^\top\rho\Phi_i$ subject to $\rho$ being a density matrix. This expectation is maximized by letting $\rho = aa^\top$, where $a$ is the first eigenvector of $\Gamma_\Phi$⁷. However, measuring the similarity of two sets by considering just one observation does not produce good results; the underlying problem is over-fitting, and is somewhat similar to the need for regularization in the Expected Risk Minimization framework. Method (b) is a trade-off between maximizing the expectation above⁸ and having a more "round" similarity measure. The measurements used in similarity (b) are the columns of the matrix $\Gamma_\Phi$, which are weighted combinations of the measurements used in (a).

Although both methods were derived in an asymmetric way, that is, treating one set as the target set and the other as a set that needs to be distinguished from it, the similarity method (a) is a positive definite similarity measure between sets. Therefore it can be used together with any kernel-based learning algorithm. To see this property it is enough to note that $(1/nm)\sum_{ij}(\Phi_i^\top\Psi_j)^2 = (1/nm)\sum_{ij}\chi(\Phi_i)^\top\chi(\Psi_j) = (1/nm)\big(\sum_i\chi(\Phi_i)\big)^\top\big(\sum_j\chi(\Psi_j)\big)$, where $\chi$ is a mapping from a vector to the vector of all its second-order monomials. This positive definite similarity function between sets is different from the one described in [30], where the determinant of the matrix $\phi^\top\psi$ and similar functions were used.
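Both similarity measures reduce to a few matrix operations. Here is a minimal sketch (ours, under the reconstruction of (a) and (b) given above); the toy sets and function names are our own assumptions.

```python
import numpy as np

def set_similarity_a(phi, psi):
    """Similarity (a) above: (1 / (n m)) * ||phi^T psi||_F^2 (our sketch)."""
    n, m = phi.shape[1], psi.shape[1]
    return np.linalg.norm(phi.T @ psi, 'fro') ** 2 / (n * m)

def set_similarity_b(phi, psi):
    """Similarity (b) above: (1 / m) * ||phi phi^T psi||_F^2 / ||phi phi^T||_F^2 (our sketch)."""
    m = psi.shape[1]
    G = phi @ phi.T
    return np.linalg.norm(G @ psi, 'fro') ** 2 / (m * np.linalg.norm(G, 'fro') ** 2)

# Usage: compare a set to a perturbed copy of part of itself and to an unrelated set.
rng = np.random.default_rng(5)
phi = rng.standard_normal((10, 40)); phi /= np.linalg.norm(phi, axis=0, keepdims=True)
psi_close = phi[:, :20] + 0.05 * rng.standard_normal((10, 20))
psi_close /= np.linalg.norm(psi_close, axis=0, keepdims=True)
psi_far = rng.standard_normal((10, 20)); psi_far /= np.linalg.norm(psi_far, axis=0, keepdims=True)
print(set_similarity_a(phi, psi_close), set_similarity_a(phi, psi_far))
print(set_similarity_b(phi, psi_close), set_similarity_b(phi, psi_far))
```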

Using the similarity between sets for retrieval. The retrieval problem we consider is the following: we are given a query set A containing at least one example from a specific category, and a set B of unlabeled examples. Our task is to rank all the points in set B from those which

⁷ $\frac{1}{n}\sum_{i=1}^n\Phi_i^\top\rho\Phi_i = \mathrm{Tr}(\Gamma_\Phi\rho) \le \max_{\text{unitary } U}\mathrm{Tr}(\Gamma_\Phi U^\top\rho U)$, which according to Theorem 3.1 of [8] is less than the dot product between the vectors that contain the eigenvalues of the matrices $\Gamma_\Phi$ and $\rho$ (i.e., it is less than $\sum_i\lambda_i(\Gamma_\Phi)\lambda_i(\rho)$). To maximize that expression under the constraint that the sum of the eigenvalues is one, we would like $\rho$ to have just one non-vanishing eigenvalue. In the case $\rho = aa^\top$ we get a strict equality.
⁸ The expectation which is the similarity of the set $\phi$ to itself by method (a) is given by $\frac{\mathrm{Tr}(\phi\phi^\top\phi\phi^\top)}{\mathrm{Tr}(\phi\phi^\top)}$, and the one given by method (b) is $\frac{\mathrm{Tr}(\phi\phi^\top\phi\phi^\top\phi\phi^\top)}{\mathrm{Tr}(\phi\phi^\top\phi\phi^\top)}$. The latter is never smaller than the former, since for any set of positive eigenvalues of the matrix $\phi\phi^\top$, denoted by $\{\lambda_i\}$, we have $\big(\mathrm{Tr}(\phi\phi^\top\phi\phi^\top)\big)^2 - \mathrm{Tr}(\phi\phi^\top)\,\mathrm{Tr}(\phi\phi^\top\phi\phi^\top\phi\phi^\top) = \big(\sum_i\lambda_i^2\big)^2 - \sum_i\lambda_i\sum_j\lambda_j^3 = -\sum_{i>j}(\lambda_i-\lambda_j)^2\lambda_i\lambda_j \le 0$.

are more likely to be of the same category as the examples in set A to those that are less likely. Below we offer two methods, RMa and RMb.

The first method (RMa) treats each point as a set of one point and ranks each point in B by

its similarity, using the set similarity method (a), to the set of labeled points A. The points in set B are then ranked from the most similar to the least similar. This method is very simple, but it does not incorporate information on the structure of set B until the last stage of sorting the per-point similarity scores.

The second method (RMb) weights each point $i$ in set B by a weight $\sqrt{\alpha_i}$, and tries to maximize the similarity between the resulting set and set A⁹. Thus we search for a vector $\alpha$ which maximizes $\frac{\|\phi\,\mathrm{diag}(\alpha)\,\phi^\top\psi\|_F^2}{\|\phi\,\mathrm{diag}(\alpha)\,\phi^\top\|_F^2}$, where the matrix $\phi$ holds the elements of set B and the matrix $\psi$ holds the elements of set A. Both the numerator and the denominator are clearly bilinear in $\alpha$, and so this can be written as $\frac{\alpha^\top G\alpha}{\alpha^\top H\alpha}$, the maximization of which is just a generalized eigenvalue problem, and is given up to scale. Following the algebra one can verify that $G_{ij} = (\Phi_i^\top\Phi_j)\sum_k(\Phi_i^\top\Psi_k)(\Phi_j^\top\Psi_k)$ and $H_{ij} = (\Phi_i^\top\Phi_j)^2$.

The ranking is obtained by sorting the values of $\alpha$. We set the sign of the value of $\alpha$ which has the largest magnitude to be positive, and sort all of the $\alpha$ values to get a ranking of all the examples in set B. Note that, although it is not guaranteed that the entries of the vector $\alpha$ will all be of the same sign, in almost all of the experiments we ran this was the case. If some of the values happened to differ in sign from the value with the largest magnitude, in our experience they were of very small magnitude.

Note that RMb examines not only the similarity of single elements in set B to set A but tries to find the subset of B which is the most similar to A. Also, note that the set-to-set similarity (b) is normalized such that sets which are very similar to themselves (highly clustered) are penalized in the denominator of this similarity, giving balance to this similarity score.
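A compact sketch (ours) of the RMb ranking, under the $G$ and $H$ matrices reconstructed above; the use of SciPy's generalized eigensolver and the small ridge term are our own assumptions.

```python
import numpy as np
from scipy.linalg import eigh  # assumption: SciPy is available for the generalized problem

def rmb_ranking(phi_B, psi_A):
    """Sketch (ours) of the RMb ranking described above.

    phi_B : d x nB matrix of candidate points (set B, columns).
    psi_A : d x nA matrix of query points (set A, columns).
    Solves max_alpha (alpha^T G alpha) / (alpha^T H alpha) as a generalized
    eigenvalue problem and ranks the points of B by the resulting weights.
    """
    K = phi_B.T @ phi_B             # K[i, j] = Phi_i^T Phi_j
    M = phi_B.T @ psi_A             # M[i, k] = Phi_i^T Psi_k
    G = K * (M @ M.T)               # G[i, j] = (Phi_i^T Phi_j) * sum_k (Phi_i^T Psi_k)(Phi_j^T Psi_k)
    H = K ** 2                      # H[i, j] = (Phi_i^T Phi_j)^2
    H += 1e-8 * np.eye(H.shape[0])  # small ridge for numerical stability (our choice)
    w, vecs = eigh(G, H)            # generalized symmetric eigenproblem G v = w H v
    alpha = vecs[:, -1]             # eigenvector of the largest generalized eigenvalue
    if alpha[np.argmax(np.abs(alpha))] < 0:   # sign convention from the text
        alpha = -alpha
    return np.argsort(-alpha)       # indices of B, most relevant first
```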

⁹ A similar method using set similarity (a) would result in method RMa. This is because the solution of $\arg\max_\alpha \|\psi^\top\phi\,\mathrm{diag}(\sqrt{\alpha})\|_F^2$, s.t. $\|\alpha\| = 1$, is proportional to the vector with elements $\|\psi^\top\Phi_i\|_F^2$.

19 4.4 Classification

There are several ways in which one can derive classification algorithms based on Born rule probabilities. Here we will derive an existing classifier which is very similar to the use of kernel density estimators [14] for classification. This is an effective classification method, and we are going to base it on the set-to-set similarity (a) given above. Our derivation is designed for norm-1 vectors only; in practice we normalize the vectors in accordance with the kernel used (e.g., normalize each vector by its norm for the linear kernel), or simply use an RBF kernel.

Given a set of $n$ positive examples, as columns of a matrix $\phi$, we can compute for any new example $\Psi$ the similarity according to method (a) above, $O^+(\Psi) = \frac{1}{n}\|\phi^\top\Psi\|^2$. This measure of similarity is a normalized expectation, and we can use a simple ratio test to classify the point $\Psi$, i.e., we classify $\Psi$ to have a positive label if $\frac{O^+(\Psi)}{1 - O^+(\Psi)} > 1$.

Given a set of negative examples we can build a second measure $O^-$ and combine the two using the naïve Bayes approach. The classification engine we use in the experiments below simply checks whether $\frac{O^+(\Psi)}{1-O^+(\Psi)}\cdot\frac{1-O^-(\Psi)}{O^-(\Psi)} > 1$. This is a very simple plug-in decision rule, which requires virtually no training. As a plug-in classifier it generalizes well [11].
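The whole decision rule fits in a few lines. A minimal sketch (ours) for the linear-kernel case follows; the class name and the normalization details are our own assumptions, and the sketch assumes the similarities stay strictly between 0 and 1.

```python
import numpy as np

class BornClassifier:
    """Sketch (ours) of the plug-in two-class rule described above, linear kernel."""

    def fit(self, X_pos, X_neg):
        # Columns are examples; each column is normalized to norm 1 (linear-kernel case).
        self.pos = X_pos / np.linalg.norm(X_pos, axis=0, keepdims=True)
        self.neg = X_neg / np.linalg.norm(X_neg, axis=0, keepdims=True)
        return self

    @staticmethod
    def _similarity(phi, psi):
        # O(Psi) = (1/n) ||phi^T Psi||^2, computed for every column of psi at once.
        return np.sum((phi.T @ psi) ** 2, axis=0) / phi.shape[1]

    def predict(self, X):
        psi = X / np.linalg.norm(X, axis=0, keepdims=True)
        o_pos = self._similarity(self.pos, psi)
        o_neg = self._similarity(self.neg, psi)
        # Positive label when the combined odds ratio exceeds one.
        ratio = (o_pos / (1 - o_pos)) * ((1 - o_neg) / o_neg)
        return (ratio > 1).astype(int)

# Usage sketch: clf = BornClassifier().fit(X_pos, X_neg); labels = clf.predict(X_test)
```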

5 Experiments

The clustering algorithm we suggest amounts to a variant of spectral clustering, which was extensively studied [27, 18]. Therefore we will not present extensive new clustering experiments.

However, there is a point we would like to make about the classification of out-of-sample points. Since we use a model-based approach, our clustering can be applied to any new point without re-clustering. This is in contrast to the original NJW algorithm and to the original Laplacian Eigenmaps [4]. To overcome this problem, a regularization term was recently added to the Laplacian Eigenmaps algorithm [5].

In our framework out-of-sample extensions come naturally, since for any new point Ψ we can compute our model-based probabilities. These probabilities can be expressed as functions

20 1.2 1.2

1 1

0.8 0.8

0.6 0.6

0.4 0.4

0.2 0.2

0 0

−0.2 −0.2

−0.4 −0.4

−0.6 −0.6

−1 −0.5 0 0.5 1 1.5 2 2.5 −1 −0.5 0 0.5 1 1.5 2 2.5 (a) (b)

Figure 3: This figure demonstrates out-of-sample classification using the Born rule spectral clustering scheme. (a) The original training dataset of the two moons example of [5]. (b) The resulting clustering, together with an out-of-sample classification of the nearby 2D plane.

of probabilities of the form $(u_i^\top\Psi)^2$, where $u_i$ is an eigenvector of the input data's covariance matrix. Since the points in our kernel space can have a very high or perhaps infinite dimension, we cannot compute the eigenvectors $u_i$ directly. However, we can use the equality used in kernel PCA: $u_i = (1/s_i)\phi v_i$ (as before, the $v_i$ are the eigenvectors of the kernel $\phi^\top\phi$, the $s_i$ are the singular values, and the equality is rooted in the singular value decomposition). To compute the dot product $u_i^\top\Psi = (1/s_i)v_i^\top\phi^\top\Psi$ we only need to know the elements of the vector $\phi^\top\Psi$. This vector contains the value of the kernel function computed between every element in our training set (the columns of $\phi$) and the new example $\Psi$.

The above out-of-sample classification scheme is demonstrated in Fig. 3. We used the two moons example of [5], and a Gaussian kernel similar to the one used there. Unlike [5], there is no need to have any labeled examples.
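A short sketch (ours) of this kernel-based out-of-sample computation; the Gaussian-kernel helper, its bandwidth, and the random data are our own assumptions.

```python
import numpy as np

def out_of_sample_memberships(K_train, k_new, k):
    """Sketch (ours) of the out-of-sample computation described above.

    K_train : n x n kernel matrix phi^T phi of the training points.
    k_new   : length-n vector phi^T Psi of kernel values between the training
              points and a new example Psi.
    k       : number of clusters.
    Returns (u_i^T Psi)^2 for i = 1..k, via u_i^T Psi = (1/s_i) v_i^T (phi^T Psi).
    """
    eigvals, V = np.linalg.eigh(K_train)
    order = np.argsort(eigvals)[::-1][:k]           # top-k eigenpairs of the kernel
    s = np.sqrt(np.maximum(eigvals[order], 1e-12))  # singular values of phi
    proj = (V[:, order].T @ k_new) / s              # u_i^T Psi for each i
    return proj ** 2

# Usage with a Gaussian (RBF) kernel; sigma is chosen arbitrarily here.
def rbf(X, Y, sigma=1.0):
    d2 = ((X[:, :, None] - Y[:, None, :]) ** 2).sum(axis=0)
    return np.exp(-d2 / (2 * sigma ** 2))

rng = np.random.default_rng(6)
X_train = rng.standard_normal((2, 50))
x_new = rng.standard_normal((2, 1))
p = out_of_sample_memberships(rbf(X_train, X_train), rbf(X_train, x_new)[:, 0], k=2)
```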

5.1 Feature selection experiments

The feature selection method that we suggested is the Q − α method, which was shown to be effective on vision, genomics and on other datasets [31, 23, 32]. Still, as it is less known and

accepted than the spectral clustering algorithms, we will present new experiments. In these new experiments we focus on supervised tasks to check the quality of our unsupervised feature selection. As it turns out, Q − α outperforms many supervised feature selection algorithms (see also [32]). Experiments showing the performance of the Q − α algorithm in the unsupervised case are presented in the conference papers [31, 23]. Since we did not want to optimize over the free parameter used for the feature selection, k (generally referred to as the number of classes), we present results for the case where k = n (n being the number of examples at hand).

In this case the Q − α algorithm is non-iterative, so there is also a considerable saving of run time.

Face Recognition. We first applied Q − α to the task of face recognition. We used two publicly available datasets: the YALE dataset [34] and the AR dataset [17]. The YALE dataset contains 15 different persons, each photographed 11 times under different illumination, with different expressions, and with or without glasses. For the AR dataset we used only the 25 males, each having 52 images. For both datasets we created 20 instances of similar experiments, where three images per person were picked randomly to be the training set for that person. The rest served as the test set.

We compared several algorithms: the eigenface method [25], which uses PCA to reduce the dimension of the face images; the fisherface method [3], which applies multi-class Fisher discriminant analysis (FDA) to face images after they have been reduced in dimension using PCA; both methods after applying conventional feature selection algorithms; an application of eigenfaces after variable weighting using the Q − α algorithm; and an application of fisherfaces after variable weighting using the Q − α algorithm. All methods were tested on gray-level images and on wavelet features. Results in the table are the maximal score out of these two options; these results are higher for gray values only when using the variants without any feature selection.

Q − α and the other feature selection methods did not seem to work well when dealing with gray-level values directly¹⁰. In order to remove any doubt, we would like to emphasize again that the results were not biased by using different kinds of features. Each method was tested on

¹⁰ At the heart of this phenomenon lies a basic discrepancy between feature selection and dimensionality reduction algorithms. While feature selection algorithms (Fisher score, Q − α, and in a sense AdaBoost) do well with uncorrelated data, they perform much worse when many of the features are correlated. Dimensionality reduction algorithms (PCA, FDA, and in a sense SVM) thrive on correlation, but fail when dealing with a large number of uncorrelated features. Gray values are highly correlated, therefore PCA, FDA and SVM work well on them. Wavelets are not correlated, and therefore are more suitable for boosting, feature selection, and naïve Bayesian approaches. As shown in the experiments, the Q − α algorithm enables PCA, SVM and FDA to work with wavelet features (in the past, wavelets and SVM were combined for pedestrian detection [19]; for that end a correlated over-complete wavelet basis was used).

Dataset   PCA          FS + PCA     Q − α + PCA   Fisher       FS + Fisher   Q − α + Fisher
Yale      70.0 ± 2.5   67.6 ± 4.9   81.2 ± 2.4    81.8 ± 3.1   83.4 ± 2.7    88.2 ± 2.4
AR        62.9 ± 2.4   62.9 ± 4.4   70.6 ± 1.8    90.4 ± 1.8   90.9 ± 1.6    94.6 ± 0.8

Table 1: Success rates for the face recognition task on two databases. For each experiment three training images per person were used, and the rest were used for testing. Each experiment was repeated 20 times; the results in the table are mean performance and standard deviation. The algorithms are PCA (eigenfaces), PCA following a traditional supervised feature selection (best out of Pearson coefficients, Fisher criterion score, and the KS test), PCA following Q − α weighting of the variables, and three similar variants of FDA (fisherfaces).

both types, and the reported performance is the best result of the two. In Table 1, each column is a different algorithm and each row is a different dataset. The values in each cell are the average recognition rate (in percent) over 20 runs and the standard deviation of the results. Results are shown for all methods for the PCA dimension which gave the best results. For feature selection (FS) we show the best out of three supervised methods

(Pearson coefficients, Fisher criterion score, and the Kolmogorov-Smirnov test) and for the best percentile of kept features. It turns out that for these datasets, the Fisher criterion score did the best out of these three, and this performance occurred when keeping 40-50% of the features.

As can be seen, the use of wavelets together with the Q − α algorithm improves the results both for eigenfaces and for fisherfaces. For the YALE dataset, running FDA on the Q − α weighting of wavelet features was superior to the FDA results on gray levels in 75% of the tries. The methods gave the same performance in 15% of the tries, and in only 10% of the tries did the plain FDA method win. For the AR dataset, the method using wavelets + Q − α + FDA gave the best performance in all runs.

Place recognition. For this experiment we used the data collected by Torralba, Murphy, Freeman and Rubin in order to compare supervised learning algorithms. Given an image, the goal is to put it into one of ten categories: conference-room, corridor, elevator lobby, inside elevator, kitchen, lab, office, open area, plaza and street. The data consists of 50757 frames collected over 17 sequences using a wearable camera. While the method of [24] achieved good identification results by using temporal information, the task of identifying a single frame is quite difficult; for example, it is not easy to distinguish a lab from an office or a conference room. Each frame in the place recognition dataset was represented by a vector of 384 dimensions, consisting of the output of steerable filters applied to the input 120 × 160 image at several scales. This representation, together with the frame annotation, was made publicly available by the authors of [24].

We conducted repeated one-vs-all experiments. In each experiment 100 random examples of one place category served as the set of positive training examples, and 100 random examples from the rest of the frames served as negative examples. The results were then tested on a much larger set of testing examples, which contained the same number of positive and negative examples. Each such one-vs-all experiment was repeated three times.

In [24] the authors used the gentle boost algorithm on top of PCA. In our experiments, SVM

always outperforms the boosting algorithm (we used the same implementation, provided to us by Torralba). This is probably due to the relatively small number of training examples used in our experiments. Also, in all our experiments, PCA did not help at all (not even for boosting), hence we used the original 384-dimensional data. The results we got are 26.67% error for an 80-dimensional PCA followed by a linear SVM, 25.10% error for a linear SVM, and 23.17% error for Q − α weighting of the features followed by an SVM. We could improve the results a bit by applying a novel supervised version of Q − α (not reported here) and get 22.83% error.

5.2 Retrieval experiments

In our retrieval experiments we compared three methods: ranking by distance to the closest element in the query set A (NN), which is probably the most commonly used method; ranking using RMa (the method based on set-to-set similarity (a)); and RMb (the method based on set-to-set similarity (b)). To avoid any bias in favor of our methods we only used the linear kernel (NN could not be improved by a Gaussian kernel, and we wanted to avoid the appearance of tilting the results by using multiple tests with several parameters).

In the first set of experiments, we used datasets from the UCI machine learning repository.

Method               NN                  RMa                 RMb
Split ratio          5%    10%   30%     5%    10%   30%     5%    10%   30%
Dermatology          .51   .52   .52     .87   .87   .88     .83   .80   .78
Ecoli                .97   .97   .98     .92   .92   .95     .87   .91   .95
Glass                .44   .38   .36     .43   .36   .35     .62   .60   .62
Letter Recognition   .68   .68   .68     .73   .74   .74     .90   .94   .96
Segmentation         .96   .96   .96     .90   .90   .95     .84   .84   .85
Wine                 .94   .93   .95     .80   .85   .88     .05   .04   .04
Yeast                .58   .58   .60     .57   .57   .60     .53   .54   .63

Table 2: Retrieval experiments on several UCI datasets.

In each experiment only 40% of each dataset was used (this is important to avoid bias because RMb uses all the examples to determine the ranking). We split each partial (40%) dataset into two sets, A and B, for various sizes of set A. A split ratio of 20% means that 20 percent of the largest class was used as set A and the rest of all the examples as set B. Each such experiment was repeated 20 times, and the average results are shown below. As a measure of success we used the area under the ROC curve generated by correct and incorrect retrievals from set B. The results are summarized in Table 2. They show that for many datasets the methods RMa and RMb perform significantly better than NN. However, this is not the case for all datasets.

Sometimes NN does better.

Note that, in general, we should not expect monotonic performance as set A grows, since the ratio of positive examples in set B becomes worse. Also, note that sometimes method RMb can "lock" onto the wrong class. This happens if the negative examples contain a very tight cluster, while the positive ones are very loosely distributed.

Another set of retrieval experiments was conducted by Stan Bileschi. In our dataset we had 80 image patches taken from car images. These patches ("parts") were selected automatically using an interest operator (some are not specific to any car part like a wheel or mirror, and are just patches around the point given by the interest operator). For each part we got 10-50 possible matches in our image dataset, using normalized cross correlation with a predefined threshold. Our goal was to use these possible matches to train a part classifier. We were hoping to get 20

Method        NN     RMa    RMb
Gray values   2.21   2.07   2.46
Wavelets      2.16   2.69   1.44

Table 3: Retrieval results for the car parts dataset. Entries show the average, over all parts, of the ratio between the number of correct retrievals in the first 20 answers and the number of retrievals expected by chance.

We were hoping to get 20 positive training images for each part, which is enough to train a very reliable classifier. For verification purposes we manually marked all correct matches; in each candidate set there were between 5 and 20 of those. An example is shown in Fig. 4. Since every part had a different number of good and false candidates, we needed a normalized score for retrieval success, which is described next. Each algorithm is asked to provide 20 retrievals, and a high score is given to an algorithm that does better than what would be expected from a random pick. The score is the ratio between the number of good matches produced by the algorithm and the number expected by chance under a random pick; e.g., if there are 15 good examples out of 30 candidate matches, we expect to get 10 out of 20 by chance, and if we get 15 correct matches the score is 1.5. The best average score one can hope for on this scale for our specific dataset is about 3.
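As a sketch, the chance-normalized score can be computed as follows (names are ours; the arithmetic matches the example above):

```python
def retrieval_score(n_correct_in_top_k, n_good, n_candidates, k=20):
    """Ratio between the number of correct retrievals among the first k answers
    and the number expected from a random pick of k candidates."""
    expected_by_chance = k * n_good / n_candidates
    return n_correct_in_top_k / expected_by_chance

# 15 good examples out of 30 candidates: a random pick of 20 yields 10 good ones
# in expectation, so retrieving 15 correct matches gives a score of 1.5.
assert retrieval_score(15, 15, 30) == 1.5
```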

To apply the set-to-set comparison methods, we created a small query set consisting of the original example and the two closest examples; this set of three examples forms set A. In order to get reliable retrievals we tried several algorithms (for example, dimensionality reduction followed by NN, geodesic NN, and methods similar to tangent distance), but NN (using normalized cross correlation) seems to do just as well on this dataset. We tried two types of features: the gray level values of the image (stacked as one long vector), and Haar wavelet features. For both types of features NN does reasonably well, but not well enough for our purposes: it scores around 2.2 for both gray level and wavelet features. The method denoted RMa does much better for wavelets, scoring almost 2.7, while the other set-to-set comparison method, RMb, does better than NN only for gray level features, where it scores more than 2.4. The full results are summarized in Table 3.
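A sketch of how such a three-element query set might be built using normalized cross correlation (this is our illustration, not the original implementation; patches are assumed to be flattened into vectors):

```python
import numpy as np

def ncc(u, v):
    """Normalized cross correlation between two flattened patches."""
    u = u - u.mean()
    v = v - v.mean()
    return float(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12)

def build_query_set(seed_patch, candidate_patches):
    """Set A = the original example plus the two candidates with highest NCC to it."""
    scores = np.array([ncc(seed_patch, p) for p in candidate_patches])
    two_closest = np.argsort(-scores)[:2]
    return [seed_patch] + [candidate_patches[i] for i in two_closest]
```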

Figure 4: An example of a part in our dataset. A green box marks a good detection; red boxes mark false detections. The task was to automatically find all good detections given the one example in the upper left corner.

Figure 5: A comparison of several retrieval methods (NN, mean distance to the query set, RMa, and RMb) applied to the place recognition dataset. The X axis is the number of images used as the query set A, and the Y axis shows the resulting area under the ROC curve for each of the four methods. While RMa fails to achieve good results on this dataset, RMb does best; NN and mean distance to the query set do not perform well.

A third set of retrieval experiments was conducted using the place recognition dataset of [24] described above. In each experiment one class out of the ten was picked as the query class and another as the distraction class. 400 examples from each class were picked to form a set of 800 examples (this number was limited because some classes do not have many examples). The query set A consisted of one to 18 examples of the query class, while set B contained all other examples. We compared four methods: in addition to NN, RMa, and RMb, we ranked the points in set B according to the average distance between each point and all the points in set A. The results in Figure 5 show the area under the ROC curve, averaged over 60 runs.
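A sketch of one run of this comparison for the two simple baselines, NN and mean distance (RMa and RMb are defined earlier in the paper and are not reproduced here; all names are ours, and at least 400 examples per class are assumed):

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.metrics import roc_auc_score

def rank_scores(A, B, aggregate):
    # aggregate is np.min for NN and np.mean for the mean-distance criterion
    return -aggregate(cdist(B, A), axis=1)

def one_run(X_query_class, X_distraction_class, n_query, rng):
    q = rng.permutation(len(X_query_class))[:400]
    d = rng.permutation(len(X_distraction_class))[:400]
    A = X_query_class[q[:n_query]]                      # 1 to 18 query examples
    B = np.vstack([X_query_class[q[n_query:]], X_distraction_class[d]])
    labels = np.r_[np.ones(400 - n_query), np.zeros(400)]
    return {name: roc_auc_score(labels, rank_scores(A, B, agg))
            for name, agg in [("NN", np.min), ("mean distance", np.mean)]}
```

Averaging the resulting AUC values over 60 such runs would give curves of the kind shown in Figure 5.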

5.3 Classification experiments

We performed two-class classification experiments in which we compared the classification algorithm derived in Section 4 to SVM with linear and RBF kernels and to the nearest neighbor classifier (for which the kernel is irrelevant). The parameter σ of the RBF kernel was searched for each experiment over a large range using cross validation; in general, both SVM and the Born classifier showed the same behavior when this parameter was changed. For SVM we show the results obtained with the best value of the parameter C out of the set [0, 0.1, 1, 10, 100, 1000]. Table 4 reports the error rates in percent, averaged over 20 trials, together with their standard deviations. In each experiment 40% of the data was used for training and 60% for testing, except where noted, when this task seemed too easy. It is apparent from the results that the Born classification method does as well as SVM and NN on most datasets.

Method              | Linear SVM   | RBF SVM      | Linear Born  | RBF Born     | NN
Dermatology (*)     | 3.1 ± 1.0    | 3.0 ± 1.1    | 9.2 ± 1.5    | 3.1 ± 1.0    | 5.6 ± 4.7
Ecoli               | 2.3 ± 1.2    | 1.7 ± 0.8    | 4.4 ± 0.8    | 2.5 ± 1.0    | 2.9 ± 0.8
Glass               | 29.9 ± 3.6   | 22.1 ± 5.0   | 25.6 ± 5.1   | 24.3 ± 3.5   | 22.3 ± 4.6
Letter Recognition  | 1.8 ± 0.7    | 0.2 ± 0.1    | 0.3 ± 0.3    | 0.3 ± 0.2    | 0.3 ± 0.3
Segmentation (*)    | 0.3 ± 0.8    | 0.3 ± 0.8    | 5.7 ± 7.0    | 0 ± 0        | 0 ± 0
Wine                | 5.2 ± 3.0    | 8.2 ± 2.3    | 40.0 ± 26.1  | 13.5 ± 4.8   | 9.2 ± 2.4
Yeast               | 64.0 ± 24.8  | 33.8 ± 1.3   | 34.9 ± 1.6   | 34.6 ± 1.6   | 40.5 ± 1.5

Table 4: Classification experiments on several UCI datasets. The datasets were randomly divided into 40% training and 60% testing, and each experiment was repeated 20 times. (*) For these datasets only 10% of the samples were used for training (otherwise the problem is too easy).
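A minimal sketch of the comparison harness for the baseline methods, using scikit-learn (the Born classifier of Section 4 is assumed to be available as a separate plug-in and is not reimplemented here; the value C = 0 from the grid above is omitted because it is not a valid SVM parameter in this sketch):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

def one_classification_trial(X, y, train_frac=0.4, seed=0):
    Xtr, Xte, ytr, yte = train_test_split(
        X, y, train_size=train_frac, stratify=y, random_state=seed)
    errors = {}
    # linear SVM: pick the best C from the grid by cross validation
    lin = GridSearchCV(SVC(kernel="linear"), {"C": [0.1, 1, 10, 100, 1000]}, cv=5)
    errors["Linear SVM"] = 1 - lin.fit(Xtr, ytr).score(Xte, yte)
    # RBF SVM: sigma enters through gamma = 1 / (2 * sigma^2), searched over a large range
    rbf = GridSearchCV(SVC(kernel="rbf"),
                       {"C": [0.1, 1, 10, 100, 1000], "gamma": np.logspace(-4, 2, 13)}, cv=5)
    errors["RBF SVM"] = 1 - rbf.fit(Xtr, ytr).score(Xte, yte)
    # nearest neighbor classifier (no kernel parameter)
    errors["NN"] = 1 - KNeighborsClassifier(n_neighbors=1).fit(Xtr, ytr).score(Xte, yte)
    return errors
```

Repeating such trials 20 times and reporting the mean and standard deviation of the error mirrors the protocol behind Table 4.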

6 Summary

In this work we demonstrated how a simple probability rule can justify simple and effective algebraic algorithms. We explored in detail existing and new algorithms for spectral clustering, feature selection, data retrieval, and classification. The same model-based derivation applied across all of these problems, which suggests that the approach is broadly applicable.

We hope that the techniques developed in this paper will yield further insights into other successful existing algorithms, and that the proposed framework will serve as a constructive path for the derivation of future algorithms.

References

[1] H. Almuallim and T.G. Dietterich. Learning with many irrelevant features. Proc. 9th Nat. Conf. on AI, 1991.

[2] H. Barnum, C.M. Caves, J. Finkelstein, C.A. Fuchs, and R. Schack. Quantum Theory from Decision Theory? Proc. of the Roy. Soc. of London A456, 2000.

[3] P.N. Belhumeur, J.P. Hespanha, and D.J. Kriegman. Eigenfaces vs. Fisherfaces: Recognition Using Class Specific Linear Projection. PAMI 19(7), 1997.

[4] M. Belkin and P. Niyogi. Laplacian Eigenmaps for Dimensionality Reduction and Data Representation. Neural Computation 14, 2002.

[5] M. Belkin, P. Niyogi, and V. Sindhwani. Manifold Regularization: a Geometric Framework for Learning from Examples. The University of Chicago CS Technical Report TR-2004-06.

[6] A. Blum and P. Langley. Selection of relevant features and examples in machine learning. AI, 97(1-2), 1997.

[7] O. Bousquet and D.J.L. Herrmann. On the Complexity of Learning the Kernel Matrix. NIPS, 2003.

[8] I.D. Coope and P.F. Renaud. Trace Inequalities with Applications to Orthogonal Regression and Matrix Nearness Problems. Research Report UCDMS2000/17, 2000.

[9] D. Deutsch. Quantum Theory of Probability and Decisions. Proc. of the Roy. Soc. A455, 1999.

[10] L. Devroye. A Course in Density Estimation. Birkhäuser, 1987.

[11] L. Devroye, L. Györfi, and G. Lugosi. A Probabilistic Theory of Pattern Recognition. Springer, 1996.

[12] A. Gleason. Measures on the closed subspaces of a Hilbert space. Journal of Mathematics and Mechanics 6, 885-894, 1957.

[13] G.H. Golub and C.F. Van Loan. Matrix Computations. Johns Hopkins University Press, 1989.

[14] T. Hastie, R. Tibshirani, and J.H. Friedman. The Elements of Statistical Learning. Springer, 2001.

[15] D. Horn and A. Gottlieb. Algorithms for Data Clustering in Pattern Recognition Problems based on Quantum Mechanics. Physical Review Letters 88(1), 2002.

[16] D. MacKay. Introduction to Gaussian Processes (a review paper). Available at: http://www.inference.phy.cam.ac.uk/mackay/GP/

[17] A.M. Martinez and R. Benavente. The AR face database. Tech. Rep. 24, CVC, 1998.

[18] A.Y. Ng, M.I. Jordan, and Y. Weiss. On Spectral Clustering: Analysis and an algorithm. NIPS, 2001.

[19] M. Oren, C. Papageorgiou, P. Sinha, E. Osuna and T. Poggio. Pedestrian Detection Using Wavelet Templates. In CVPR 1997.

[20] P. Perona and W. T. Freeman. A factorization approach to grouping. In ECCV, 1998.

[21] S. Saunders. Operational Derivation of the Born Rule. In submission.

[22] B. Schölkopf and A.J. Smola. Learning with Kernels. The MIT Press, 2002.

[23] A. Shashua and L. Wolf. Kernel Feature Selection with Side Data using a Spectral Approach. Proc. of the European Conference on Computer Vision (ECCV), May 2004.

[24] A. Torralba, K.P. Murphy, W.T. Freeman, and M.A. Rubin. Context-based vision system for place and object recognition. IEEE Intl. Conference on Computer Vision (ICCV), 2003.

[25] M. Turk and A. Pentland. Face Recognition Using Eigenfaces. In CVPR, 1991.

[26] J. Weston, S. Mukherjee, O. Chapelle, M. Pontil, T. Poggio, and V. Vapnik. Feature Selection for SVMs. NIPS, 2001.

[27] D. Verma and M. Meila. A comparison of spectral clustering algorithms. UW CSE T.R., 2003.

[28] D. Wallace. Everettian Rationality: defending Deutsch's approach to probability in the Everett interpretation. Studies in the History and Philosophy of Modern Physics 34, 2003.

[29] Y. Weiss. Segmentation using eigenvectors: a unifying view. In ICCV, 1999.

[30] L. Wolf and A. Shashua. Learning over Sets using Kernel Principal Angles. In JMLR, 4, 2003.

[31] L. Wolf and A. Shashua. Direct feature selection with implicit inference. ICCV, 2003.

[32] L. Wolf, A. Shashua, and S. Mukherjee. Selecting Relevant Genes with a Spectral Approach. AI Memo AIM-2004-002.

[33] W.K. Wootters. Statistical distance and Hilbert space. In Physical Review D, 1991.

[34] Yale Univ. Face Database. Available at http://cvc.yale.edu/projects/yalefaces/yalefaces.html

[35] S. Zhong and J. Ghosh. A Unified Framework for Model-based Clustering. JMLR 4, 2003.
