Latent Coincidence Analysis: A Hidden Variable Model for Distance Metric Learning

Matthew Der and Lawrence K. Saul
Department of Computer Science and Engineering
University of California, San Diego
La Jolla, CA 92093
{mfder,saul}@cs.ucsd.edu

Abstract

We describe a latent variable model for supervised dimensionality reduction and distance metric learning. The model discovers linear projections of high dimensional data that shrink the distance between similarly labeled inputs and expand the distance between differently labeled ones. The model's continuous latent variables locate pairs of examples in a latent space of lower dimensionality. The model differs significantly from classical factor analysis in that the posterior distribution over these latent variables is not always multivariate Gaussian. Nevertheless we show that inference is completely tractable and derive an Expectation-Maximization (EM) algorithm for parameter estimation. We also compare the model to other approaches in distance metric learning. The model's main advantage is its simplicity: at each iteration of the EM algorithm, the distance metric is re-estimated by solving an unconstrained least-squares problem. Experiments show that these simple updates are highly effective.

1 Introduction

In this paper we propose a simple but novel model to learn informative linear projections of multivariate data. Our approach is rooted in the tradition of latent variable modeling, a popular methodology for discovering low dimensional structure in high dimensional data. Two well-known examples of latent variable models are factor analyzers (FAs), which recover subspaces of high variance [1], and Gaussian mixture models (GMMs), which reveal clusters of high density [2]. Here we describe a model that we call latent coincidence analysis (LCA). The goal of LCA is to discover a latent space in which metric distances reflect meaningful notions of similarity and difference.

We apply LCA to two problems in distance metric learning, where the goal is to improve the performance of a classifier—typically, a k-nearest neighbor (kNN) classifier [3]—by a linear transformation of its input space. Several previous methods have been proposed for this problem, including neighborhood component analysis (NCA) [4], large margin nearest neighbor classification (LMNN) [5], and information-theoretic metric learning (ITML) [6]. These methods—all of them successful, all of them addressing the same problem—raise the obvious question: why yet another?

One answer is suggested by the different lineages of previous approaches. NCA was conceived as a supervised counterpart to stochastic neighbor embedding [7], an unsupervised method for dimensionality reduction. LMNN was conceived as a kNN variant of support vector machines [8]. ITML evolved from earlier work in Bregman optimization—that of minimizing the LogDet divergence subject to linear constraints [9]. Perhaps it is due to


Figure 1: Bayesian network for latent coincidence analysis. The inputs x, x′ ∈ ℝ^d are mapped into latent variables z, z′ ∈ ℝ^p, and the binary label y detects whether z and z′ coincide in the latent space.

2 Model

We begin by describing the probabilistic model for LCA. Fig. 1 shows the model's representation as a Bayesian network. There are three observed variables: the inputs x, x′ ∈ ℝ^d and the binary label y ∈ {0, 1}. The two inputs are mapped by the same linear transformation W ∈ ℝ^{p×d} into latent variables z, z′ ∈ ℝ^p of (typically lower) dimensionality p ≤ d, subject to additive Gaussian noise with variance σ²:

P(z|x) = (2πσ²)^(−p/2) exp( −‖z − Wx‖² / (2σ²) ),   (1)

P(z′|x′) = (2πσ²)^(−p/2) exp( −‖z′ − Wx′‖² / (2σ²) ).   (2)

Finally, the binary label y ∈ {0, 1} is used to detect the coincidence of the variables z, z′ in the latent space. In particular, y follows a Bernoulli distribution with mean value:

P(y = 1|z, z′) = exp( −‖z − z′‖² / (2κ²) ).   (3)

Eq. (3) states that y = 1 with certainty if z and z′ coincide at the exact same point in the latent space; otherwise, the probability in eq. (3) falls off exponentially with their squared distance. The length scale κ in eq. (3) governs the rate of this exponential decay.
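For concreteness, the generative process in eqs. (1)–(3) can be sketched in a few lines of numpy (an illustrative sketch; the function name and interface are ours, not part of the model specification):

```python
import numpy as np

def sample_pair(x, x_prime, W, sigma, kappa, rng=None):
    """Sample (z, z', y) for one input pair under eqs. (1)-(3)."""
    rng = np.random.default_rng() if rng is None else rng
    p = W.shape[0]
    # Eqs. (1)-(2): z and z' are noisy projections of x and x' through the same W.
    z = W @ x + sigma * rng.standard_normal(p)
    z_prime = W @ x_prime + sigma * rng.standard_normal(p)
    # Eq. (3): y = 1 with probability exp(-||z - z'||^2 / (2 kappa^2)).
    y = rng.binomial(1, np.exp(-np.sum((z - z_prime) ** 2) / (2 * kappa ** 2)))
    return z, z_prime, y
```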

2.1 Inference

Inference in this model requires averaging over the Gaussian latent variables z, z′. The required integrals take the form of simple Gaussian convolutions. For example:

P(y=1|x, x′) = ∫ dz dz′ P(y=1|z, z′) P(z|x) P(z′|x′)   (4)
             = ( κ²/(κ²+2σ²) )^(p/2) exp( −‖W(x−x′)‖² / (2(κ²+2σ²)) ).   (5)

Note that this marginal probability is invariant to uniform re-scalings of the model parameters W, σ, and κ; we will return to this observation later. For inputs (x, x′), we denote the relative likelihood, or odds, of the event y=1 by

ν(x, x′) = P(y=1|x, x′) / P(y=0|x, x′).   (6)

As we shall see, the odds appear in the calculations for many useful forms of inference. Note that the odds ν(x, x′) has a complicated nonlinear dependence on the inputs (x, x′); the numerator in eq. (6) is Gaussian, but the denominator (equal to one minus the numerator) is not.

Of special importance for learning (as discussed in section 2.2) are the statistics of the posterior distribution P(z, z′|x, x′, y). We obtain this distribution using Bayes rule:

P(z, z′|x, x′, y) = P(y|z, z′) P(z|x) P(z′|x′) / P(y|x, x′).   (7)

We note that the prior distribution P(z, z′|x, x′) is multivariate Gaussian, as is the posterior distribution P(z, z′|x, x′, y=1) for positively labeled pairs of examples. However, this is not true of the posterior distribution P(z, z′|x, x′, y=0) for negatively labeled pairs. In this respect, the model differs from classical factor analysis and other canonical models with Gaussian latent variables (e.g., Kalman filters).

Despite the above wrinkle, it remains straightforward to compute the low-order moments of the distribution in eq. (7) for both positively (y=1) and negatively (y=0) labeled pairs¹ of examples. In particular, for the posterior means, we obtain:

E[z|x, x′, y=0] = W[ x − (νσ²/(κ²+2σ²)) (x′ − x) ],   (8)
E[z|x, x′, y=1] = W[ x + (σ²/(κ²+2σ²)) (x′ − x) ],   (9)

where the coefficient ν in eq. (8) is shorthand for the odds ν(x, x′) in eq. (6). Note how the posterior means E[z|x, x′, y] in eqs. (8–9) differ from the prior mean

E[z|x, x′] = Wx.   (10)

Analogous results hold for the prior and posterior means of the latent variable z′. Intuitively, these calculations show that the expected values of z and z′ move toward each other if the observed label indicates a coincidence (y=1) and away from each other if not (y=0).

For learning it is also necessary to compute second-order statistics of the posterior distribution. For the posterior variances, straightforward calculations give:

E[ ‖z − z̄‖² | x, x′, y=0 ] = pσ² ( 1 + νσ²/(κ²+2σ²) ),   (11)
E[ ‖z − z̄‖² | x, x′, y=1 ] = pσ² ( 1 − σ²/(κ²+2σ²) ),   (12)

¹ For the latter, the statistics can be expressed as the differences of Gaussian integrals.

where z̄ in these expressions denotes the posterior means in eqs. (8–9), and again the coefficient ν is shorthand for the odds ν(x, x′) in eq. (6). Note how the posterior variances in eqs. (11–12) differ from the prior variance

E[ ‖z − Wx‖² | x, x′ ] = pσ².   (13)

Intuitively, we see that the posterior variance shrinks if the observed label indicates a coincidence (y = 1) and grows if not (y = 0). The expressions for the posterior variance of the latent variable z′ are identical due to the model's symmetry.
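For concreteness, these closed-form statistics can be evaluated directly; the sketch below (illustrative code with a hypothetical interface) computes the marginal of eq. (5), the odds of eq. (6), and the posterior mean and variance of z from eqs. (8)–(9) and (11)–(12) for a single labeled pair:

```python
import numpy as np

def posterior_stats(x, x_prime, y, W, sigma2, kappa2):
    """Marginal, odds, posterior mean and variance of z for one labeled pair."""
    p = W.shape[0]
    diff = W @ (x_prime - x)
    denom = kappa2 + 2.0 * sigma2
    # Eq. (5): marginal probability of a coincidence.
    p_coincide = (kappa2 / denom) ** (p / 2.0) * np.exp(-np.sum(diff ** 2) / (2.0 * denom))
    # Eq. (6): odds of the event y = 1.
    nu = p_coincide / (1.0 - p_coincide)
    if y == 1:
        z_bar = W @ x + (sigma2 / denom) * diff           # eq. (9): z moves toward z'
        var_z = p * sigma2 * (1.0 - sigma2 / denom)       # eq. (12): variance shrinks
    else:
        z_bar = W @ x - (nu * sigma2 / denom) * diff      # eq. (8): z moves away from z'
        var_z = p * sigma2 * (1.0 + nu * sigma2 / denom)  # eq. (11): variance grows
    return p_coincide, nu, z_bar, var_z
```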

2.2 Learning

Next we consider how to learn the linear projection W, the noise level σ², and the length scale κ² from data. We assume that the data comes in the form of paired inputs x, x′, together with binary judgments y ∈ {0, 1} of similarity or difference. In particular, from a training set {(x_i, x′_i, y_i)}_{i=1}^N of N such examples, we wish to learn the parameters that maximize the conditional log-likelihood

L(W, σ², κ²) = Σ_{i=1}^N log P(y_i|x_i, x′_i)   (14)

of observed coincidences (y_i = 1) and non-coincidences (y_i = 0). We say that the data is incomplete or partially observed in the sense that the examples do not specify target values for the latent variables z, z′; instead, such target values must be inferred from the model's posterior distribution.

Given values for the model parameters W, σ², and κ², we can compute the right hand side of eq. (14) from the result in eq. (5). However, the parameters that maximize eq. (14) cannot be computed in closed form. In the absence of an analytical solution, we avail ourselves of the EM algorithm, an iterative procedure for maximum likelihood estimation in latent variable models [10]. The EM algorithm consists of two steps: an E-step, which computes statistics of the posterior distribution in eq. (7), and an M-step, which uses these statistics to re-estimate the model parameters. The two steps are iterated until convergence.

Intuitively, the EM algorithm uses the posterior means in eqs. (8–9) to “fill in” the missing values of the latent variables z, z′. As shorthand, let

z̄_i = E[z|x_i, x′_i, y_i],   (15)
z̄′_i = E[z′|x_i, x′_i, y_i]   (16)

denote these posterior means for the ith example in the training set, as computed from the results in eqs. (8–9). The M-step of the EM algorithm updates the linear transformation W by minimizing the sum of squared errors

E(W) = (1/2) Σ_{i=1}^N [ ‖z̄_i − W x_i‖² + ‖z̄′_i − W x′_i‖² ],   (17)

where the expected values z̄_i, z̄′_i are computed with respect to the current model parameters (and thus treated as constants in the above minimization). Minimizing the sum of squared errors in eq. (17) gives the update rule:

W ← [ Σ_{i=1}^N ( z̄_i x_iᵀ + z̄′_i x′_iᵀ ) ] [ Σ_{i=1}^N ( x_i x_iᵀ + x′_i x′_iᵀ ) ]⁻¹,   (18)

where the terms z̄_i x_iᵀ and x_i x_iᵀ denote outer products of column vectors. The EM update for the noise level σ² takes an equally simple form. As shorthand, let

ε_i² = E[ ‖z − z̄‖² | x_i, x′_i, y_i ]   (19)

denote the posterior variance of z for the ith example in the training set, as computed from the results in eqs. (11–12).

                 MNIST    BBC  Classic4  Isolet  Letters    Seg    Bal   Iris
# train          60000   1558      4257    6238    14000    210    438    105
# test           10000    667      1419    1559     6000   2100    187     45
# classes           10      5         4      26       26      7      3      3
# features (D)     784   9635      5896     617       16     19      4      4
# inputs (d)       164    200       200     172       16     19      4      4
LCA dim (p)         40      4        32      40       16     18      4      4
Euclidean         2.83  30.10      7.89    8.98     4.73  13.71  18.82   5.33
PCA               2.12  11.90      9.74    8.60     4.75  13.71  18.07   5.33
LMNN              1.34   3.40      3.19    3.40     2.58   8.57   8.98   5.33
LCA               1.61   3.41      3.54    3.72     2.93   8.57   4.06   2.22

Table 1: Summary of data sets for multiway classification. For efficiency we projected data sets of high dimensionality D down to their leading d principal components. MNIST [11] is a data set of handwritten digits; we deslanted the images to reduce variability. BBC [12] and Classic4 [13] are text corpora with labeled topics. The last five data sets are from the UCI repository [14]. The bottom four rows compare test error percentage using 3-NN classification. For data sets without dedicated test sets, we averaged across multiple random 70/30 splits.

Then the EM update for the noise level is given by:

σ² ← (1/(pN)) [ min_W E(W) + Σ_{i=1}^N ε_i² ].   (20)

The minimum of E(W) in this update is computed by substituting the right hand side of eq. (18) into eq. (17). The EM updates for W and σ² have the desirable property that they converge monotonically to a stationary point of the log-likelihood: that is, at each iteration, they are guaranteed to increase the right hand side of eq. (14) except at points in the parameter space with vanishing gradient. A full derivation of the EM algorithm is omitted for brevity.

We have already noted that the log-likelihood in eq. (14) is invariant to uniform rescaling of the model parameters W, σ, and κ. Thus without loss of generality we can set κ² = 1 in the simplest setting of the model, as described above. It does become necessary to estimate the parameter κ², however, in slightly extended formulations of the model, as we consider in section 3.2. Unlike the parameters W and σ², the parameter κ² does not have a simple update rule for its re-estimation by EM. When necessary, however, this parameter can be re-estimated by a simple line search. This approach also preserves the property of monotonic convergence.
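Putting these pieces together, one EM iteration can be sketched as follows (an illustrative numpy sketch with κ² = 1; the function name, array layout, and absence of numerical safeguards are our own simplifications):

```python
import numpy as np

def em_step(X, X_prime, y, W, sigma2, kappa2=1.0):
    """One EM iteration for LCA on N labeled pairs (rows of X, X_prime; y in {0,1}^N)."""
    N, _ = X.shape
    p = W.shape[0]
    diff = (X_prime - X) @ W.T                      # row i holds W(x'_i - x_i)
    denom = kappa2 + 2.0 * sigma2
    prob1 = (kappa2 / denom) ** (p / 2.0) * np.exp(-np.sum(diff ** 2, axis=1) / (2.0 * denom))
    nu = prob1 / (1.0 - prob1)                      # odds, eq. (6)
    # E-step: posterior means (eqs. 8-9, 15-16) and variances (eqs. 11-12, 19).
    coeff = np.where(y == 1, sigma2 / denom, -nu * sigma2 / denom)
    Zbar = X @ W.T + coeff[:, None] * diff
    Zbar_prime = X_prime @ W.T - coeff[:, None] * diff
    eps2 = p * sigma2 * np.where(y == 1, 1.0 - sigma2 / denom, 1.0 + nu * sigma2 / denom)
    # M-step for W: the unconstrained least-squares solution of eq. (18).
    A = Zbar.T @ X + Zbar_prime.T @ X_prime
    B = X.T @ X + X_prime.T @ X_prime
    W_new = A @ np.linalg.inv(B)
    # M-step for sigma^2: eq. (20), with E(W_new) the minimized squared error of eq. (17).
    E_min = 0.5 * (np.sum((Zbar - X @ W_new.T) ** 2) +
                   np.sum((Zbar_prime - X_prime @ W_new.T) ** 2))
    sigma2_new = (E_min + np.sum(eps2)) / (p * N)
    return W_new, sigma2_new
```

Note that the update for W is just an unconstrained least-squares solve, which is what keeps each iteration inexpensive.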

3 Applications

We explore two applications of LCA in which its linear transformation is used to preprocess the data for different models of multiway classification. We assume that the original data consists of labeled examples {(x_i, c_i)} of inputs x_i ∈ ℝ^d and their class labels c_i ∈ {1, 2, . . . , c}. For each application, we show how to instantiate LCA by creating a particular data set of labeled pairs, where the labels indicate whether the examples in each pair should be mapped closer together (y=1) or farther apart (y=0) in LCA's latent space of dimensionality p ≤ d. In the first application, we use LCA to improve a parametric model of classification; in the second application, a nonparametric one. The data sets in our experiments are summarized in Table 1.

3.1 Gaussian mixture modeling

Gaussian mixture models (GMMs) offer perhaps the simplest parametric model of multiway classification. In the most straightforward application of GMMs, the labeled examples in each class c are modeled by a single multivariate Gaussian distribution with mean μ_c and covariance matrix Σ_c. Classification is also simple: for each unlabeled example, we use Bayes rule to compute the class with the highest posterior probability.

Even in these simplest of GMMs, however, challenges arise when the data is very high dimensional. In this case, it may be prohibitively expensive to estimate or store the covariance matrix for each class of the data. Two simple options are: (i) to reduce the input's dimensionality using principal component analysis (PCA) or linear discriminant analysis (LDA) [15], or (ii) to model each multivariate Gaussian distribution using factor analysis. In the latter, we learn distributions of the form:

P(x|c) ∼ N( μ_c, Ψ_c + Λ_c Λ_cᵀ ),   (21)

where the diagonal matrix Ψ_c ∈ ℝ^{d×d} and loading matrix Λ_c ∈ ℝ^{d×p} are the model parameters of the factor analyzer belonging to the cth class. Factor analysis can be formulated as a latent variable model, and its parameters estimated by an EM algorithm [1].

GMMs are generative models trained by maximum likelihood estimation. In this section, we explore how LCA may yield classifiers of similar form but higher accuracy. To do so, we learn one model of LCA for each class of the data. In particular, we use LCA to project each example x_i into a lower dimensional space where we hope for two properties: (i) that it is closer to the mean of projected examples from the same class y_i, and (ii) that it is farther from the means of projected examples from other classes c ≠ y_i.

More specifically, we instantiate the model of LCA for each class as follows. Let μ_c ∈ ℝ^d denote the mean of the labeled examples in class c. Then we create a training set of labeled pairs {(μ_c, x_i, y_ic)} over all examples x_i, where y_ic = 1 if y_i = c and y_ic = 0 if y_i ≠ c. From this training set, we use the EM algorithm in section 2.2 to learn a (class-specific) linear projection W_c and variance σ_c². Finally, to classify an unlabeled example x, we compute the probabilities:

P(y_c = 1|x) = ( 1/(1+2σ_c²) )^(p/2) exp( −‖W_c(x−μ_c)‖² / (2(1+2σ_c²)) ).   (22)

We label the example x by the class c that maximizes the probability in eq. (22). As we shall see, this decision rule for LCA often makes different predictions than Bayes rule in maximum likelihood GMMs. Conveniently, we can train the LCA models for different classes in parallel.

We evaluated the performance of LCA in this setting on the first four data sets in Table 1. Over a range of reduced dimensionalities p < d, we compared the classification accuracy of three approaches: (i) GMMs with full covariance matrices after projecting the data down to p dimensions with PCA or LDA, (ii) GMMs with p-dimensional factor analyzers, and (iii) p-dimensional models of LCA. Fig. 2 shows that LCA generally outperforms these other methods; moreover, its largest gains occur in the regime of very aggressive dimensionality reduction p ≪ d. To highlight the results in this regime, Fig. 3 contrasts the p=2 dimensional representations of the data discovered by PCA and LCA. Here it is visually apparent that LCA leads to much better separation of the examples in different classes.
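A minimal sketch of this per-class construction and the decision rule in eq. (22) follows (illustrative code; `train_lca` stands in for the EM procedure of section 2.2, and its exact interface here is hypothetical):

```python
import numpy as np

def fit_lca_classifier(X, labels, classes, train_lca, p):
    """Train one LCA model per class from pairs (mu_c, x_i) with y_ic = [y_i == c]."""
    models = {}
    for c in classes:
        mu_c = X[labels == c].mean(axis=0)
        anchors = np.tile(mu_c, (len(X), 1))         # every pair shares the class mean
        y = (labels == c).astype(int)                # y_ic = 1 iff x_i belongs to class c
        W_c, sigma2_c = train_lca(anchors, X, y, p)  # EM of section 2.2 with kappa^2 = 1
        models[c] = (mu_c, W_c, sigma2_c)
    return models

def classify(x, models):
    """Label x by the class c that maximizes eq. (22)."""
    scores = {}
    for c, (mu_c, W_c, sigma2_c) in models.items():
        p = W_c.shape[0]
        scale = 1.0 + 2.0 * sigma2_c
        scores[c] = (1.0 / scale) ** (p / 2.0) * \
            np.exp(-np.sum((W_c @ (x - mu_c)) ** 2) / (2.0 * scale))
    return max(scores, key=scores.get)
```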

3.2 Distance metric learning

We can also apply LCA to learn a distance metric that improves kNN classification [4, 5, 6]. Our approach draws heavily on the ideas of LMNN [5], though it differs in its execution. In LMNN, each training example has k target neighbors, typically chosen as the k nearest neighbors in Euclidean space with the same class label. LMNN learns a metric to shrink the distances between examples and target neighbors while preserving (or increasing) the distances between examples from different classes. Errors in kNN classification tend to occur when differently labeled examples are closer together than pairs of target neighbors. Thus LMNN seeks to minimize the number of differently labeled examples that invade the perimeters established by target neighbors. These examples are known as impostors.

In LCA, we can view the matrix WᵀW as a Mahalanobis distance metric for kNN classification. The starting point of LCA is to create a training set of pairs of examples.


Figure 2: Comparison of dimensionality reduction by principal components analysis (PCA), factor analysis (FA), linear discriminant analysis (LDA), and latent coincidence analysis (LCA). The plots show test set error versus dimensionality p.


Figure 3: Comparison of two-dimensional (p=2) representations of data discovered by PCA and LCA. The examples are color-coded by class label.

Among these pairs, we wish the similarly labeled examples to coincide (y = 1) and the differently labeled examples to diverge (y = 0). For the former, it is natural to choose all pairs of examples and their target neighbors. For the latter, it is natural to choose all pairs of differently labeled examples. Concretely, if there are c classes, each with m examples, then this approach creates a training set of ckm pairs of similarly labeled examples (with y = 1) and c(c−1)m² pairs of differently labeled examples (with y = 0). Unfortunately it is clear that this approach does not scale well with the number of examples.

We therefore adopt two pruning strategies in our implementation of LCA. First, we do not include training examples without impostors. Second, among the pairs of differently labeled examples, we only include each example with its current or previous impostors. A complication of this approach is that every few iterations we must check to see if any example has new impostors. If so, we add the example, its target neighbors, and impostors into the training set. This strategy was used in all the experiments described below. Our short-cuts are similar in spirit to the optimizations in LMNN [5] as well as more general cutting-plane strategies of constrained optimization [16].

The use of LCA for kNN classification also benefits from a slight but crucial extension of the model in Fig. 1. Recall that the parameter κ² determines the length scale at which projected examples are judged to coincide in the latent space. For kNN classification, we extend the model in Fig. 1 to learn a local parameter κ² for each input in the training set. These local parameters κ² are needed to account for the fact that different inputs may reside at very different distances from their target neighbors. In the graphical model of Fig. 1, this extension amounts to drawing an additional plate that encloses the parameter κ² and the model's random variables, but not the parameters W and σ².

Note that the σ² and κ² parameters of LCA, though important to estimate, are not ultimately used for kNN classification. In particular, after a model is trained, we simply perform kNN classification using the Mahalanobis distance metric parameterized by the linear transformation W.
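A minimal sketch of the pair construction described above (illustrative code; it selects target neighbors and impostors by Euclidean distance and omits the periodic re-checking for new impostors during training):

```python
import numpy as np

def build_pairs(X, labels, k=3):
    """Pairs for LCA metric learning: target neighbors (y=1) and impostors (y=0)."""
    pairs, ys = [], []
    dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    np.fill_diagonal(dists, np.inf)
    for i in range(len(X)):
        same = np.where(labels == labels[i])[0]
        same = same[same != i]
        targets = same[np.argsort(dists[i, same])[:k]]   # k nearest same-class neighbors
        if len(targets) == 0:
            continue
        radius = dists[i, targets].max()
        impostors = np.where((labels != labels[i]) & (dists[i] <= radius))[0]
        if len(impostors) == 0:
            continue                                     # prune examples without impostors
        for j in targets:
            pairs.append((i, j))
            ys.append(1)
        for j in impostors:
            pairs.append((i, j))
            ys.append(0)
    return np.array(pairs), np.array(ys)
```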


Figure 4: Comparison of dimensionality reduction by principal components analysis (PCA) and latent coincidence analysis (LCA). The plots show kNN classification error (training dotted, test solid) versus dimensionality p.


Figure 5: Comparison of kNN classification by PCA, LMNN, and LCA. We set k = 3 for all experiments. Training error is computed using leave-one-out cross-validation. The values of p used for LCA are given in Table 1.

For comparison, we measure kNN classification accuracy using Euclidean distance, PCA, and LMNN. We report all three along with LCA in Table 1, but we focus on PCA in Fig. 4 to illustrate the effect of dimensionality reduction. LCA consistently outperforms PCA across all dimensionalities. Additionally, we hold out a validation set to search for an optimal dimensionality p. In Fig. 5, we compare LCA to PCA and LMNN. Again, LCA is clearly superior to PCA and generally achieves comparable performance to LMNN. Advantageously, we often obtain our best result with LCA using a lower dimensionality p < d.
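Once W has been learned, classification reduces to ordinary kNN after projecting the inputs by W; a minimal sketch (illustrative code; class labels are assumed to be integers 0, …, c−1):

```python
import numpy as np

def knn_predict(X_train, y_train, X_test, W, k=3):
    """kNN classification under the metric W^T W, i.e. Euclidean distance after projecting by W."""
    Z_train, Z_test = X_train @ W.T, X_test @ W.T
    preds = []
    for z in Z_test:
        nearest = np.argsort(np.sum((Z_train - z) ** 2, axis=1))[:k]
        preds.append(np.bincount(y_train[nearest]).argmax())  # majority vote
    return np.array(preds)
```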

4 Discussion

In this paper we have introduced latent coincidence analysis (LCA), a latent variable model for learning linear projections that map similar inputs closer together and dissimilar inputs farther apart. Inference in LCA is entirely tractable, and we use an EM algorithm to learn maximum likelihood estimates of the model parameters. Our approach values simplicity, but not at the expense of efficacy: on problems in mixture modeling and distance metric learning, LCA performs competitively across a range of reduced dimensionalities.

There are many directions for future work. One challenge that we observed was the slow convergence of the EM algorithm, an issue that may be ameliorated by the gradient or second-order methods proposed in [17]. To handle larger data sets, we plan to explore online strategies for distance metric learning [18], possibly based on Bayesian [19] or confidence-weighted updates [20]. Finally, we will explore hybrid strategies between the mixture modeling in section 3.1 and the kNN classification in section 3.2, where multiple (but not all) examples in each class are used as “anchors” for distance-based classification. All these directions should open the door to implementations on larger scales [21] than we have considered here.

References

[1] D. B. Rubin and D. T. Thayer. EM algorithms for ML factor analysis. Psychometrika, 47:69–76, 1982.

[2] G. McLachlan and K. Basford. Mixture Models: Inference and Applications to Clustering. Marcel Dekker, 1988.

[3] T. Cover and P. Hart. Nearest neighbor pattern classification. IEEE Transactions on Information Theory, IT-13:21–27, 1967.

[4] J. Goldberger, S. Roweis, G. Hinton, and R. Salakhutdinov. Neighbourhood components analysis. In L. K. Saul, Y. Weiss, and L. Bottou, editors, Advances in Neural Information Processing Systems 17, pages 513–520, Cambridge, MA, 2005. MIT Press.

[5] K. Q. Weinberger and L. K. Saul. Distance metric learning for large margin nearest neighbor classification. Journal of Machine Learning Research, 10:207–244, 2009.

[6] J. V. Davis, B. Kulis, P. Jain, S. Sra, and I. S. Dhillon. Information-theoretic metric learning. In ICML, pages 209–216, Corvallis, Oregon, USA, June 2007.

[7] G. Hinton and S. Roweis. Stochastic neighbor embedding. In S. Becker, S. Thrun, and K. Obermayer, editors, Advances in Neural Information Processing Systems 15, pages 833–840. MIT Press, Cambridge, MA, 2003.

[8] C. Cortes and V. Vapnik. Support-vector networks. Machine Learning, 20:273–297, 1995.

[9] B. Kulis, M. A. Sustik, and I. S. Dhillon. Learning low-rank kernel matrices. In Proceedings of the Twenty-Third International Conference on Machine Learning (ICML-06), 2006.

[10] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society B, 39:1–37, 1977.

[11] http://yann.lecun.com/exdb/mnist/.

[12] http://mlg.ucd.ie/datasets/bbc.html.

[13] http://www.dataminingresearch.com/index.php/2010/09/classic3-classic4-datasets/.

[14] http://archive.ics.uci.edu/ml/datasets.html.

[15] R. A. Fisher. The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7(2):179–188, 1936.

[16] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

[17] R. Salakhutdinov, S. T. Roweis, and Z. Ghahramani. On the convergence of bound optimization algorithms. In Proceedings of the Nineteenth Conference on Uncertainty in Artificial Intelligence (UAI-03), pages 509–516, 2003.

[18] S. Shalev-Shwartz, Y. Singer, and A. Y. Ng. Online and batch learning of pseudo-metrics. In Proceedings of the Twenty-First International Conference on Machine Learning (ICML-04), pages 94–101, Banff, Canada, 2004.

[19] T. Jaakkola and M. Jordan. A variational approach to Bayesian models and their extensions. In Proceedings of the Sixth International Workshop on Artificial Intelligence and Statistics, 1997.

[20] M. Dredze, K. Crammer, and F. Pereira. Confidence-weighted linear classification. In A. McCallum and S. Roweis, editors, Proceedings of the 25th Annual International Conference on Machine Learning (ICML 2008), pages 264–271. Omnipress, 2008.

[21] G. Chechik, U. Shalit, V. Sharma, and S. Bengio. An online algorithm for large scale image similarity learning. In Y. Bengio, D. Schuurmans, J. Lafferty, C. K. I. Williams, and A. Culotta, editors, Advances in Neural Information Processing Systems 22, pages 306–314, 2009.
