Low-Rank Similarity Metric Learning in High Dimensions
Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence

Wei Liu†  Cun Mu‡  Rongrong Ji♮  Shiqian Ma§  John R. Smith†  Shih-Fu Chang‡
†IBM T. J. Watson Research Center  ‡Columbia University  ♮Xiamen University  §The Chinese University of Hong Kong

Abstract

Metric learning has become a widely used tool in machine learning. To reduce the expensive costs brought by increasing dimensionality, low-rank metric learning arises as it can be more economical in storage and computation. However, existing low-rank metric learning algorithms usually adopt nonconvex objectives, and are hence sensitive to the choice of a heuristic low-rank basis. In this paper, we propose a novel low-rank metric learning algorithm to yield bilinear similarity functions. This algorithm scales linearly with input dimensionality in both space and time, and is therefore applicable to high-dimensional data domains. A convex objective free of heuristics is formulated by leveraging trace norm regularization to promote low-rankness. Crucially, we prove that all globally optimal metric solutions must retain a certain low-rank structure, which enables our algorithm to decompose the high-dimensional learning task into two steps: an SVD-based projection and a metric learning problem with reduced dimensionality. The latter step can be tackled efficiently by employing a linearized Alternating Direction Method of Multipliers. The efficacy of the proposed algorithm is demonstrated through experiments performed on four benchmark datasets with tens of thousands of dimensions.

Introduction

During the past decade, distance metric learning, typically referring to learning a Mahalanobis distance metric, has become ubiquitous in a variety of applications stemming from machine learning, data mining, information retrieval, and computer vision. Many fundamental machine learning algorithms such as K-means clustering and k-nearest neighbor classification need an appropriate distance function to measure pairwise affinity (or closeness) between data examples. Beyond the naive Euclidean distance, the distance metric learning problem intends to seek a better metric through learning with a training dataset subject to particular constraints arising from fully supervised or semi-supervised information, such that the learned metric can reflect domain-specific characteristics of data affinities or relationships. A vast number of distance metric learning approaches have been proposed, and the representatives include supervised metric learning, which is the main focus in the literature, semi-supervised metric learning (Wu et al. 2009; Hoi, Liu, and Chang 2010; Liu et al. 2010a; 2010b; Niu et al. 2012), etc. The suite of supervised metric learning methods can further be divided into three categories in terms of the form of supervision: methods supervised by instance-level labels (Goldberger, Roweis, and Salakhutdinov 2004; Globerson and Roweis 2005; Weinberger and Saul 2009), methods supervised by pair-level labels (Xing et al. 2002; Bar-Hillel et al. 2005; Davis et al. 2007; Ying and Li 2012), and methods supervised by triplet-level ranks (Ying, Huang, and Campbell 2009; McFee and Lanckriet 2010; Shen et al. 2012; Lim, McFee, and Lanckriet 2013). More methods can also be found in the two surveys (Kulis 2012) and (Bellet, Habrard, and Sebban 2013).

Despite the large amount of literature, relatively little work has dealt with the problem where the dimensionality of the input data can be extremely high. This setting is relevant to a wide variety of practical data collections, e.g., images, videos, documents, time series, genomes, and so on, which frequently involve tens to hundreds of thousands of dimensions. Unfortunately, learning full-rank metrics in a high-dimensional input space $\mathbb{R}^d$ quickly becomes computationally prohibitive, with complexities ranging from $O(d^2)$ to $O(d^{6.5})$. Additionally, limited memory makes storing a huge metric matrix $M \in \mathbb{R}^{d \times d}$ a heavy burden.

This paper concentrates on a more reasonable learning paradigm, Low-Rank Metric Learning, also known as structural metric learning. (Davis and Dhillon 2008) demonstrated that in the small sample setting $n \ll d$ ($n$ is the number of training samples), learning a low-rank metric in a high-dimensional input space is tractable, achieving linear space and time complexities with respect to the dimensionality $d$. We follow this learning setting $n \ll d$ throughout the paper. Given a rank $r \ll d$, a distance metric can be expressed in a low-rank form $M = LL^\top$, parameterized by a rectangular matrix $L \in \mathbb{R}^{d \times r}$. $L$ constitutes a low-rank basis in $\mathbb{R}^d$ and also acts as a dimensionality-reducing linear transformation, since any distance between two inputs $x$ and $x'$ under such a low-rank metric $M$ can be equivalently viewed as a Euclidean distance between the transformed inputs $L^\top x$ and $L^\top x'$. Therefore, the use of low-rank metrics allows reduced storage of metric matrices (only $L$ needs to be saved) and simultaneously enables efficient distance computation ($O(dr)$ time). Despite the benefits of using low-rank metrics, most existing methods adopt nonconvex optimization objectives, e.g., (Goldberger, Roweis, and Salakhutdinov 2004; Torresani and Lee 2006; Mensink et al. 2013; Lim and Lanckriet 2014), and are thus confined to the subspace spanned by a heuristic low-rank basis $L_0$; even the convex method of (Davis and Dhillon 2008) suffers from this restriction. In terms of training costs, most low-rank metric learning approaches, including (Goldberger, Roweis, and Salakhutdinov 2004; Torresani and Lee 2006; McFee and Lanckriet 2010; Shen et al. 2012; Kunapuli and Shavlik 2012; Lim, McFee, and Lanckriet 2013), still incur quadratic or cubic time complexities in the dimensionality $d$. A few methods such as (Davis and Dhillon 2008) and (Mensink et al. 2013) are known to provide efficient $O(d)$ procedures, but they both rely on, and are sensitive to, an initialization of the heuristic low-rank basis $L_0$.
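As a minimal numerical sketch of the $O(dr)$ claim above (the dimensions $d = 5000$, $r = 50$ and the random basis $L$ are arbitrary assumptions introduced here for illustration), the following snippet verifies that a distance under $M = LL^\top$ can be computed entirely through the $r$-dimensional projections $L^\top x$, without ever forming the $d \times d$ matrix $M$:

```python
import numpy as np

d, r = 5000, 50                          # assumed dimensionality and rank for illustration
rng = np.random.default_rng(0)

L = rng.standard_normal((d, r))          # low-rank basis: the only object that must be stored
x = rng.standard_normal(d)
x_prime = rng.standard_normal(d)

# Low-rank route: (x - x')^T M (x - x') = ||L^T x - L^T x'||^2, an O(dr) operation.
diff = L.T @ (x - x_prime)               # r-dimensional transformed difference
dist_lowrank = diff @ diff

# Naive route: form M explicitly, which costs O(d^2) storage
# (~200 MB at float64 for d = 5000; ~80 GB for d = 10^5) and O(d^2) time per distance.
M = L @ L.T
delta = x - x_prime
dist_full = delta @ M @ delta

print(np.allclose(dist_lowrank, dist_full))   # True: both routes give the same distance
```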
This work intends to pursue a convex low-rank metric learning approach without resorting to any heuristics. Recently, a trend in similarity learning has emerged (Kar and Jain 2011; 2012; Chechik et al. 2010; Crammer and Chechik 2012; Cheng 2013), which attempts to learn similarity functions of the form $S(x, x') = x^\top M x'$ and appears to be more flexible than distance metric learning. In light of this trend, we incorporate the benefits of both low-rank metrics and similarity functions to learn a low-rank metric $M$ that yields a discriminative similarity function $S(\cdot)$. Specifically, we propose a novel low-rank metric learning algorithm by designing a convex objective which consists of a hinge loss enforced over pairwise supervised information and a trace norm regularization penalty promoting low-rankness of the desired metric. Significantly, we show that minimizing such an objective is guaranteed to yield an optimal solution with a certain low-rank structure, $M^\star = U W^\star U^\top$. The low-rank basis $U$ included in any optimal metric $M^\star$ turns out to lie in the column (or range) space of the input data matrix $X \in \mathbb{R}^{d \times n}$, and $W^\star$ can thus be sought within a fixed subspace whose dimensionality is much lower than the original dimensionality.

By virtue of this key finding, our algorithm intelligently decomposes the challenging high-dimensional learning task into two steps which are both computationally tractable and cost a total of $O(d)$ time. The first step is dimensionality reduction via an SVD-based projection, and the second step is lower-dimensional metric learning by applying a linearized modification of the Alternating Direction Method of Multipliers (ADMM) (Boyd et al. 2011). The linearized ADMM maintains analytic-form updates across all iterations, works efficiently, and converges rapidly.
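The sketch below illustrates only the first of these two steps under simplifying assumptions: the data are synthetic, and the reduced-dimensionality metric learner is left as a dummy placeholder (`learn_metric_reduced` is a hypothetical name; the actual second step is the linearized ADMM developed later in the paper).

```python
import numpy as np

def svd_projection(X, tol=1e-10):
    """Step 1: project d-dimensional data onto the column space of X (d x n, n << d).

    The thin SVD X = U S V^T gives an orthonormal basis U for the range of X;
    per the paper's structural result, an optimal metric can be written as
    M* = U W* U^T, so the metric only needs to be learned over the projected data.
    """
    U, s, _ = np.linalg.svd(X, full_matrices=False)
    m = int((s > tol * s[0]).sum())        # numerical rank of X
    U = U[:, :m]                           # d x m orthonormal basis, m <= n
    return U, U.T @ X                      # projected data is only m x n

def learn_metric_reduced(X_reduced, pairs, labels):
    """Step 2 (placeholder): learn the small m x m matrix W on the projected data.

    The real step minimizes the convex hinge-loss + trace-norm objective with a
    linearized ADMM; an identity matrix stands in here purely for illustration.
    """
    return np.eye(X_reduced.shape[0])

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, n = 10000, 100                      # assumed sizes with n << d
    X = rng.standard_normal((d, n))
    U, X_red = svd_projection(X)
    W = learn_metric_reduced(X_red, pairs=None, labels=None)
    # The high-dimensional metric is recovered implicitly as M = U W U^T;
    # similarities are evaluated as (U^T x)^T W (U^T x') without forming M.
```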
Figure 1: The effect of a desired low-rank similarity metric that induces a linear transformation $L^\top$. Points with the same color denote similar data examples, while points with different colors denote dissimilar data examples.

Learning Framework

Let us inherit the conventional deployment of pairwise supervised information and learn a low-rank metric $M = LL^\top$ for a bilinear similarity function that measures the similarity between two arbitrary inputs $x$ and $x'$ in $\mathbb{R}^d$:
$$
S_M(x, x') \;=\; x^\top M x' \;=\; x^\top L L^\top x' \;=\; \big(L^\top x\big)^\top \big(L^\top x'\big), \tag{1}
$$
which is parameterized by a low-rank basis $L \in \mathbb{R}^{d \times r}$ with rank $r \ll d$. Naturally, such a similarity function characterized by a low-rank metric essentially measures the inner product between the two transformed inputs $L^\top x$ and $L^\top x'$ in a lower-dimensional space $\mathbb{R}^r$.

Motivated by (Kriegel, Schubert, and Zimek 2008), which presented a statistical means based on angles rather than distances to identify outliers scattered in a high-dimensional space, we argue that angles have a stronger discriminating power than distances in capturing affinities among data points in high dimensions. Suppose that we are provided with a set of pairwise labels $y_{ij} \in \{1, -1\}$, where $y_{ij} = 1$ stands for a similar data pair $(x_i, x_j)$ and $y_{ij} = -1$ for a dissimilar pair $(x_i, x_j)$. The purpose of our supervised similarity metric learning is to induce an acute angle between any similar pair and an obtuse angle between any dissimilar pair after applying the learned metric $M$, or equivalently, after mapping data points from $x$ to $L^\top x$. Formally, this purpose is expressed as
$$
S_M(x_i, x_j)
\begin{cases}
\;> 0, & y_{ij} = 1, \\
\;< 0, & y_{ij} = -1.
\end{cases} \tag{2}
$$
Fig. 1 exemplifies our notion of the angles between similar and dissimilar data pairs.
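As a small numerical illustration of Eqs. (1) and (2), assuming a random basis $L$, synthetic inputs, and a unit-margin hinge loss of the form $[1 - y_{ij}\, S_M(x_i, x_j)]_+$ (the section only states that the objective uses a hinge loss over pairwise labels; this particular margin is an assumption), one can evaluate $S_M$ through the transformed inputs and check the sign pattern required by Eq. (2):

```python
import numpy as np

def similarity(L, x, x_prime):
    """Bilinear similarity of Eq. (1): S_M(x, x') = x^T M x' = (L^T x)^T (L^T x')."""
    return (L.T @ x) @ (L.T @ x_prime)

def pairwise_hinge_loss(L, X, pairs, y):
    """Mean of [1 - y_ij * S_M(x_i, x_j)]_+ over labeled pairs (assumed margin of 1)."""
    losses = [max(0.0, 1.0 - y_ij * similarity(L, X[:, i], X[:, j]))
              for (i, j), y_ij in zip(pairs, y)]
    return float(np.mean(losses))

rng = np.random.default_rng(0)
d, r = 1000, 20                            # assumed dimensionality and rank
L = rng.standard_normal((d, r))
X = rng.standard_normal((d, 4))            # four synthetic data points as columns

pairs = [(0, 1), (2, 3)]
y = [1, -1]                                # y_ij = 1: similar pair, y_ij = -1: dissimilar pair
for (i, j), y_ij in zip(pairs, y):
    s = similarity(L, X[:, i], X[:, j])
    print(f"pair {(i, j)}: S_M = {s:.3f}, satisfies Eq. (2): {np.sign(s) == y_ij}")
print("pairwise hinge loss:", pairwise_hinge_loss(L, X, pairs, y))
```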