Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence (AAAI-15)

Low-Rank Similarity Learning in High Dimensions

Wei Liu†  Cun Mu‡  Rongrong Ji♮  Shiqian Ma§  John R. Smith†  Shih-Fu Chang‡
†IBM T. J. Watson Research Center  ‡Columbia University  ♮Xiamen University  §The Chinese University of Hong Kong
{weiliu,jsmith}@us.ibm.com  [email protected]  [email protected]  [email protected]  [email protected]

Abstract

Metric learning has become a widely used tool in machine learning. To reduce the expensive costs brought in by increasing dimensionality, low-rank metric learning arises as it can be more economical in storage and computation. However, existing low-rank metric learning algorithms usually adopt nonconvex objectives, and are hence sensitive to the choice of a heuristic low-rank basis. In this paper, we propose a novel low-rank metric learning algorithm to yield bilinear similarity functions. This algorithm scales linearly with the input dimensionality in both space and time, and is therefore applicable to high-dimensional data domains. A convex objective free of heuristics is formulated by leveraging trace norm regularization to promote low-rankness. Crucially, we prove that all globally optimal metric solutions must retain a certain low-rank structure, which enables our algorithm to decompose the high-dimensional learning task into two steps: an SVD-based projection and a metric learning problem with reduced dimensionality. The latter step can be tackled efficiently through employing a linearized Alternating Direction Method of Multipliers. The efficacy of the proposed algorithm is demonstrated through experiments performed on four benchmark datasets with tens of thousands of dimensions.

Introduction

During the past decade, distance metric learning, typically referring to learning a Mahalanobis distance metric, has become ubiquitous in a variety of applications stemming from machine learning, data mining, information retrieval, and computer vision. Many fundamental machine learning algorithms such as K-means clustering and k-nearest neighbor classification need an appropriate distance function to measure pairwise affinity (or closeness) between data examples. Beyond the naive Euclidean distance, the distance metric learning problem intends to seek a better metric through learning with a training dataset subject to particular constraints arising from fully supervised or semi-supervised information, such that the learned metric can reflect domain-specific characteristics of data affinities or relationships. A vast number of distance metric learning approaches have been proposed, and the representatives include supervised metric learning, which is the main focus in the literature, semi-supervised metric learning (Wu et al. 2009; Hoi, Liu, and Chang 2010; Liu et al. 2010a; 2010b; Niu et al. 2012), etc. The suite of supervised metric learning can further be divided into three categories in terms of the form of supervision: methods supervised by instance-level labels (Goldberger, Roweis, and Salakhutdinov 2004; Globerson and Roweis 2005; Weinberger and Saul 2009), methods supervised by pair-level labels (Xing et al. 2002; Bar-Hillel et al. 2005; Davis et al. 2007; Ying and Li 2012), and methods supervised by triplet-level ranks (Ying, Huang, and Campbell 2009; McFee and Lanckriet 2010; Shen et al. 2012; Lim, McFee, and Lanckriet 2013). More methods can be found in two surveys (Kulis 2012) and (Bellet, Habrard, and Sebban 2013).

Despite the large amount of literature, relatively little work has dealt with the problem where the dimensionality of the input data can be extremely high. This is relevant to a wide variety of practical data collections, e.g., images, videos, documents, time-series, genomes, and so on, which frequently involve tens to hundreds of thousands of dimensions. Unfortunately, learning full-rank metrics in a high-dimensional input space R^d quickly becomes computationally prohibitive due to computational complexities ranging from O(d^2) to O(d^6.5). Additionally, limited memory makes storing a huge metric matrix M ∈ R^{d×d} a heavy burden.

This paper concentrates on a more reasonable learning paradigm, Low-Rank Metric Learning, also known as structural metric learning. (Davis and Dhillon 2008) demonstrated that in the small sample setting n ≪ d (n is the number of training samples) learning a low-rank metric in a high-dimensional input space is tractable, achieving linear space and time complexities with respect to the dimensionality d. We follow this learning setting n ≪ d throughout the paper. Given a rank r ≪ d, a distance metric can be expressed in a low-rank form M = LL^⊤, which is parameterized by a rectangular matrix L ∈ R^{d×r}. L constitutes a low-rank basis in R^d, and also acts as a dimensionality-reducing linear transformation, since any distance between two inputs x and x′ under such a low-rank metric M can be equivalently viewed as a Euclidean distance between the transformed inputs L^⊤x and L^⊤x′. Therefore, the use of low-rank metrics allows reduced storage of metric matrices (only saving L) and simultaneously enables efficient distance computation (O(dr) time).
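As a concrete illustration of this equivalence (not from the original paper), the following Python sketch verifies numerically that the distance induced by a hypothetical low-rank metric M = LL^⊤ equals the Euclidean distance between the transformed inputs, in O(dr) time and without ever forming the d × d matrix M; all names and sizes are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 10_000, 20                       # high input dimension, small rank (hypothetical)
L = rng.standard_normal((d, r))         # low-rank basis parameterizing M = L @ L.T
x, x_prime = rng.standard_normal(d), rng.standard_normal(d)

diff = x - x_prime
# Mahalanobis-style distance under M = L L^T, evaluated without forming M: O(dr) work.
dist_metric = np.sqrt((diff @ L) @ (L.T @ diff))
# Equivalent view: plain Euclidean distance between the transformed inputs L^T x and L^T x'.
dist_euclidean = np.linalg.norm(L.T @ x - L.T @ x_prime)

assert np.isclose(dist_metric, dist_euclidean)
```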

Despite the benefits of using low-rank metrics, most existing methods adopt nonconvex optimization objectives, e.g., (Goldberger, Roweis, and Salakhutdinov 2004; Torresani and Lee 2006; Mensink et al. 2013; Lim and Lanckriet 2014), and are thus confined to the subspace spanned by a heuristic low-rank basis L_0; this even includes the convex method (Davis and Dhillon 2008). In terms of training costs, most low-rank metric learning approaches, including (Goldberger, Roweis, and Salakhutdinov 2004; Torresani and Lee 2006; McFee and Lanckriet 2010; Shen et al. 2012; Kunapuli and Shavlik 2012; Lim, McFee, and Lanckriet 2013), incur quadratic or cubic time complexities in the dimensionality d. A few methods such as (Davis and Dhillon 2008) and (Mensink et al. 2013) are known to provide efficient O(d) procedures, but they both rely on and are sensitive to an initialization of the heuristic low-rank basis L_0.

This work intends to pursue a convex low-rank metric learning approach without resorting to any heuristics. Recently, a trend in similarity learning has emerged (Kar and Jain 2011; 2012; Chechik et al. 2010; Crammer and Chechik 2012; Cheng 2013), which attempts to learn similarity functions in the form of S(x, x′) = x^⊤Mx′ and appears to be more flexible than distance metric learning. In light of this trend, we incorporate the benefits of both low-rank metrics and similarity functions to learn a low-rank metric M that yields a discriminative similarity function S(·). Specifically, we propose a novel low-rank metric learning algorithm by designing a convex objective which consists of a hinge loss enforced over pairwise supervised information and a trace norm regularization penalty promoting low-rankness of the desired metric. Significantly, we show that minimizing such an objective guarantees an optimal solution of a certain low-rank structure, M⋆ = UW⋆U^⊤. The low-rank basis U included in any optimal metric M⋆ turns out to lie in the column (or range) space of the input data matrix X ∈ R^{d×n}, and W⋆ can thus be sought within a fixed subspace whose dimensionality is much lower than the original dimensionality.

By virtue of this key finding, our algorithm decomposes the challenging high-dimensional learning task into two steps which are both computationally tractable and cost a total of O(d) time. The first step is dimensionality reduction via an SVD-based projection, and the second step is lower dimensional metric learning obtained by applying a linearized modification of the Alternating Direction Method of Multipliers (ADMM) (Boyd et al. 2011). The linearized ADMM maintains analytic-form updates across all iterations, works efficiently, and converges rapidly. We evaluate the proposed algorithm on four benchmark datasets with tens of thousands of dimensions, demonstrating that the low-rank metrics obtained by our algorithm accomplish prominent performance gains in terms of kNN classification.

Low-Rank Similarity Metrics

Different from most previous metric learning approaches that plug the learned metric into a distance function like √((x − x′)^⊤M(x − x′)), in this paper we pursue a low-rank metric M = LL^⊤ for a bilinear similarity function to measure the similarity between two arbitrary inputs x and x′ in R^d:

\[
S_M(x, x') = x^\top M x' = x^\top L L^\top x' = (L^\top x)^\top (L^\top x'), \tag{1}
\]

which is parameterized by a low-rank basis L ∈ R^{d×r} with rank r ≪ d. Naturally, such a similarity function characterized by a low-rank metric essentially measures the inner-product between the two transformed inputs L^⊤x and L^⊤x′ in a lower dimensional space R^r.

Motivated by (Kriegel, Schubert, and Zimek 2008), which presented a statistical means based on angles rather than distances to identify outliers scattered in a high-dimensional space, we argue that angles own a stronger discriminating power than distances in capturing affinities among data points in high dimensions. Suppose that we are provided with a set of pairwise labels {y_ij ∈ {1, −1}}, of which y_ij = 1 stands for a similar data pair (x_i, x_j) while y_ij = −1 for a dissimilar pair (x_i, x_j). The purpose of our supervised similarity metric learning is to make an acute angle between any similar pair and an obtuse angle between any dissimilar pair after applying the learned metric M, or equivalently, after mapping data points from x to L^⊤x. Formally, this purpose is expressed as

\[
S_M(x_i, x_j) \;\begin{cases} > 0, & y_{ij} = 1, \\ < 0, & y_{ij} = -1. \end{cases} \tag{2}
\]

Fig. 1 exemplifies our notion for those angles between similar and dissimilar data pairs.

Figure 1: The effect of a desired low-rank similarity metric that induces a linear transformation L^⊤. Points with the same color denote similar data examples, while points with different colors denote dissimilar data examples.

Learning Framework

Let us inherit the conventional deployment of pairwise supervision adopted by most metric learning methods, e.g., (Xing et al. 2002; Bar-Hillel et al. 2005; Davis et al. 2007; Davis and Dhillon 2008; Wu et al. 2009; Hoi, Liu, and Chang 2010; Liu et al. 2010a; 2010b; Niu et al. 2012; Ying and Li 2012). This paper tackles a supervised similarity metric learning problem provided with a set of labeled pairs I = {(x_i, x_j, y_ij)}. Suppose that n (≪ d) data examples X = {x_i ∈ R^d}_{i=1}^n are gathered in the set I. Inspired by the aforementioned angle-based learning principle in Eq. (2), we design a margin criterion over the similarity outputs of the supervised pairs, that is, S_M(x_i, x_j) ≥ 1 for y_ij = 1 and S_M(x_i, x_j) ≤ −ε (ε is a small positive value) for y_ij = −1. Note that setting the margin values to 1 and −1 for positive and negative pairs in the context of similarity learning may be infeasible for multi-class data. For example, suppose there are data points x_1, x_2, and x_3 from three different classes. Imposing S_M(x_1, x_2) ≤ −1, S_M(x_1, x_3) ≤ −1, and S_M(x_2, x_3) ≤ −1 is problematic since the former two likely lead to S_M(x_2, x_3) > 0.

This similarity margin criterion ensures that after applying the linear transformation L^⊤ induced by a desired metric M, the angle between a similar pair (x_i, x_j) is acute since their inner-product (L^⊤x_i)^⊤(L^⊤x_j) = S_M(x_i, x_j) is large enough, and meanwhile the angle between a dissimilar pair is obtuse since their inner-product is negative.

In order to promote low-rankness of the target metric, we leverage the nuclear norm ‖M‖_*, the sum of all singular values of M, as a regularization penalty in our learning framework. The nuclear norm has been corroborated to encourage low-rank matrix structure in the matrix completion literature (Fazel, Hindi, and Boyd 2001; Candès and Recht 2009; Recht, Fazel, and Parrilo 2010). Moreover, since our framework constrains M to be in the positive semidefinite (PSD) cone S^d_+ to make M a valid metric, the nuclear norm ‖M‖_* can be replaced by the trace norm tr(M).

Integrating the angle principle and the trace norm regularizer, the objective of our proposed Low-Rank Similarity Metric Learning (LRSML) framework is formulated as

\[
\min_{M \in S^d_+} \; f(M) := \sum_{i,j=1}^{n} \big[\tilde{y}_{ij} - y_{ij} S_M(x_i, x_j)\big]_+ + \alpha\,\mathrm{tr}(M), \tag{3}
\]

where ỹ_ij = ε if y_ij = −1 and ỹ_ij = y_ij otherwise, [x]_+ = max(0, x), and α > 0 is the regularization parameter. Above all, the objective in Eq. (3) is convex and does not introduce any heuristic low-rank basis, which was required by the prior convex method (Davis and Dhillon 2008) and those nonconvex methods (Goldberger, Roweis, and Salakhutdinov 2004; Torresani and Lee 2006; Mensink et al. 2013; Lim and Lanckriet 2014) in low-rank distance metric learning. Second, minimizing the hinge loss in Eq. (3) enforces the margin criterion on the similarity function S_M so that the dissimilar pairs can be clearly distinguished from the similar ones. It is worth clarifying that we do not pursue S_M(x_i, x_j) ≤ 0 (i.e., ε = 0) in the case of y_ij = −1 in Eq. (2) because it may lead to the trivial solution M = 0 to Eq. (3).¹

Due to the space limit, all the proofs of the lemmas and theorems presented in this section are placed in the supplemental material.

¹ We observed the trivial solution M = 0 in the case of ε = 0 when the training data examples are few and the regularization parameter α is relatively large.

Optimality Characterization

Despite being convex, the framework in Eq. (3) is not easy to tackle directly when the data points of X live in a space of high dimensions, i.e., d is very large. Under the high-dimensional scenario, basic matrix manipulations upon M are too expensive to execute. For example, projecting M onto S^d_+ costs O(d^3). Nonetheless, by delving into problem (3) we can characterize a certain low-rank structure of all its optimal solutions. In particular, such a low-rank structure can be proven to be of the form UWU^⊤, in which U is a low-rank basis lying in the column space of the data matrix X.

Let us do the singular value decomposition (SVD) over the data matrix X = [x_1, ···, x_n] ∈ R^{d×n}, obtaining X = UΣV^⊤ = Σ_{i=1}^m σ_i u_i v_i^⊤, where m (≤ n ≪ d) is the rank of X, σ_1, ···, σ_m are the positive singular values, Σ = diag(σ_1, ···, σ_m), and U = [u_1, ···, u_m] ∈ R^{d×m} and V = [v_1, ···, v_m] ∈ R^{n×m} are the matrices composed of left- and right-singular vectors, respectively. It is obvious that range(X) ≡ range(U), in which we write range(A) = span({a_i}_i) for the column space of a given matrix A. Significantly, the optimality of problem (3) can be characterized as follows.

Lemma 1. For any optimal solution M⋆ to problem (3), we have M⋆ ∈ {UWU^⊤ | W ∈ S^m_+}.

Lemma 1 reveals that although we are seeking M⋆ in S^d_+, any optimal solution M⋆ must obey the certain low-rank structure UWU^⊤ because of the nature of problem (3). This important discovery enables us to first project the data onto the low-rank basis U and then seek W via optimizing a new problem in Eq. (4), which is at a much smaller scale than the raw problem in Eq. (3).

Theorem 1. For any i ∈ [1 : n], denote x̃_i = U^⊤x_i ∈ R^m. Then M⋆ is an optimal solution to problem (3) if and only if M⋆ = UW⋆U^⊤ in which W⋆ belongs to the set

\[
\arg\min_{W \in S^m_+} \; \tilde{f}(W) := \sum_{i,j=1}^{n} \big[\tilde{y}_{ij} - y_{ij}\, \tilde{x}_i^\top W \tilde{x}_j\big]_+ + \alpha\,\mathrm{tr}(W). \tag{4}
\]

Theorem 1 indicates that through a simple projection step (x_i → U^⊤x_i), we are able to reduce the high-dimensional metric learning task in Eq. (3) to a lower dimensional metric learning problem in Eq. (4) with guarantees, thereby greatly reducing the computational costs. However, the scale of problem (4) could still be non-negligible, which severely limits the use of off-the-shelf interior point based solvers such as SDPT3 (Toh, Todd, and Tütüncü 1999). In the next subsection, we will devise an efficient first-order method to solve problem (4).

Similar projection tricks have been recently exploited and proven for convex distance metric learning problems regularized by the squared Frobenius norm (Chatpatanasiri et al. 2010) and several other strictly convex functions (Kulis 2012), whose proof strategies, however, cannot carry over to the trace norm since the trace norm is not strictly convex. To the best of our knowledge, the structural representation UW⋆U^⊤ characterizing optimal metric solutions is explored here for the first time for trace norm regularized convex metric learning.
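The projection step suggested by Theorem 1 takes only a few lines; the sketch below (our illustration, with made-up sizes and a rank tolerance that the paper does not specify) computes U and the projected data X̃ = U^⊤X on which the reduced problem of Eq. (4) operates.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 20_000, 300                        # the assumed regime n << d
X = rng.standard_normal((d, n))           # placeholder data matrix, one column per example

# Thin SVD X = U Sigma V^T; with n << d this costs O(d n^2) and can also be
# obtained from the eigen-decomposition of the small n x n matrix X^T X.
U, sigma, _ = np.linalg.svd(X, full_matrices=False)
m = int(np.sum(sigma > 1e-10 * sigma[0])) # numerical rank of X (tolerance is our choice)
U, sigma = U[:, :m], sigma[:m]

X_tilde = U.T @ X                         # projected data, now only m x n with m <= n
# Any optimal metric has the form M* = U W* U^T, so the remaining unknown is the
# small m x m matrix W* of Eq. (4); the d x d matrix M* never needs to be formed.
print(U.shape, X_tilde.shape)
```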

Linearized ADMM

Now we develop an efficient optimization method to exactly solve Eq. (4). To facilitate the derivations, we use three matrix operators: the Hadamard product ◦, the inner-product ⟨·, ·⟩, and the Frobenius norm ‖·‖_F.

We first introduce a dummy variable matrix Z = [z_ij]_ij ∈ R^{n×n} to equivalently rewrite problem (4) as

\[
\min_{W \in S^m_+,\; Z} \; \sum_{i,j=1}^{n} [z_{ij}]_+ + \alpha\,\mathrm{tr}(W) \quad \text{s.t.} \quad Z - \tilde{Y} + Y \circ \big(\tilde{X}^\top W \tilde{X}\big) = 0, \tag{5}
\]

in which three constant matrices X̃ = [x̃_1, ···, x̃_n] = U^⊤X ∈ R^{m×n}, Y = [y_ij]_ij ∈ R^{n×n}, and Ỹ = [ỹ_ij]_ij ∈ R^{n×n} are involved. Both the objective function and the constraint in Eq. (5) are separable in terms of the two variables W and Z, which naturally suggests the use of alternating direction methods. To this end, we employ the increasingly popular Alternating Direction Method of Multipliers (ADMM) (Boyd et al. 2011) to cope with Eq. (5). ADMM works on the following augmented Lagrangian function:

\[
L_\rho(Z, W; \Lambda) := \sum_{i,j=1}^{n} [z_{ij}]_+ + \alpha\,\mathrm{tr}(W) + \Big\langle \Lambda,\; Z - \tilde{Y} + Y \circ \big(\tilde{X}^\top W \tilde{X}\big) \Big\rangle + \frac{\rho}{2} \Big\| Z - \tilde{Y} + Y \circ \big(\tilde{X}^\top W \tilde{X}\big) \Big\|_F^2, \tag{6}
\]

where Λ = [λ_ij]_ij ∈ R^{n×n} is the multiplier for the linear constraint in Eq. (5) (a.k.a. the dual variable), and ρ > 0 is the penalty parameter. ADMM proceeds by successively and alternatingly updating the three variable matrices Z^k, W^k and Λ^k (k = 0, 1, ···).

Z, Λ-Updates

Here we first present the updates for Z and Λ, which are easier to achieve than the update for W.

The Z-update is achieved by minimizing L_ρ(Z, W; Λ) with respect to Z while keeping W and Λ fixed, that is,

\[
Z^{k+1} := \arg\min_{Z} L_\rho\big(Z, W^k; \Lambda^k\big) = \arg\min_{Z} \sum_{i,j=1}^{n} [z_{ij}]_+ + \frac{\rho}{2} \Big\| Z - \tilde{Y} + Y \circ \big(\tilde{X}^\top W^k \tilde{X}\big) + \Lambda^k/\rho \Big\|_F^2.
\]

To solve this Z subproblem, we utilize the proximal operator T_θ(a) = arg min_{z ∈ R} θ[z]_+ + (1/2)(z − a)² (θ > 0) to handle [z_ij]_+. T_θ(a) is a proximal mapping of θ[z]_+ and has been shown in (Ye, Chen, and Xie 2009) to output

\[
T_\theta(a) = \begin{cases} a - \theta, & a > \theta, \\ 0, & 0 \le a \le \theta, \\ a, & a < 0. \end{cases} \tag{7}
\]

By applying T_{1/ρ} elementwise, we obtain a closed-form solution to the Z subproblem as

\[
Z^{k+1} := T_{1/\rho}\Big( \tilde{Y} - Y \circ \big(\tilde{X}^\top W^k \tilde{X}\big) - \Lambda^k/\rho \Big). \tag{8}
\]

The Λ-update is prescribed to be

\[
\Lambda^{k+1} := \Lambda^k + \rho\Big( Z^{k+1} - \tilde{Y} + Y \circ \big(\tilde{X}^\top W^{k+1} \tilde{X}\big) \Big). \tag{9}
\]
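Eqs. (7)-(8) translate directly into vectorized code; the following sketch (ours, assuming numpy arrays W, X_tilde, Y, Y_tilde, Lam shaped as in the text) applies the proximal operator elementwise to obtain the closed-form Z-update.

```python
import numpy as np

def prox_hinge(A, theta):
    """Elementwise proximal operator T_theta of Eq. (7):
    a - theta where a > theta, 0 on [0, theta], and a where a < 0."""
    return np.where(A > theta, A - theta, np.where(A < 0.0, A, 0.0))

def z_update(W, X_tilde, Y, Y_tilde, Lam, rho):
    """Closed-form Z-update of Eq. (8); S collects the current similarity outputs."""
    S = X_tilde.T @ W @ X_tilde
    return prox_hinge(Y_tilde - Y * S - Lam / rho, 1.0 / rho)
```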

W-Update

In the standard ADMM framework, W is updated by minimizing L_ρ(Z, W; Λ) with respect to W while Z and Λ are fixed. Noticing that X̃X̃^⊤ = ΣV^⊤VΣ = Σ², we derive the W-update as follows:

\[
\begin{aligned}
W^{k+1} :=\;& \arg\min_{W \in S^m_+} L_\rho\big(Z^{k+1}, W; \Lambda^k\big) \\
=\;& \arg\min_{W \in S^m_+} \alpha\,\mathrm{tr}(W) + \Big\langle \Lambda^k,\; Y \circ \big(\tilde{X}^\top W \tilde{X}\big) \Big\rangle + \frac{\rho}{2} \Big\| Z^{k+1} - \tilde{Y} + Y \circ \big(\tilde{X}^\top W \tilde{X}\big) \Big\|_F^2 \\
=\;& \arg\min_{W \in S^m_+} \frac{\rho}{2} \big\| \tilde{X}^\top W \tilde{X} \big\|_F^2 + \langle \alpha I, W \rangle + \Big\langle \Lambda^k + \rho Z^{k+1} - \rho \tilde{Y},\; Y \circ \big(\tilde{X}^\top W \tilde{X}\big) \Big\rangle \\
=\;& \arg\min_{W \in S^m_+} \frac{\rho}{2} \big\langle W \Sigma^2, \Sigma^2 W \big\rangle + \langle \alpha I, W \rangle + \Big\langle \tilde{X} \big( Y \circ (\Lambda^k + \rho Z^{k+1} - \rho \tilde{Y}) \big) \tilde{X}^\top,\; W \Big\rangle \\
=\;& \arg\min_{W \in S^m_+} \frac{1}{2} \big\langle W \Sigma^2, \Sigma^2 W \big\rangle + \Big\langle \alpha I/\rho + \tilde{X} \big( Y \circ (Z^{k+1} - \tilde{Y} + \Lambda^k/\rho) \big) \tilde{X}^\top,\; W \Big\rangle \\
=\;& \arg\min_{W \in S^m_+} g\big(W; Z^{k+1}, \Lambda^k\big),
\end{aligned} \tag{10}
\]

where I denotes the m × m identity matrix. Unfortunately, there is no analytical solution available for this W subproblem in Eq. (10). Trying iterative optimization methods to solve Eq. (10) would, however, limit the efficiency of ADMM. In what follows, we develop a linearized modification of the original ADMM framework in Eqs. (8)(10)(9) by rectifying the W-update in Eq. (10) to an easier one.

Linearization. Rather than dealing with the intrinsically quadratic g(W; Z^{k+1}, Λ^k), we resort to optimizing a surrogate function ĝ_τ(W; Z^{k+1}, Λ^k), which is the linear approximation to g augmented by a quadratic proximal term. Concretely, the rectified W-update is

\[
\begin{aligned}
W^{k+1} :=\;& \arg\min_{W \in S^m_+} \hat{g}_\tau\big(W; Z^{k+1}, \Lambda^k\big) \\
=\;& \arg\min_{W \in S^m_+} \frac{1}{2\tau} \big\| W - W^k \big\|_F^2 + g\big(W^k; Z^{k+1}, \Lambda^k\big) + \Big\langle \nabla_W\, g\big(W; Z^{k+1}, \Lambda^k\big)\Big|_{W = W^k},\; W - W^k \Big\rangle \\
=\;& \arg\min_{W \in S^m_+} \big\langle G^k,\; W - W^k \big\rangle + \frac{1}{2\tau} \big\| W - W^k \big\|_F^2 \\
=\;& \arg\min_{W \in S^m_+} \frac{1}{2\tau} \big\| W - \big(W^k - \tau G^k\big) \big\|_F^2 \\
=\;& P_{S^m_+}\big( W^k - \tau G^k \big).
\end{aligned} \tag{11}
\]

In Eq. (11) we introduce the matrix

\[
G^k = \nabla_W\, g\big(W; Z^{k+1}, \Lambda^k\big)\Big|_{W = W^k} = \alpha I/\rho + \tilde{X} \big( Y \circ (Z^{k+1} - \tilde{Y} + \Lambda^k/\rho) \big) \tilde{X}^\top + \Sigma^2 W^k \Sigma^2. \tag{12}
\]

The introduced operator P_{S^m_+}(·) represents the projection onto the PSD cone S^m_+; for example, P_{S^m_+}(A) = Σ_{i=1}^m [γ_i]_+ p_i p_i^⊤ for any symmetric matrix A ∈ R^{m×m} whose eigenvector-eigenvalue pairs are {(p_i, γ_i)}_{i=1}^m. The introduced parameter τ > 0 is the step size, which is to be specified later.

The advantage of linearizing g(W; Z^{k+1}, Λ^k) is that the resulting W-update in Eq. (11) enjoys an analytical expression and consequently works very efficiently. This modification of the standard ADMM, i.e., optimizing ĝ_τ instead of g, is the core ingredient of our developed framework, namely the linearized ADMM. The following theorem certifies that the developed linearized ADMM is theoretically sound, producing a globally optimal solution to the raw problem in Eq. (5).

Theorem 2. Given 0 < τ < 1/‖X̃‖_op^4 = 1/‖X‖_op^4, the sequence {Z^k, W^k, Λ^k}_k generated by the linearized ADMM in Eqs. (8)(11)(9) starting with any symmetric Z^0, W^0, Λ^0 converges to an optimal solution of problem (5).

Note that ‖·‖_op denotes the operator norm of matrices. For simplicity, we initialize W^0 = I_{m×m} and Λ^0 = 0_{n×n} to launch the linearized ADMM.

Algorithm

We summarize the proposed Low-Rank Similarity Metric Learning (LRSML) approach in Algorithm 1, which consists of two steps: 1) SVD-based projection, and 2) lower dimensional metric learning via the linearized ADMM. In the projection step, the SVD of X can be efficiently performed in O(dn²) time by running the eigen-decomposition of the matrix X^⊤X. If the data vectors in X are zero-centered, such a decomposition is exactly PCA. In the linearized ADMM step, we use the scaled dual variable matrix Λ̃^k = Λ^k/ρ. Importantly, there is no need to explicitly compute or even store the large d × d metric matrix M; it suffices to store the low-rank basis U and update the much smaller m × m metric matrix W^k, as indicated by Theorem 1. In doing so, LRSML circumvents expensive matrix manipulations such as projecting M onto S^d_+, therefore accomplishing considerable savings in both storage and computation. The total time complexity of LRSML is bounded by O(dn² + Tn³), which scales linearly with the input dimensionality d. In practice, we find that the linearized ADMM converges rapidly within T = 1,000 iterations under the setting of ρ = 1, τ = 0.01 (see the convergence curves in the supplemental material).

Algorithm 1  Low-Rank Similarity Metric Learning
Input: the data matrix X ∈ R^{d×n} (d ≫ n), the pairwise label matrix Y ∈ {1, −1}^{n×n}, the regularization parameter α > 0, three positive constants ε, ρ, τ, the budget iteration number T.
SVD Projection: perform the SVD of X = UΣV^⊤ to obtain U ∈ R^{d×m} and Σ ∈ R^{m×m}_+, and compute the projected data X̃ = U^⊤X ∈ R^{m×n}.
Linearized ADMM:
  form the matrix Ỹ ∈ {1, ε}^{n×n} according to Y;
  initialize W^0 = I_{m×m}, Λ̃^0 = 0_{n×n}, S^0 = X̃^⊤X̃;
  for k = 1, ···, T do
    Z^k ← T_{1/ρ}(Ỹ − Y ◦ S^{k−1} − Λ̃^{k−1}),
    G^{k−1} ← (α/ρ) I_{m×m} + X̃(Y ◦ (Z^k − Ỹ + Λ̃^{k−1}))X̃^⊤ + Σ²W^{k−1}Σ²,
    W^k ← P_{S^m_+}(W^{k−1} − τG^{k−1}),   S^k ← X̃^⊤W^kX̃,
    Λ̃^k ← Λ̃^{k−1} + Z^k − Ỹ + Y ◦ S^k,
    if ‖Z^k − Ỹ + Y ◦ S^k‖²_F / n² ≤ 10^{−12} then break end if
  end for.
Output: run the eigen-decomposition W^k = HEH^⊤ in which E ∈ R^{r×r} retains the positive eigenvalues, and output the optimal low-rank basis L⋆ = UHE^{1/2} ∈ R^{d×r} (the corresponding low-rank metric is M⋆ = L⋆L⋆^⊤).

Discussion

A primary advantage of our analysis in Lemma 1 and Theorem 1 is avoiding the expensive projection onto the high-dimensional PSD cone S^d_+, which was required by the previous trace norm regularized metric learning methods such as (McFee and Lanckriet 2010; Lim, McFee, and Lanckriet 2013). In the supplemental material, we further provide in-depth theoretical analysis (see Lemma 3 and Theorem 3) to comprehensively justify the low-rank solution structure M⋆ = UW⋆U^⊤ for any convex loss function in terms of x_i^⊤Mx_j regularized by the trace norm tr(M) or the squared Frobenius norm ‖M‖²_F. As a result, our analysis would directly lead to scalable O(d) algorithms for low-rank distance or similarity metric learning supervised by instance-level, pairwise, or listwise label information. For example, our analysis would give an O(d)-time algorithm for optimizing the low-rank distance metric learning objective (a hinge loss based on listwise supervision plus a trace norm regularizer) in (McFee and Lanckriet 2010) by following our proposed two-step scheme: SVD projection + lower dimensional metric learning.

We remark here that linearized ADMM techniques have been investigated in the optimization community, and applied to problems arising from compressed sensing (Yang and Zhang 2011), image processing (Zhang et al. 2010), and nuclear norm regularized convex optimization (Yang and Yuan 2013). Nevertheless, the linearized ADMM method that we developed in this paper is self-motivated in the sense that it was derived for solving a particular optimization problem of low-rank metric learning. To the best of our knowledge, linearized ADMM has not been applied to metric learning before, and the analytic-form updates produced by our linearized ADMM method are quite advantageous. Note that (Lim, McFee, and Lanckriet 2013) applied the standard ADMM to solve a subproblem in every iteration of their optimization method, but executing ADMM even once costs cubic time O(T′d³) (T′ is the iteration number of ADMM), which is computationally infeasible in high dimensions. By contrast, our linearized ADMM tackles the entire metric optimization in O(Tn³) time, performing at least one order of magnitude faster than the standard ADMM by virtue of the elegant W-update.

Experiments

We carry out the experiments on four benchmark datasets including two document datasets, Reuters-28 and TDT2-30 (Cai, He, and Han 2011), and two image datasets, UIUC-Sports (Li and Fei-Fei 2007) and UIUC-Scene (Lazebnik, Schmid, and Ponce 2006). Note that we choose 28 categories in the raw Reuters-21578 corpus (Cai, He, and Han 2011) such that each category has no less than 20 examples. In Reuters-28 and TDT2-30, each document is represented by an ℓ2 normalized TF-IDF feature vector; in UIUC-Sports and UIUC-Scene, each image is represented by an ℓ2 normalized sparse-coding feature vector (Wang et al. 2010). The basic information about these four datasets is shown in Table 1.

Table 1: Basic descriptions of the four datasets used in the experiments.

Dataset      | # Classes | # Samples | # Dimensions
Reuters-28   | 28        | 8,030     | 18,933
TDT2-30      | 30        | 9,394     | 36,771
UIUC-Sports  | 8         | 1,579     | 87,040
UIUC-Scene   | 15        | 4,485     | 21,504

The baseline method is named "Original", which takes the original feature vectors for 1NN classification, and the linear SVM is also included in the comparison. We evaluate and compare eight metric and similarity learning methods: two classical linear dimensionality reduction methods, Latent Semantic Analysis (LSA) (Deerwester et al. 1990) and Fisher Linear Discriminant Analysis (FLDA) (Hastie, Tibshirani, and Friedman 2009); two low-rank variants, LSA-ITML and CM-ITML (Davis and Dhillon 2008), of the representative distance metric learning method Information-Theoretic Metric Learning (ITML) (Davis et al. 2007) (LSA-ITML uses LSA's output as its heuristic low-rank basis, while CM-ITML uses the class means as its heuristic low-rank basis); two recent low-rank distance metric learning methods, Metric Learning to Rank (MLR) (McFee and Lanckriet 2010) and Metric Learning for the Nearest Class Mean classifier (MLNCM) (Mensink et al. 2013) (MLNCM also uses LSA's output as its heuristic low-rank basis); a similarity learning method, Adaptive Regularization Of MAtrix models (AROMA) (Crammer and Chechik 2012); and the Low-Rank Similarity Metric Learning (LRSML) method proposed in this paper. Except MLNCM, which requires the Nearest Class Mean classifier, all the compared metric/similarity learning methods collaborate with the 1NN classifier. Note that we choose to run LSA instead of PCA because of the nonnegative and sparse nature of the used feature vectors. LSA-ITML, CM-ITML, MLR and LRSML are convex methods, while MLNCM is a nonconvex method. We choose the most efficient version of AROMA, which restricts the matrix M used in the similarity function to be diagonal. To efficiently implement MLR, we follow the two-step scheme justified in this paper: SVD projection followed by MLR with lower dimensionality.

Except for AROMA, which does not impose the PSD constraint on M, each of the referred methods can yield a low-rank basis L. For them, we try three measures between two inputs x and x′: (i) the distance ‖L^⊤x − L^⊤x′‖, (ii) the inner-product (L^⊤x)^⊤(L^⊤x′), and (iii) the cosine (L^⊤x)^⊤(L^⊤x′) / (‖L^⊤x‖‖L^⊤x′‖). For LSA, FLDA, LSA-ITML, CM-ITML, MLR and MLNCM, the distance and cosine measures are tried; for LRSML, the inner-product and cosine measures are tried since the inner-product measure (ii) is exactly the bilinear similarity function shown in Eq. (1). The baseline "Original" gives the same results under the three measures as the feature vectors are already ℓ2 normalized.

Let C be the number of classes in each dataset. On Reuters-28 and TDT2-30, we select 5 × C up to 30 × C samples for training such that each category covers at least one sample; we pick the same number of samples for cross-validation; the rest of the samples are for testing. We repeat 20 random training/validation/test splits, and then report the average classification error rates and training time for all the competing methods in Tables 2 and 3. On the two image datasets, we follow the commonly used evaluation protocols as in (Lazebnik, Schmid, and Ponce 2006; Wang et al. 2010). On UIUC-Sports, we select 10 to 70 samples per class for training, half that number of samples for validation, and the remaining ones for testing; on UIUC-Scene, we follow a similar setting and the training samples range from 10 to 100 per class. By running 20 times, the averages of the per-class recognition rates as well as the training time are reported in the tables of the supplemental material. To run our proposed method LRSML, we fix ε = 0.1, ρ = 1, and find that τ = 0.01 makes the linearized ADMM converge within T = 1,000 iterations on all datasets. The results shown in all the referred tables demonstrate that LRSML consistently leads to the highest accuracy in terms of 1NN classification. Across all datasets, the best two results are achieved by LRSML with its two measures; the cosine measure almost always results in slightly higher classification accuracy than the inner-product measure. More experimental results are presented in the supplemental material.

Conclusions

The proposed LRSML formulated a trace norm regularized convex objective which was proven to yield a globally optimal solution with a certain low-rank structure. With this optimality guarantee, the challenging high-dimensional metric learning task can be reduced to a lower dimensional metric learning problem after a simple SVD projection. The linearized ADMM was further developed to efficiently solve the reduced problem. Our approach is shown to maintain linear space and time complexities in the input dimensionality, and is consequently scalable to high-dimensional data domains. Through extensive experiments carried out on four benchmark datasets with up to 87,000 dimensions, we certify that our learned low-rank similarity metrics are well suited to high-dimensional problems and exhibit prominent performance gains in kNN classification. In future work, we would like to investigate generalization properties of the proposed LRSML, following (Cao, Guo, and Ying 2012), and aim to establish a tighter generalization bound for trace norm regularized convex metric learning.

Acknowledgments

During an earlier version of this work, Wei Liu was supported by the Josef Raviv Memorial Postdoctoral Fellowship. The authors would like to thank Prof. Gang Hua and Dr. Fuxin Li for constructive suggestions that helped shape the final version.

Table 2: Reuters-28 dataset: classification error rates and training time of eight competing metric/similarity learning methods (training time is reported once per method).

                              5×28 training samples             30×28 training samples
Method      Measure           Error Rate (%)  Train Time (sec)  Error Rate (%)  Train Time (sec)
Original    –                 23.94±1.63      –                 16.56±0.55      –
Linear SVM  –                 21.25±1.46      0.03               9.16±0.43      0.12
LSA         distance          42.21±4.48      0.02              24.56±5.30      0.55
            cosine            21.99±1.89                        11.85±0.48
FLDA        distance          39.46±6.10      0.03              15.19±1.38      0.75
            cosine            21.34±3.07                        10.05±0.65
LSA-ITML    distance          20.17±0.95      47.2              10.06±0.99      6158.0
            cosine            21.29±1.02                         9.06±0.45
CM-ITML     distance          33.38±3.57      7.3               18.18±0.74      219.3
            cosine            23.49±1.42                        13.28±0.76
MLR         distance          27.08±2.75      17.8              13.29±1.14      1805.7
            cosine            20.22±1.44                         9.86±0.53
MLNCM       distance          21.07±1.62      1.1               12.62±0.55      138.5
            cosine            22.88±2.52                        13.51±0.79
AROMA       similarity        21.25±1.53      33.1              13.49±0.63      2328.1
LRSML       inner-product     15.56±1.55      9.1                8.84±0.61      676.0
            cosine            14.83±1.20                         7.29±0.40

Table 3: TDT2-30 dataset: classification error rates and training time of eight competing metric/similarity learning methods (training time is reported once per method).

                              5×30 training samples             30×30 training samples
Method      Measure           Error Rate (%)  Train Time (sec)  Error Rate (%)  Train Time (sec)
Original    –                 10.12±1.24      –                  6.75±0.39      –
Linear SVM  –                  9.80±2.46      0.06               2.59±0.24      0.37
LSA         distance          30.01±10.14     0.02               8.52±1.19      0.68
            cosine             8.19±1.49                         4.26±0.44
FLDA        distance          36.57±4.80      0.04               7.50±0.62      1.08
            cosine             8.77±1.96                         3.56±0.34
LSA-ITML    distance          11.35±2.99      35.7               5.68±0.76      1027.4
            cosine             5.85±1.14                         2.84±0.33
CM-ITML     distance          32.56±4.87      7.8                7.67±0.71      297.7
            cosine             8.75±2.48                         3.42±0.36
MLR         distance          18.50±3.51      30.1               7.50±0.75      1036.4
            cosine             8.04±1.11                         3.84±0.29
MLNCM       distance           8.48±1.38      1.5                4.69±0.32      231.5
            cosine             8.58±1.64                         4.39±0.38
AROMA       similarity        11.31±2.55      35.6               4.27±0.41      2655.6
LRSML       inner-product      4.74±0.83      10.2               2.37±0.20      813.9
            cosine             4.95±1.06                         2.27±0.15

References

Bar-Hillel, A.; Hertz, T.; Shental, N.; and Weinshall, D. 2005. Learning a mahalanobis metric from equivalence constraints. JMLR 6:937-965.
Bellet, A.; Habrard, A.; and Sebban, M. 2013. A survey on metric learning for feature vectors and structured data. arXiv:1306.6709.
Boyd, S. P.; Parikh, N.; Chu, E.; Peleato, B.; and Eckstein, J. 2011. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning 3(1):1-122.
Cai, D.; He, X.; and Han, J. 2011. Locally consistent concept factorization for document clustering. IEEE Transactions on Knowledge and Data Engineering 23(6):902-913.
Candès, E. J., and Recht, B. 2009. Exact matrix completion via convex optimization. Foundations of Computational Mathematics 9(6):717-772.
Cao, Q.; Guo, Z.-C.; and Ying, Y. 2012. Generalization bounds for metric and similarity learning. Technical report.
Chatpatanasiri, R.; Korsrilabutr, T.; Tangchanachaianan, P.; and Kijsirikul, B. 2010. A new kernelization framework for mahalanobis distance learning algorithms. Neurocomputing 73(10):1570-1579.
Chechik, G.; Sharma, V.; Shalit, U.; and Bengio, S. 2010. Large scale online learning of image similarity through ranking. JMLR 11:1109-1135.
Cheng, L. 2013. Riemannian similarity learning. In Proc. ICML.
Crammer, K., and Chechik, G. 2012. Adaptive regularization for weight matrices. In Proc. ICML.
Davis, J. V., and Dhillon, I. S. 2008. Structured metric learning for high dimensional problems. In Proc. KDD.
Davis, J. V.; Kulis, B.; Jain, P.; Sra, S.; and Dhillon, I. S. 2007. Information-theoretic metric learning. In Proc. ICML.
Deerwester, S. C.; Dumais, S. T.; Landauer, T. K.; Furnas, G. W.; and Harshman, R. A. 1990. Indexing by latent semantic analysis. Journal of the American Society of Information Science 41(6):391-407.
Fazel, M.; Hindi, H.; and Boyd, S. P. 2001. A rank minimization heuristic with application to minimum order system approximation. In Proc. the American Control Conference, volume 6, 4734-4739.
Globerson, A., and Roweis, S. 2005. Metric learning by collapsing classes. In NIPS 18.
Goldberger, J.; Roweis, S.; and Salakhutdinov, R. 2004. Neighbourhood components analysis. In NIPS 17.
Hastie, T.; Tibshirani, R.; and Friedman, J. 2009. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Second Edition, Springer.
Hoi, S. C.; Liu, W.; and Chang, S.-F. 2010. Semi-supervised distance metric learning for collaborative image retrieval and clustering. ACM Transactions on Multimedia Computing, Communications and Applications 6(3).
Kar, P., and Jain, P. 2011. Similarity-based learning via data driven embeddings. In NIPS 24.
Kar, P., and Jain, P. 2012. Supervised learning with similarity functions. In NIPS 25.
Kriegel, H.-P.; Schubert, M.; and Zimek, A. 2008. Angle-based outlier detection in high-dimensional data. In Proc. KDD.
Kulis, B. 2012. Metric learning: a survey. Foundations and Trends in Machine Learning 5(4):287-364.
Kunapuli, G., and Shavlik, J. 2012. Mirror descent for metric learning: A unified approach. In Proc. ECML.
Lazebnik, S.; Schmid, C.; and Ponce, J. 2006. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In Proc. CVPR.
Li, L.-J., and Fei-Fei, L. 2007. What, where and who? classifying event by scene and object recognition. In Proc. ICCV.
Lim, D. K. H., and Lanckriet, G. 2014. Efficient learning of mahalanobis metrics for ranking. In Proc. ICML.
Lim, D. K. H.; McFee, B.; and Lanckriet, G. 2013. Robust structural metric learning. In Proc. ICML.
Liu, W.; Ma, S.; Tao, D.; Liu, J.; and Liu, P. 2010a. Semi-supervised sparse metric learning using alternating linearization optimization. In Proc. KDD.
Liu, W.; Tian, X.; Tao, D.; and Liu, J. 2010b. Constrained metric learning via distance gap maximization. In Proc. AAAI.
McFee, B., and Lanckriet, G. 2010. Metric learning to rank. In Proc. ICML.
Mensink, T.; Verbeek, J.; Perronnin, F.; and Csurka, G. 2013. Distance-based image classification: Generalizing to new classes at near zero cost. TPAMI 35(11):2624-2637.
Niu, G.; Dai, B.; Yamada, M.; and Sugiyama, M. 2012. Information-theoretic semi-supervised metric learning via entropy regularization. In Proc. ICML.
Recht, B.; Fazel, M.; and Parrilo, P. A. 2010. Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM Review 52(3):471-501.
Shen, C.; Kim, J.; Wang, L.; and van den Hengel, A. 2012. Positive semidefinite metric learning using boosting-like algorithms. JMLR 13:1007-1036.
Toh, K. C.; Todd, M. J.; and Tütüncü, R. H. 1999. SDPT3 - a Matlab software package for semidefinite programming, version 1.3. Optimization Methods and Software 11(1-4):545-581.
Torresani, L., and Lee, K.-C. 2006. Large margin component analysis. In NIPS 19.
Wang, J.; Yang, J.; Yu, K.; Lv, F.; Huang, T.; and Gong, Y. 2010. Locality-constrained linear coding for image classification. In Proc. CVPR.
Weinberger, K. Q., and Saul, L. K. 2009. Distance metric learning for large margin nearest neighbor classification. JMLR 10:207-244.
Wu, L.; Jin, R.; Hoi, S. C. H.; Zhu, J.; and Yu, N. 2009. Learning bregman distance functions and its application for semi-supervised clustering. In NIPS 22.
Xing, E. P.; Ng, A. Y.; Jordan, M. I.; and Russell, S. 2002. Distance metric learning with application to clustering with side-information. In NIPS 15.
Yang, J., and Yuan, X. 2013. Linearized augmented lagrangian and alternating direction methods for nuclear norm minimization. Mathematics of Computation 82(281):301-329.
Yang, J., and Zhang, Y. 2011. Alternating direction algorithms for ℓ1 problems in compressive sensing. SIAM Journal on Scientific Computing 33(1):250-278.
Ye, G.-B.; Chen, Y.; and Xie, X. 2009. Efficient variable selection in support vector machines via the alternating direction method of multipliers. In Proc. AISTATS.
Ying, Y., and Li, P. 2012. Distance metric learning with eigenvalue optimization. JMLR 13:1-26.
Ying, Y.; Huang, K.; and Campbell, C. 2009. Sparse metric learning via smooth optimization. In NIPS 22.
Zhang, X.; Burger, M.; Bresson, X.; and Osher, S. 2010. Bregmanized nonlocal regularization for deconvolution and sparse reconstruction. SIAM Journal on Imaging Sciences 3(3):253-276.
