METRIC LEARNING WITH RANK AND SPARSITY CONSTRAINTS

Bubacarr Bah† Stephen Becker⋆ Volkan Cevher† Baran Gözcü†

† Laboratory for Information and Inference Systems, EPFL, Switzerland ⋆ IBM Research, Yorktown Heights, USA

ABSTRACT

Choosing a distance preserving measure or metric is fundamental to many signal processing algorithms, such as k-means, nearest neighbor searches, hashing, and compressive sensing. In virtually all these applications, the efficiency of the signal processing algorithm depends on how fast we can evaluate the learned metric. Moreover, storing the chosen metric can create space bottlenecks in high dimensional signal processing problems. As a result, we consider data dependent metric learning with rank as well as sparsity constraints. We propose a new non-convex algorithm and empirically demonstrate its performance on various datasets; a side benefit is that it is also much faster than existing approaches. The added sparsity constraints significantly improve the speed of multiplying with the learned metrics without sacrificing their quality.

Index Terms: Metric learning, Nesterov acceleration, sparsity, low-rank, proximal gradient methods

This work was supported in part by the European Commission under Grant MIRG-268398, ERC Future Proof, SNF 200021-132548, SNF 200021-146750 and SNF CRSII2-147633. The author names are in alphabetical order.

1. INTRODUCTION

Learning “good distance” metrics for signals is key in real-world applications, such as data classification and retrieval. For instance, in the k-nearest-neighbor (KNN) classifier, we identify the k-nearest labeled images given a test image in the space of visual features. Hence, it is important to learn metrics that capture the similarity as well as the dissimilarity of datasets. Interestingly, several works have shown that properly chosen distance metrics can significantly benefit KNN accuracy as compared to the usual Euclidean metric [1–3].

In the standard metric learning problem (MLP) we learn a (semi-)norm $\|\mathbf{x}\|_{\mathbf{B}} = \sqrt{\mathbf{x}^T \mathbf{B} \mathbf{x}}$ for data $\mathbf{x} \in \mathbb{R}^N$ that respects dissimilarity constraints on a set $\mathcal{D}$ while obtaining the best similarity on a set $\mathcal{S}$:

$$\min_{\mathbf{B} \succeq 0} \sum_{(\mathbf{x}_i,\mathbf{x}_j) \in \mathcal{S}} \|\mathbf{x}_i - \mathbf{x}_j\|_{\mathbf{B}}^2 \quad \text{subject to} \quad \sum_{(\mathbf{x}_i,\mathbf{x}_j) \in \mathcal{D}} \|\mathbf{x}_i - \mathbf{x}_j\|_{\mathbf{B}}^2 \ge 1. \tag{1}$$

Even though (1) is a convex program, its numerical solution proves to be challenging since it does not fit into the standard convex optimization formulations, such as semidefinite or quadratic programming. Moreover, as the number of variables is quadratic in the number of features, even the most basic gradient-only approaches do not scale well [1, 4].

In general, the solution of (1) results in metrics that are full rank. Hence, the learned metric creates storage as well as computational bottlenecks. As a result, several works [2, 3, 5, 6] consider learning rank-constrained metrics. Unfortunately, enforcing rank constraints on the already computationally challenging MLP (1) makes it non-convex. Surprisingly, under certain conditions, it is possible to prove approximation quality and the generalization of solution algorithms in the high-dimensional limit [3, 5].

In this paper, we reformulate the standard MLP (1) into a non-convex optimization framework that can incorporate sparsity constraints in addition to the rank. That is, we learn distance metrics $\mathbf{B} = \mathbf{A}\mathbf{A}^T$ such that $\mathbf{A} \in \mathbb{R}^{N \times r}$ for a given $r \ll N$ and $\mathbf{A}$ is sparse. A sparse $\mathbf{A}$ has computational benefits like low storage and computational complexity. Consequently, this work could be useful in sparse low-rank factorization, which has numerous applications, including learning [7], deep neural networks [8], and autoencoding. This work is also related to optimizing projection matrices introduced in [9].

Our approach can also incorporate additional convex constraints on $\mathbf{A}$. To illustrate our algorithm, we use a symmetric and non-smooth version of the MLP (1) in the manner of [4, 5]. We then use the Nesterov smoothing approach to obtain a smooth cost that has approximation guarantees to the original problem, followed by Burer-Monteiro splitting [10] with quasi-Newton enhancements. Even without the sparsity constraints, our algorithmic approach is novel and leads to improved results over the previous state-of-the-art.

The paper is organized as follows. Section 2 sets up the notation used in the paper and introduces the needed definitions, while Section 3 describes the problem and its reformulation into an appropriate optimization framework. Section 4 states the algorithm and its theoretical background; it is followed by Section 5, which showcases the experimental results. Section 6 concludes the paper.

2. PRELIMINARIES

Notation: We denote integer scalars by lowercase letters (e.g., $i, k, r, p$), real scalars by lowercase Greek letters (e.g., $\alpha, \delta, \lambda$), sets by uppercase calligraphic letters (e.g., $\mathcal{S}$), vectors by lowercase boldface letters (e.g., $\mathbf{x}$) and matrices by uppercase boldface letters (e.g., $\mathbf{X}$). We denote $\mathbb{S}_+^{N \times N}$ as the set of positive semidefinite (PSD) matrices. The usual $\ell_1$ norm and $\ell_0$ pseudo-norm (number of non-zero entries) are extended to matrices by reshaping the matrix to a vector.

In Section 3, we reformulate (1) into a problem that is tightly connected with learning Johnson-Lindenstrauss (JL) embeddings or restricted isometry property (RIP) dimensionality reduction. A matrix $\mathbf{\Phi} \in \mathbb{R}^{r \times N}$ is a JL embedding of $p$ points, $\mathcal{X} := \{\mathbf{x}_i\}_{i=1}^{p}$, if $r = O(\log p)$ and there exists a positive constant $\delta > 0$ such that the following relations hold:

$$(1-\delta)\|\mathbf{x}_i - \mathbf{x}_j\|_2^2 \le \|\mathbf{\Phi}(\mathbf{x}_i - \mathbf{x}_j)\|_2^2 \le (1+\delta)\|\mathbf{x}_i - \mathbf{x}_j\|_2^2, \tag{2}$$

for every $\mathbf{x}_i, \mathbf{x}_j \in \mathcal{X}$, $\mathbf{x}_i \ne \mathbf{x}_j$ [11, 12]. Using the definition of the metric in Section 1, we can rewrite (2) as follows:

$$(1-\delta)\|\mathbf{x}_i - \mathbf{x}_j\|_2^2 \le \|\mathbf{x}_i - \mathbf{x}_j\|_{\mathbf{B}}^2 \le (1+\delta)\|\mathbf{x}_i - \mathbf{x}_j\|_2^2, \tag{3}$$

where $\mathbf{B} = \mathbf{\Phi}^T\mathbf{\Phi}$. Without loss of generality, to simplify the algebra, we are going to enforce the similarity or dissimilarity constraints on normalized differences of data points:

Definition 1. Given $\mathcal{X} \subset \mathbb{R}^N$ we define the set of secant vectors of points $\mathbf{x}_i, \mathbf{x}_j \in \mathcal{X}$ with $\mathbf{x}_i \ne \mathbf{x}_j$ as:

$$\mathcal{S}(\mathcal{X}) := \left\{ \mathbf{v}_{ij} = \frac{\mathbf{x}_i - \mathbf{x}_j}{\|\mathbf{x}_i - \mathbf{x}_j\|_2} \right\}. \tag{4}$$

3. PROBLEM DESCRIPTION

In this section, we set up the basic optimization problem that reveals the computational difficulty of (1). In general, the MLP (1) considers relationships between points based on their pairwise distances. Hence, we would require that the metric $\mathbf{B}$ preserves the pairwise distances of the points in $\mathcal{D}$ up to a distortion parameter $\delta$ as in (3) to yield a more stringent constraint for (1) while ignoring this requirement for points in $\mathcal{S}$. However, we set up a symmetric problem which then uses RIP in the manner of [4, 5] that can be adjusted depending on the individual application.

In this symmetric formulation, using secant vectors, (3) simplifies to $|\mathbf{v}_{ij}^T \mathbf{B} \mathbf{v}_{ij} - 1| \le \delta$ for $\mathbf{x}_i, \mathbf{x}_j \in \mathcal{X}$ with $\mathbf{x}_i \ne \mathbf{x}_j$. Re-indexing the $\mathbf{v}_{ij}$ to $\mathbf{v}_l$ for $l = 1, \dots, M$, where $M = \binom{p}{2}$, we form the set of $M$ secant vectors $\mathcal{S}(\mathcal{X}) = \{\mathbf{v}_1, \dots, \mathbf{v}_M\}$ into an $N \times M$ matrix $\mathbf{V} = [\mathbf{v}_1, \dots, \mathbf{v}_M]$. Then we learn a $\mathbf{B}$ that minimizes $|\mathbf{v}_l^T \mathbf{B} \mathbf{v}_l - 1|$ (with a slight abuse of notation we will refer to this quantity as $\delta$) over all $\mathbf{v}_l \in \mathcal{S}(\mathcal{X})$. It is known [13, 14] that the rank of $\mathbf{B}$ can be bounded as follows:

$$\operatorname{rank}(\mathbf{B}) \le \left\lceil \frac{\sqrt{8|\mathcal{S}(\mathcal{X})| + 1} - 1}{2} \right\rceil. \tag{5}$$

Let us define a linear transform $\mathcal{A} : \mathbb{S}_+^{N \times N} \to \mathbb{R}^M$ as $\mathcal{A}(\mathbf{B}) := \operatorname{diag}(\mathbf{V}^T \mathbf{B} \mathbf{V})$, where $\operatorname{diag}(\mathbf{H})$ denotes a vector of the entries of the principal diagonal of the matrix $\mathbf{H}$. Learning the $\mathbf{B}$ that minimizes $\delta$ for our symmetric MLP (1) therefore becomes the following non-convex problem:

$$\min_{\mathbf{B}} \|\mathcal{A}(\mathbf{B}) - \mathbf{1}_M\|_\infty \quad \text{subject to} \quad \mathbf{B} \succeq 0, \ \operatorname{rank}(\mathbf{B}) = r. \tag{6}$$

Instead of learning $\mathbf{B}$ directly, we opt to learn its factors, i.e., $\mathbf{B} = \mathbf{A}\mathbf{A}^T$, which is also known as Burer-Monteiro splitting [10]:

$$\min_{\mathbf{A} \in \mathbb{R}^{N \times r}} \|\mathcal{A}(\mathbf{A}\mathbf{A}^T) - \mathbf{1}_M\|_\infty. \tag{7}$$

The advantages of the non-convex formulation (7) are twofold: (i) it reduces storage space since the optimization variable lives in a much lower dimensional space (i.e., $N \times r \ll N \times N$), and (ii) it enables us to add additional constraints and regularizers on the factors $\mathbf{A}$ directly.

For instance, while the rank constraints can be achieved by constraining the dimension of $\mathbf{A}$ to be $N \times r$, we can also consider adding an $\ell_0$-norm constraint as well as an $\ell_1$-regularizer term (as in [15]) to obtain the following more general formulation:

$$\min_{\mathbf{A} \in \mathbb{R}^{N \times r}} \|\mathcal{A}(\mathbf{A}\mathbf{A}^T) - \mathbf{1}_M\|_\infty + \lambda\|\mathbf{A}\|_1 \quad \text{subject to} \quad \|\mathbf{A}\|_0 \le \sigma. \tag{8}$$

The $\ell_0$-norm constraint enables us to specify a priori the sparsity of the output of our algorithm. It is also possible to add a constraint on the Frobenius norm of $\mathbf{A}$ directly.

Note that the approach in [4] solves a convex relaxation of (6), where the rank constraint is replaced by a nuclear norm constraint. That solution works in the $N \times N$ space and requires eigendecompositions. In contrast, we do not perform an eigendecomposition in our approach to recover $\mathbf{A}$. Moreover, we can strictly enforce rank constraints while minimizing $\delta$.

Unfortunately, our problem (8) is not smooth and hence we only have access to a subgradient of the objective. Instead, we consider the following smoothed version:

$$\min_{\mathbf{A} \in \mathbb{R}^{N \times r}} f(\mathcal{A}(\mathbf{A}\mathbf{A}^T) - \mathbf{1}_M) + \lambda\|\mathbf{A}\|_1 \quad \text{subject to} \quad \|\mathbf{A}\|_0 \le \sigma. \tag{9}$$

The simplest choice of a smoothing function is $f(\mathbf{z}) = \|\mathbf{z}\|_2^2$, which can be interpreted as penalizing the average $\delta$ for all secant vectors. In this paper, we focus on the smoothing function choice $f(\mathbf{z}) = f_\mu(\mathbf{z}) = \mu \log\left( \sum_{i=1}^{M} \left( e^{z_i/\mu} + e^{-z_i/\mu} \right) \right)$ since $\forall\, \mathbf{z} \in \mathbb{R}^M$, $\lim_{\mu \to 0} f_\mu(\mathbf{z}) = \|\mathbf{z}\|_\infty$. Furthermore, $f_\mu$ is $C^\infty$ on $\mathbb{R}^M$ and its gradient (in $\mathbf{z}$) is Lipschitz continuous with constant $\mu^{-1}$ (e.g., [16]), though unfortunately the gradient is not Lipschitz in $\mathbf{A}$ (when $\mathbf{z} = \mathcal{A}(\mathbf{A}\mathbf{A}^T) - \mathbf{1}_M$).

4. PROXIMAL GRADIENT METHOD

The proximal gradient method (aka forward-backward splitting method) generalizes the basic projected gradient method. This method is used on problems of the type:

$$\min_{\mathbf{A}} f(\mathbf{A}) + \phi(\mathbf{A}) \tag{10}$$

under the assumption that $f$ has a gradient. By allowing $\phi$ to take on infinite values, this can model constraints. For example, the constraint $\mathbf{A} \in \mathcal{C}$ is modeled using the indicator function

$$\iota_{\mathcal{C}}(\mathbf{A}) = \begin{cases} 0 & \mathbf{A} \in \mathcal{C} \\ +\infty & \mathbf{A} \notin \mathcal{C}. \end{cases}$$

Our main tool is the proximity operator “prox”, defined as:

Definition 2. For a fixed $\tau > 0$, the proximity operator of a function $\phi$ is the map

$$\operatorname{prox}_{\tau\phi(\cdot)}(\mathbf{Y}) \in \operatorname*{argmin}_{\mathbf{A}} \ \tau\phi(\mathbf{A}) + \frac{1}{2}\|\mathbf{A} - \mathbf{Y}\|_2^2.$$

The prox is unique and non-expansive if $\phi$ is convex, proper and lower semi-continuous. Note that this reduces to the projection onto $\mathcal{C}$ when $\phi = \iota_{\mathcal{C}}$. The proximal gradient method is listed in Algorithm 1; see [17] for variants, applications and convergence results. We also include a Nesterov accelerated variant due to [18] that requires almost no extra computation and has much faster empirical convergence.

Algorithm 1 General proximal gradient method with Nesterov acceleration
Require: stepsize $\tau$, prox function $\operatorname{prox}_{\phi}$, gradient function $\nabla f(\cdot)$, initial point $\mathbf{A}_0$
1: $\mathbf{Y} \leftarrow \mathbf{A}_0$
2: If using Nesterov acceleration, $\alpha_k \equiv \frac{k}{k+3}$, otherwise, $\alpha_k \equiv 0$
3: for $k = 0, 1, 2, \dots$ do
4:   $\mathbf{A}_{k+1} = \operatorname{prox}_{\tau\phi(\cdot)}(\mathbf{Y} - \tau\nabla f(\mathbf{Y}))$
5:   $\mathbf{Y} = \mathbf{A}_{k+1} + \alpha_k(\mathbf{A}_{k+1} - \mathbf{A}_k)$
6: end for

Convergence. When $\nabla f$ is Lipschitz continuous with parameter $L$, the step-size is $\tau \le 1/L$, and both $f$ and $\phi$ are convex, the sequence $(\mathbf{A}_k)$ converges to a global minimizer of (10). Unfortunately, both $f$ and $\phi$ in our model (9) are non-convex and $\nabla f$ is not Lipschitz. However, if $(\mathbf{A}_k)$ is bounded, then $\nabla f$ is Lipschitz restricted to this set, and also by the boundedness of the sequence and by some technical regularity of the function, the work of [19] guarantees convergence to a local stationary point. In this sense, if the algorithm has converged, we can a posteriori use the boundedness of the sequence to show that the limit point is a local stationary point. This is a slightly different guarantee than that proved in [20].

Computing the gradient. We now restrict this analysis to the log-sum-exp case since the derivation for the quadratic case is even simpler. Define the log-sum-exp function $\operatorname{lse}(\mathbf{z}) = \mu \log\left(\sum_i e^{z_i/\mu}\right)$ which, for a fixed $\mathbf{z}$, converges to $\max_i z_i$ as $\mu \to 0^+$. To approximate $\|\mathbf{z}\|_\infty$ we use $\operatorname{lse}(D(\mathbf{z}))$ where $D(\mathbf{z}) = (\mathbf{z}, -\mathbf{z})$ (with adjoint $(D^*(\mathbf{w}))_i = w_i - w_{i+M}$ for $\mathbf{w} \in \mathbb{R}^{2M}$). Altogether, with $\mathbf{b} = \mathbf{1}_M$ as in (9), we have

$$f(\mathbf{A}) = \operatorname{lse}(D(\mathcal{A}(\mathbf{A}\mathbf{A}^T) - \mathbf{b})).$$

Note that the adjoint of $\mathcal{A}$ is $\mathcal{A}^*(\mathbf{z}) = \mathbf{V}\operatorname{diag}(\mathbf{z})\mathbf{V}^T$, and $(\nabla \operatorname{lse}(\mathbf{x}))_i = \left(\sum_{i'} e^{x_{i'}/\mu}\right)^{-1} e^{x_i/\mu}$. Viewing $f$ as the composition of four functions ($\operatorname{lse}$, $D$, $\mathcal{A}(\cdot) - \mathbf{b}$ and $\mathbf{A} \mapsto \mathbf{A}\mathbf{A}^T$) and using the chain rule gives

$$\nabla f(\mathbf{A}) = 2\mathbf{V}\operatorname{diag}(D^*(\mathbf{z}))\mathbf{V}^T\mathbf{A} \tag{11}$$

with $\mathbf{z} = \nabla \operatorname{lse}(D(\mathcal{A}(\mathbf{A}\mathbf{A}^T) - \mathbf{b}))$.

Computational Complexity. Compared to previous approaches, a major benefit of our approach is much better computational complexity if implemented carefully. The major computational bottleneck of the entire algorithm is in computing $\mathcal{A}(\mathbf{B})$ and $\mathcal{A}^*(\mathbf{z})$. To be efficient, we (i) exploit the fact that $\mathbf{B} = \mathbf{A}\mathbf{A}^T$, and (ii) never explicitly form $\mathcal{A}^*(\mathbf{z})$ as a matrix but rather treat it as an implicit operator that acts on other matrices.

Specifically, $\mathcal{A}(\mathbf{B}) = \operatorname{diag}(\mathbf{V}^T\mathbf{A}\mathbf{A}^T\mathbf{V}) = \operatorname{diag}(\bar{\mathbf{V}}^T\bar{\mathbf{V}})$ for $\bar{\mathbf{V}} = \mathbf{A}^T\mathbf{V}$. This simplifies to taking the squared norms of the columns of $\bar{\mathbf{V}}$, and altogether requires $O(rNM)$ flops, compared to naively computing $\mathbf{B} = \mathbf{A}\mathbf{A}^T$ and then taking $\mathcal{A}(\mathbf{B})$, which requires $O(MN^2)$ flops (recall $r \ll N \ll M$). For $\mathcal{A}^*$, we appeal to (11) directly and compute $\mathbf{V}^T\mathbf{A}$ and then compute the rest of the multiplies, and again we require $O(rNM)$ flops, compared to $O(N^2M)$ naively. Methods based on convex relaxations do not see these numerical speed-ups since their variables $\mathbf{B}$ are generally rank $N$, not $r$. Furthermore, at every iteration these convex methods require SVDs of $\mathbf{B}$, which can cost up to $O(N^3)$.

Computing the proximity operator. By making use of the indicator function we can write the non-smooth portion of (9) as $\phi(\mathbf{A}) = \lambda\|\mathbf{A}\|_1 + \iota_{\{\mathbf{A} \,\mid\, \|\mathbf{A}\|_0 \le \sigma\}}$. In the special case when $\lambda = 0$, the proximity operator of $\phi$ is just the (possibly non-unique) projection onto the top $\sigma$ largest (in magnitude) entries. In the other special case when $\sigma \ge Nr$, so that the $\ell_0$ constraint has no effect, the proximity operator of $\phi$ is just soft-thresholding. In the general case, the proximity operator at the point $\mathbf{z}$ is calculated by first soft-thresholding $\mathbf{z}$ to get $\hat{\mathbf{z}}$ and then calculating $\lambda\tau|z_i| + \frac{1}{2}(\hat{z}_i - z_i)^2$ and choosing the top $\sigma$ components that minimize this (i.e., by sorting). It is even possible to avoid a sort, but implementation details are unimportant since this step is much cheaper than computing the gradient.

Without sparsity constraints. If we drop the sparsity constraint $\|\mathbf{A}\|_0 \le \sigma$ and the $\ell_1$ regularizer, the objective is smooth and unconstrained. In this case, our code uses L-BFGS [21] as implemented in the minFunc package [22], using the gradient described above. Our experiments show L-BFGS is slightly faster than the Nesterov accelerated first-order algorithm, and it also has the advantage that it does not require an accurate initial step-size estimate.

5. EXPERIMENTAL RESULTS

We determine the quality of our approximate solution to the MLP (1) by how small $\delta$ is. In the first set of experiments we investigate how $\delta$ varies with the rank of the matrix we learn using both a set of synthetic data and a set of images of motorbikes [23]. The synthetic data set is the manifold data set used in [4], composed of translating white squares in a black background. We generate manifold images of size 40 × 40 pixels and resize grayscale images of the motorbikes to also be 40 × 40 pixels, resulting in points of dimension $N = 1600$.

We call the implementation of Algorithm 1 for metric learning the Fast Adaptive Metric Learning (FAML) algorithm. Using FAML we learn sparse (with sparsity $\sigma$ fixed at 10%, and $\lambda = 0$) and dense factors of a distance metric of the secant vectors of these points; we initialize the dense version with the PCA matrix, and the sparse version using the dense solution. Figure 1 shows how $\delta$ varies with the rank (number of rows) of the dense (“FAML - dense”) and sparse (“FAML - sparse”) factors of the matrix we learned, and compares to a random projection matrix (“Gaussian”) and a PCA metric of the same data sets (“PCA”). Both our sparse and dense matrices have smaller $\delta$. PCA is performed in a scalable way using a randomized SVD [24].

We also compare the matrices we learn to those matrices learned using NuMax and NuMax CG [4] in terms of smallness of rank and $\delta$. Precisely, we fix a rank and run FAML to obtain a $\delta$, then use this $\delta$ as an input for NuMax and NuMax CG and record the rank they output. The right panel of Figure 1 shows that both versions of FAML outperform both versions of NuMax in the low-rank regime, whereas NuMax does better when the rank $r$ is larger. Note that it is always possible to initialize FAML-dense using the NuMax or NuMax CG solution if we are willing to spend extra time.

Fig. 1. Plots of the relationship between the $\delta$ (mean over 10 trials) and the rank of the matrices we learn for 2775 secants, with $N$ fixed, using the motorbike images of $N = 40 \times 40$. Left panel: a comparison with Gaussian and PCA. Right panel: a comparison with NuMax and NuMax CG.

In the above-mentioned experiment we compare the time taken by FAML to that of NuMax. The left panel of Figure 2 shows that either version of FAML gives smaller $\delta$ in shorter time than NuMax and NuMax CG. Similarly, given an input rank, FAML converges faster than NuMax and NuMax CG, as shown in the right panel of Figure 2.

Fig. 2. Speed comparison in terms of time taken to get a target rank (or equivalently for a target $\delta$). Left panel: a comparison with Gaussian and PCA. Right panel: a comparison with NuMax and NuMax CG.

We also explore the high-$N$ case by taking the full resolution motorbike images from [23], so the dimension becomes $N = 163 \times 261 = 42543$. We select 50 points and generate all 1225 possible secants. Figure 3 shows we can achieve moderate $\delta$ for low ranks, and do much better than PCA. The NuMax and NuMax CG algorithms were run on this data but did not work because they require forming $N \times N$ dense matrices which do not fit in the 8 GB of RAM of the computer. Combined with the results of Figure 1, this suggests FAML outperforms NuMax when $N$ is large or $r$ is small.

[Figure 3 plot: $\delta$ (vertical axis, 0 to 2) versus Rank (horizontal axis, 0 to 40); curves: Gaussian, PCA, FAML − Dense, FAML − Sparse.]

Fig. 3. Motorbike data with $N = 42543$.

6. CONCLUSION

We presented an optimization formulation of the metric learning problem that can handle sparsity and rank constraints. The enforcement of sparsity appears to be novel and may have impact in applications that require sparse matrices (e.g., Low-Density Parity Check codes) for speed or hardware implementation reasons. Our code is low-memory due to careful construction of the algorithm and implementation, and if it converges, it does so rapidly due to Nesterov acceleration and L-BFGS. In either the low-rank or high dimension regime, the method outperforms the NuMax algorithms. For further research we will apply NuMax CG’s column generation techniques in order to handle more secant vectors.

7. REFERENCES

[1] L. Yang and R. Jin, “Distance metric learning: A comprehensive survey,” Michigan State University, pp. 1–51, 2006.

[2] J. V. Davis, B. Kulis, P. Jain, S. Sra, and I. S. Dhillon, “Information-theoretic metric learning,” in International Conf. Machine Learning (ICML), Corvallis, OR, 2007.

[3] B. Kulis, M. Sustik, and I. Dhillon, “Learning low-rank kernel matrices,” in Proceedings of the 23rd International conference on Machine learning (ICML). ACM, 2006, pp. 505–512.

[4] C. Hegde, A.C. Sankaranarayanan, W. Yin, and R.G. Baraniuk, “A convex approach for learning near-isometric linear embeddings,” preparation, August, 2012.

[5] A. Sadeghian, B. Bah, and V. Cevher, “Energy-aware adaptive bi-Lipschitz embeddings,” in 10th International Conference on Sampling Theory and Applications (SampTA). 2013, EURASIP.

[6] E. Grant, C. Hegde, and P. Indyk, “Nearly optimal linear embeddings into very low dimensions,” in IEEE GlobalSIP Symposium on Sensing and Statistical Inference, December 2013.

[7] A. S. Lan, A. E. Waters, C. Studer, and R. G. Baraniuk, “Sparse factor analysis for learning and content analytics,” arXiv preprint arXiv:1303.5685, 2013.

[8] T. N. Sainath, B. Kingsbury, V. Sindhwani, E. Arisoy, and B. Ramabhadran, “Low-rank matrix factorization for deep neural network training with high-dimensional output targets,” in IEEE Intl Conf. Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2013, pp. 6655–6659.

[9] M. Elad, “Optimized projections for compressed sensing,” IEEE Trans. Signal Processing, vol. 55, no. 12, pp. 5695–5702, 2007.

[10] S. Burer and R.D.C. Monteiro, “Local minima and convergence in low-rank semidefinite programming,” Math. Prog. (series A), vol. 103, no. 3, pp. 427–444, 2005.

[11] W. B. Johnson and J. Lindenstrauss, “Extensions of Lipschitz mappings into a Hilbert space,” Contemporary mathematics, vol. 26, no. 189-206, pp. 1, 1984.

[12] E. J. Candes and T. Tao, “Decoding by linear programming,” Information Theory, IEEE Transactions on, vol. 51, no. 12, pp. 4203–4215, 2005.

[13] A. I. Barvinok, “Problems of distance geometry and convex properties of quadratic maps,” Discrete & Computational Geometry, vol. 13, no. 1, pp. 189–202, 1995.

[14] G. Pataki, “On the rank of extreme matrices in semidefinite programs and the multiplicity of optimal eigenvalues,” of Operations Research, vol. 23, no. 2, pp. 339–358, 1998.

[15] A. Kyrillidis and V. Cevher, “Combinatorial selection and least absolute shrinkage via the CLASH algorithm,” in IEEE Intl. Symp. . IEEE, 2012, pp. 2216–2220.

[16] A. Beck and M. Teboulle, “Smoothing and first order methods: A unified framework,” SIAM J. Optimization, vol. 22, pp. 557–580, 2012.

[17] P. L. Combettes and V. R. Wajs, “Signal recovery by proximal forward-backward splitting,” SIAM Multiscale Model. Simul., vol. 4, no. 4, pp. 1168–1200, 2005.

[18] A. Beck and M. Teboulle, “A fast iterative shrinkage-thresholding algorithm for linear inverse problems,” SIAM J. on Imaging Sci., vol. 2, no. 1, pp. 183–202, 2009.

[19] H. Attouch, J. Bolte, and B.F. Svaiter, “Convergence of descent methods for semi-algebraic and tame problems: proximal algorithms, forward–backward splitting, and regularized Gauss-Seidel methods,” Math. Prog., pp. 1–39, 2011.

[20] Z. Wen, W. Yin, and Y. Zhang, “Solving a low-rank factorization model for matrix completion by a nonlinear successive over-relaxation algorithm,” Math. Prog. Comp., pp. 1–29, 2010.

[21] J. Nocedal, “Updating quasi-Newton matrices with limited storage,” Math. Comp., vol. 25, no. 151, pp. 773–782, 1980.

[22] M. Schmidt, minFunc software package, 2012, http://www.di.ens.fr/~mschmidt/Software/minFunc.html.

[23] L. Fei-Fei, R. Fergus, and P. Perona, “Learning generative visual models from few training examples: an incremental Bayesian approach tested on 101 object categories,” in IEEE CVPR 2004, Workshop on Generative-Model Based Vision, 2004.

[24] N. Halko, P.G. Martinsson, and J.A. Tropp, “Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions,” SIAM review, vol. 53, no. 2, pp. 217–288, 2011.
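
Illustrative sketch for Section 4 (computing the gradient and its complexity). The following NumPy fragment is one possible way to evaluate the smoothed cost $f(\mathbf{A}) = \operatorname{lse}(D(\mathcal{A}(\mathbf{A}\mathbf{A}^T) - \mathbf{1}_M))$ and the gradient (11) without ever forming the $N \times N$ matrix $\mathbf{B} = \mathbf{A}\mathbf{A}^T$. It is not the authors' code (which the paper describes as using minFunc); the function names `secant_matrix` and `smoothed_cost_and_grad` and the max-shift used to stabilise the exponentials are assumptions made for this example.

```python
import numpy as np

def secant_matrix(X):
    """Stack the normalized pairwise differences of the rows of X (Definition 1)
    as the columns of an N x M matrix V, with M = p(p-1)/2."""
    p = X.shape[0]
    cols = [(X[i] - X[j]) / np.linalg.norm(X[i] - X[j])
            for i in range(p) for j in range(i + 1, p)]
    return np.stack(cols, axis=1)

def smoothed_cost_and_grad(A, V, mu):
    """Return f(A) = lse(D(A(AA^T) - 1_M)) and the gradient 2 V diag(D^*(z)) V^T A.
    Hypothetical helper: only products with V and A are formed, O(rNM) flops."""
    M = V.shape[1]
    Vbar = A.T @ V                             # r x M
    resid = np.sum(Vbar ** 2, axis=0) - 1.0    # diag(Vbar^T Vbar) - 1_M
    w = np.concatenate([resid, -resid])        # D(z) = (z, -z)
    shift = w.max()                            # assumption: shift for numerical stability
    e = np.exp((w - shift) / mu)
    f = shift + mu * np.log(e.sum())           # mu * log(sum_i exp(w_i / mu))
    z = e / e.sum()                            # gradient of lse, length 2M
    d = z[:M] - z[M:]                          # adjoint D^*(z)
    grad = 2.0 * (V @ (d[:, None] * Vbar.T))   # V diag(d) (V^T A), shape N x r
    return f, grad
```

Since $\operatorname{lse}$ upper-bounds the maximum of its $2M$ arguments by at most $\mu \log(2M)$, the returned cost lies between $\delta = \|\mathcal{A}(\mathbf{A}\mathbf{A}^T) - \mathbf{1}_M\|_\infty$ and $\delta + \mu\log(2M)$, which gives a quick sanity check on any implementation.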
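
Illustrative sketch for Section 4 (computing the proximity operator). For $\phi(\mathbf{A}) = \lambda\|\mathbf{A}\|_1 + \iota_{\{\|\mathbf{A}\|_0 \le \sigma\}}$, keeping entry $i$ at its soft-thresholded value $\hat{z}_i$ instead of setting it to zero lowers the prox objective by $\frac{1}{2}\hat{z}_i^2$, so one way to realise the sort-based rule is to soft-threshold and then keep the $\sigma$ entries of largest magnitude. The sketch below follows that reading; the name `prox_l1_l0` and the tie-breaking implied by `argsort` are assumptions of this example, not details taken from the paper.

```python
import numpy as np

def prox_l1_l0(Z, lam, tau, sigma):
    """Prox of tau * (lam * ||A||_1 + indicator{||A||_0 <= sigma}) at Z.
    Hypothetical helper: soft-threshold, then keep the sigma largest entries."""
    Z = np.asarray(Z, dtype=float)
    # Soft-thresholding with threshold lam * tau (the prox of the l1 term alone).
    Z_hat = np.sign(Z) * np.maximum(np.abs(Z) - lam * tau, 0.0)
    flat = Z_hat.ravel().copy()
    if sigma < flat.size:
        # Zero all but the sigma largest-magnitude entries (ties broken by argsort).
        drop = np.argsort(np.abs(flat))[: flat.size - sigma]
        flat[drop] = 0.0
    return flat.reshape(Z.shape)
```

With `lam = 0` this reduces to the projection onto the $\sigma$ largest-magnitude entries, and with `sigma` at least the number of entries it reduces to plain soft-thresholding, matching the two special cases described in the text.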
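
Illustrative sketch for Algorithm 1. The loop below is a minimal NumPy transcription of the proximal gradient iteration with the extrapolation weight $\alpha_k = k/(k+3)$, run on a deliberately simple convex toy problem rather than on the metric learning objective (9). The function name `proximal_gradient`, the fixed iteration count, and the toy problem $f(\mathbf{a}) = \frac{1}{2}\|\mathbf{a} - \mathbf{c}\|_2^2$, $\phi(\mathbf{a}) = \lambda\|\mathbf{a}\|_1$ are all assumptions chosen so that the minimizer (the soft-thresholded $\mathbf{c}$) is known in closed form.

```python
import numpy as np

def proximal_gradient(grad_f, prox, A0, tau, n_iter=200, nesterov=True):
    """Run n_iter steps of A_{k+1} = prox(Y - tau * grad_f(Y), tau) followed by
    the extrapolation Y = A_{k+1} + alpha_k (A_{k+1} - A_k)."""
    A = np.array(A0, dtype=float)
    Y = A.copy()
    for k in range(n_iter):
        alpha = k / (k + 3.0) if nesterov else 0.0   # alpha_k from Algorithm 1
        A_next = prox(Y - tau * grad_f(Y), tau)      # forward step, then prox
        Y = A_next + alpha * (A_next - A)            # Nesterov extrapolation
        A = A_next
    return A

def soft_threshold(z, t):
    """Prox of t * ||.||_1."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

# Toy problem (assumption): minimise 0.5 * ||a - c||^2 + lam * ||a||_1.
c = np.array([2.0, -0.3, 0.7])
lam = 0.5
a_hat = proximal_gradient(grad_f=lambda a: a - c,
                          prox=lambda z, tau: soft_threshold(z, lam * tau),
                          A0=np.zeros(3), tau=0.5)
```

For the metric learning problem one would pass the gradient (11) and the proximity operator of $\phi$ in place of the toy `grad_f` and `prox`; the convergence caveats stated in Section 4 for the non-convex case then apply.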