Metric Learning with Rank and Sparsity Constraints

Bubacarr Bah†, Stephen Becker⋆, Volkan Cevher†, Baran Gözcü†

† Laboratory for Information and Inference Systems, EPFL, Switzerland
⋆ IBM Research, Yorktown Heights, USA

This work was supported in part by the European Commission under Grant MIRG-268398, ERC Future Proof, SNF 200021-132548, SNF 200021-146750 and SNF CRSII2-147633. The author names are in alphabetical order.

ABSTRACT

Choosing a distance preserving measure or metric is fundamental to many signal processing algorithms, such as k-means, nearest neighbor searches, hashing, and compressive sensing. In virtually all these applications, the efficiency of the signal processing algorithm depends on how fast we can evaluate the learned metric. Moreover, storing the chosen metric can create space bottlenecks in high dimensional signal processing problems. As a result, we consider data dependent metric learning with rank as well as sparsity constraints. We propose a new non-convex algorithm and empirically demonstrate its performance on various datasets; a side benefit is that it is also much faster than existing approaches. The added sparsity constraints significantly improve the speed of multiplying with the learned metrics without sacrificing their quality.

Index Terms: Metric learning, Nesterov acceleration, sparsity, low-rank, proximal gradient methods

1. INTRODUCTION

Learning “good distance” metrics for signals is key in real-world applications, such as data classification and retrieval. For instance, in the k-nearest-neighbor (KNN) classifier, we identify the k-nearest labeled images given a test image in the space of visual features. Hence, it is important to learn metrics that capture the similarity as well as the dissimilarity of datasets. Interestingly, several works have shown that properly chosen distance metrics can significantly benefit KNN accuracy as compared to the usual Euclidean metric [1–3].

In the standard metric learning problem (MLP) we learn a (semi-)norm $\|\mathbf{x}\|_{\mathbf{B}} = \sqrt{\mathbf{x}^T \mathbf{B} \mathbf{x}}$ that respects the dissimilarity constraints of data $\mathbf{x} \in \mathbb{R}^N$ on a set $\mathcal{D}$ while obtaining the best similarity on a set $\mathcal{S}$:

$$
\begin{aligned}
\min_{\mathbf{B} \succeq 0} \quad & \sum_{(\mathbf{x}_i, \mathbf{x}_j) \in \mathcal{S}} \|\mathbf{x}_i - \mathbf{x}_j\|_{\mathbf{B}}^2 \\
\text{subject to} \quad & \sum_{(\mathbf{x}_i, \mathbf{x}_j) \in \mathcal{D}} \|\mathbf{x}_i - \mathbf{x}_j\|_{\mathbf{B}}^2 \geq 1.
\end{aligned}
\tag{1}
$$

Even though (1) is a convex program, its numerical solution proves to be challenging since it does not fit into the standard convex optimization formulations, such as semidefinite or quadratic programming. Moreover, as the number of variables is quadratic in the number of features, even the most basic gradient-only approaches do not scale well [1, 4].

In general, the solution of (1) results in metrics that are full rank. Hence, the learned metric creates storage as well as computational bottlenecks. As a result, several works [2, 3, 5, 6] consider learning rank-constrained metrics. Unfortunately, enforcing rank constraints on the already computationally challenging MLP (1) makes it non-convex. Surprisingly, under certain conditions, it is possible to prove approximation quality and the generalization of solution algorithms in the high-dimensional limit [3, 5].

In this paper, we reformulate the standard MLP (1) into a non-convex optimization framework that can incorporate sparsity constraints in addition to the rank. That is, we learn distance metrics $\mathbf{B} = \mathbf{A}\mathbf{A}^T$ such that $\mathbf{A} \in \mathbb{R}^{N \times r}$ for a given $r \ll N$ and $\mathbf{A}$ is sparse. A sparse $\mathbf{A}$ has computational benefits like low storage and computational complexity. Consequently, this work could be useful in sparse low-rank matrix factorization which has numerous applications in machine learning including learning [7] and deep neural networks (deep learning) [8] and autoencoding. This work is also related to optimizing projection matrices introduced in [9].

Our approach can also incorporate additional convex constraints on $\mathbf{A}$. To illustrate our algorithm, we use a symmetric and non-smooth version of the MLP (1) in the manner of [4, 5]. We then use the Nesterov smoothing approach to obtain a smooth cost that has approximation guarantees to the original problem, followed by Burer-Monteiro splitting [10] with quasi-Newton enhancements. Even without the sparsity constraints, our algorithmic approach is novel and leads to improved results over the previous state-of-the-art.

The paper is organized as follows. Section 2 sets up the notation used in the paper and introduces the needed definitions, while Section 3 describes the problem and its reformulation into an appropriate optimization framework. Section 4 states the algorithm and its theoretical background; it is followed by Section 5, which showcases the experimental results. Section 6 concludes the paper.

2. PRELIMINARIES

Notation: We denote integer scalars by lowercase letters (e.g., $i, k, r, p$), real scalars by lowercase Greek letters (e.g., $\alpha, \delta, \lambda$), sets by uppercase calligraphic letters (e.g., $\mathcal{S}$), vectors by lowercase boldface letters (e.g., $\mathbf{x}$) and matrices by uppercase boldface letters (e.g., $\mathbf{X}$). We denote by $\mathbb{S}_+^{N \times N}$ the set of positive semidefinite (PSD) matrices. The usual $\ell_1$ norm and $\ell_0$ pseudo-norm (number of non-zero entries) are extended to matrices by reshaping the matrix to a vector.

In Section 3, we reformulate (1) into a problem that is tightly connected with learning Johnson-Lindenstrauss (JL) embeddings or restricted isometry property (RIP) dimensionality reduction. A matrix $\mathbf{\Phi} \in \mathbb{R}^{r \times N}$ is a JL embedding of $p$ points, $\mathcal{X} := \{\mathbf{x}_i\}_{i=1}^{p}$, if $r = O(\log p)$ and there exists a positive constant $\delta > 0$ such that the following relation holds:

$$
(1 - \delta)\|\mathbf{x}_i - \mathbf{x}_j\|_2^2 \leq \|\mathbf{\Phi}(\mathbf{x}_i - \mathbf{x}_j)\|_2^2 \leq (1 + \delta)\|\mathbf{x}_i - \mathbf{x}_j\|_2^2,
\tag{2}
$$

for every $\mathbf{x}_i, \mathbf{x}_j \in \mathcal{X}$, $\mathbf{x}_i \neq \mathbf{x}_j$ [11, 12]. Using the definition of the metric in Section 1, we can rewrite (2) as follows:

$$
(1 - \delta)\|\mathbf{x}_i - \mathbf{x}_j\|_2^2 \leq \|\mathbf{x}_i - \mathbf{x}_j\|_{\mathbf{B}}^2 \leq (1 + \delta)\|\mathbf{x}_i - \mathbf{x}_j\|_2^2,
\tag{3}
$$

where $\mathbf{B} = \mathbf{\Phi}^T \mathbf{\Phi}$. Without loss of generality, to simplify the algebra, we are going to enforce the similarity or dissimilarity constraints on normalized differences of data points:

Definition 1. Given $\mathcal{X} \subset \mathbb{R}^N$ we define the set of secant vectors of points $\mathbf{x}_i, \mathbf{x}_j \in \mathcal{X}$ with $\mathbf{x}_i \neq \mathbf{x}_j$ as:

$$
\mathcal{S}(\mathcal{X}) := \left\{ \mathbf{v}_{ij} = \frac{\mathbf{x}_i - \mathbf{x}_j}{\|\mathbf{x}_i - \mathbf{x}_j\|_2} \right\}.
\tag{4}
$$

3. PROBLEM DESCRIPTION

In this section, we set up the basic optimization problem that reveals the computational difficulty of (1). In general, the MLP (1) considers relationships between points based on their pairwise distances. Hence, we would require that the metric $\mathbf{B}$ preserves the pairwise distances of the points in $\mathcal{D}$ up to a distortion parameter $\delta$ as in (3) to yield a more stringent constraint for (1) while ignoring this requirement for points in $\mathcal{S}$. However, we set up a symmetric problem which then uses RIP in the manner of [4, 5] and can be adjusted depending on the individual application.

In this symmetric formulation, using secant vectors, (3) simplifies to $|\mathbf{v}_{ij}^T \mathbf{B} \mathbf{v}_{ij} - 1| \leq \delta$ for $\mathbf{x}_i, \mathbf{x}_j \in \mathcal{X}$ with $\mathbf{x}_i \neq \mathbf{x}_j$. Re-indexing the $\mathbf{v}_{ij}$ to $\mathbf{v}_l$ for $l = 1, \ldots, M$, where $M = \binom{p}{2}$, we form the set of $M$ secant vectors $\mathcal{S}(\mathcal{X}) = \{\mathbf{v}_1, \ldots, \mathbf{v}_M\}$ into an $N \times M$ matrix $\mathbf{V} = [\mathbf{v}_1, \ldots, \mathbf{v}_M]$. Then we learn a $\mathbf{B}$ that minimizes $|\mathbf{v}_l^T \mathbf{B} \mathbf{v}_l - 1|$ (with a slight abuse of notation we will refer to this quantity as $\delta$) over all $\mathbf{v}_l \in \mathcal{S}(\mathcal{X})$.

Let us define a linear transform $\mathcal{A} : \mathbb{S}_+^{N \times N} \to \mathbb{R}^M$ as $\mathcal{A}(\mathbf{B}) := \operatorname{diag}(\mathbf{V}^T \mathbf{B} \mathbf{V})$, where $\operatorname{diag}(\mathbf{H})$ denotes a vector of the entries of the principal diagonal of the matrix $\mathbf{H}$. Learning the $\mathbf{B}$ that minimizes $\delta$ for our symmetric MLP (1) therefore becomes the following non-convex problem:

$$
\begin{aligned}
\min_{\mathbf{B}} \quad & \|\mathcal{A}(\mathbf{B}) - \mathbf{1}_M\|_\infty \\
\text{subject to} \quad & \mathbf{B} \succeq 0, \ \operatorname{rank}(\mathbf{B}) = r.
\end{aligned}
\tag{6}
$$

Instead of learning $\mathbf{B}$ directly, we opt to learn its factors, i.e., $\mathbf{B} = \mathbf{A}\mathbf{A}^T$, which is also known as Burer-Monteiro splitting [10]:

$$
\min_{\mathbf{A} \in \mathbb{R}^{N \times r}} \|\mathcal{A}(\mathbf{A}\mathbf{A}^T) - \mathbf{1}_M\|_\infty
\tag{7}
$$

The advantages of the non-convex formulation (7) are twofold: (i) it reduces storage space since the optimization variable lives in a much lower dimensional space (i.e., $N \times r \ll N \times N$), and (ii) it enables us to add additional constraints and regularizers on the factors $\mathbf{A}$ directly.

For instance, while the rank constraint can be achieved by constraining the dimension of $\mathbf{A}$ to be $N \times r$, we can also consider adding an $\ell_0$-norm constraint as well as an $\ell_1$-regularizer term (as in [15]) to obtain the following more general formulation:

$$
\begin{aligned}
\min_{\mathbf{A} \in \mathbb{R}^{N \times r}} \quad & \|\mathcal{A}(\mathbf{A}\mathbf{A}^T) - \mathbf{1}_M\|_\infty + \lambda \|\mathbf{A}\|_1 \\
\text{subject to} \quad & \|\mathbf{A}\|_0 \leq \sigma.
\end{aligned}
\tag{8}
$$

The $\ell_0$-norm constraint enables us to specify a priori the sparsity of the output of our algorithm. It is also possible to add a constraint on the Frobenius norm of $\mathbf{A}$ directly.

Note that the approach in [4] solves a convex relaxation of (6), where the rank constraint is replaced by a nuclear norm constraint. That solution works in the $N \times N$ space and requires eigendecompositions. In contrast, we do not perform an eigendecomposition in our approach to recover $\mathbf{A}$. Moreover, we can strictly enforce rank constraints while minimizing $\delta$.

Unfortunately, our problem (8) is not smooth and hence we only have access to a subgradient of the objective. Instead, we consider the following smoothed version:

$$
\begin{aligned}
\min_{\mathbf{A} \in \mathbb{R}^{N \times r}} \quad & f(\mathcal{A}(\mathbf{A}\mathbf{A}^T) - \mathbf{1}_M) + \lambda \|\mathbf{A}\|_1 \\
\text{subject to} \quad & \|\mathbf{A}\|_0 \leq \sigma.
\end{aligned}
\tag{9}
$$

The simplest choice of a smoothing function is $f(\mathbf{z}) = \|\mathbf{z}\|_2^2$, which can be interpreted as penalizing the average $\delta$ for all secant vectors. In this paper, we focus on the smoothing function [...] $\sum^{M}$ [...] $z_i/\mu$ [...] $-z_i/\mu$ [...]
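
To make the quantities of Sections 2 and 3 concrete, the following is a minimal sketch in plain Python; it is an illustration, not the authors' implementation. It builds the secant vectors of Definition 1, evaluates the map applied to the factored metric entrywise as the squared norm of A^T v_l (so the N × N matrix B is never formed), and computes the objective of (8) without the sparsity constraint. All function names are hypothetical, and the data-fit norm is assumed here to be the largest deviation over all secants, i.e. the quantity the text calls δ.

```python
import itertools
import math


def secant_vectors(points):
    """Normalized pairwise differences (Definition 1), one per unordered pair."""
    secants = []
    for x_i, x_j in itertools.combinations(points, 2):
        diff = [a - b for a, b in zip(x_i, x_j)]
        norm = math.sqrt(sum(d * d for d in diff))
        if norm == 0.0:
            continue  # Definition 1 only considers distinct points
        secants.append([d / norm for d in diff])
    return secants


def apply_map(A, secants):
    """Entries v_l^T (A A^T) v_l, computed as ||A^T v_l||_2^2 without forming B."""
    n_rows, rank = len(A), len(A[0])
    values = []
    for v in secants:
        proj = [sum(A[n][k] * v[n] for n in range(n_rows)) for k in range(rank)]
        values.append(sum(p * p for p in proj))
    return values


def distortion(A, secants):
    """Largest |v_l^T A A^T v_l - 1| over all secants (assumed max-norm reading)."""
    return max(abs(q - 1.0) for q in apply_map(A, secants))


def objective(A, secants, lam):
    """Distortion plus lam times the entrywise l1 norm of A (sparsity constraint omitted)."""
    l1 = sum(abs(a) for row in A for a in row)
    return distortion(A, secants) + lam * l1


def l0_norm(A):
    """Number of non-zero entries of A."""
    return sum(1 for row in A for a in row if a != 0.0)


# Toy data: p = 3 points in R^3 (N = 3) that all lie in the xy-plane.
points = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 2.0, 0.0]]
secants = secant_vectors(points)  # M = 3 secant vectors

# Rank-2 factor that keeps the first two coordinates (N = 3, r = 2).
A = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
print(distortion(A, secants), l0_norm(A))
```

Because the toy points have no third coordinate, the rank-2 factor above preserves every secant norm and the distortion is zero up to rounding. A rank-1 factor such as `[[1.0], [0.0], [0.0]]` collapses the secant along the second axis, so its distortion is 1.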
