Similarity Learning for Provably Accurate Sparse Linear Classification

Aurélien Bellet [email protected]
Amaury Habrard [email protected]
Marc Sebban [email protected]
Laboratoire Hubert Curien UMR CNRS 5516, Université Jean Monnet, 42000 Saint-Etienne, France

Abstract

In recent years, the crucial importance of metrics in machine learning algorithms has led to an increasing interest for optimizing distance and similarity functions. Most of the state of the art focus on learning Mahalanobis distances (requiring to fulfill a constraint of positive semi-definiteness) for use in a local k-NN algorithm. However, no theoretical link is established between the learned metrics and their performance in classification. In this paper, we make use of the formal framework of (ǫ,γ,τ)-good similarities introduced by Balcan et al. to design an algorithm for learning a non PSD linear similarity optimized in a nonlinear feature space, which is then used to build a global linear classifier. We show that our approach has uniform stability and derive a generalization bound on the classification error. Experiments performed on various datasets confirm the effectiveness of our approach compared to state-of-the-art methods and provide evidence that (i) it is fast, (ii) robust to overfitting and (iii) produces very sparse classifiers.

1. Introduction

The notion of (dis)similarity plays an important role in many machine learning problems such as classification, clustering or ranking. For this reason, researchers have studied, in practical and formal ways, what it means for a pairwise similarity function to be "good". Since manually tuning such functions can be difficult and tedious for real-world problems, a lot of work has gone into automatically learning them from labeled data, leading to the emergence of supervised similarity and metric learning. Generally speaking, these approaches are based on the reasonable intuition that a good similarity function should assign a large (resp. small) score to pairs of points of the same class (resp. different class). Following this idea, they aim at finding the parameters (usually a matrix) of the function such that it best satisfies these local pair-based constraints. Among these methods, Mahalanobis distance learning (Schultz & Joachims, 2003; Shalev-Shwartz et al., 2004; Davis et al., 2007; Jain et al., 2008; Weinberger & Saul, 2009; Ying et al., 2009) has attracted a lot of interest, because it has a nice geometric interpretation: the goal is to learn a positive semi-definite (PSD) matrix which linearly projects the data into a new feature space where the standard Euclidean distance performs well. Some work has also gone into learning arbitrary similarity functions with no PSD constraint to make the problem easier to solve (Chechik et al., 2009; Qamar, 2010). The (dis)similarities learned with the above-mentioned methods are typically plugged in a k-Nearest Neighbor (k-NN) classifier (whose decision rule is based on a local neighborhood) and often lead to greater accuracy than the standard Euclidean distance, although no theoretical evidence supports this behavior. However, it seems unclear whether they can be successfully used in the context of global classifiers, such as linear separators.

Recently, Balcan et al. (2008) introduced the formal notion of (ǫ,γ,τ)-good similarity function, which does not require positive semi-definiteness and is less restrictive than local pair-based constraints.

Indeed, it basically says that for most points, the average similarity scores to some points of the same class should be greater than to some points of a different class. Assuming this property holds, generalization guarantees can be derived in terms of the error of a sparse linear classifier built from this similarity function.

In this paper, we use the notion of (ǫ,γ,τ)-goodness to design a new similarity learning algorithm. This novel approach, called SLLC (Similarity Learning for Linear Classification), has several advantages: (i) it is tailored to linear classifiers, (ii) it is theoretically well-founded, (iii) it does not require positive semi-definiteness, and (iv) it is in a sense less restrictive than pair-based settings. We formulate the problem of learning a good similarity function as an efficient convex quadratic program which optimizes a bilinear similarity. Furthermore, by using the Kernel Principal Component Analysis (KPCA) trick (Chatpatanasiri et al., 2010), we are able to kernelize our algorithm and thereby learn more powerful similarity functions and classifiers in the nonlinear feature space induced by a kernel. From the theoretical point of view, we show that our approach has uniform stability (Bousquet & Elisseeff, 2002), which leads to generalization guarantees in terms of the (ǫ,γ,τ)-goodness of the learned similarity. Lastly, we provide an experimental study on seven datasets from various domains and compare SLLC with two widely-used metric learning approaches. This study demonstrates the practical effectiveness of our method and shows that it is fast, robust to overfitting and induces very sparse classifiers, making it suitable for dealing with high-dimensional data.

The rest of the paper is organized as follows. Section 2 reviews some past work in similarity and metric learning and introduces the theory of (ǫ,γ,τ)-good similarities. Section 3 presents our approach, SLLC, and the KPCA trick used to kernelize it. Section 4 provides a theoretical analysis of SLLC, leading to the derivation of generalization guarantees. Finally, Section 5 features an experimental study on various datasets.

2. Notations and Related Work

We denote vectors by lower-case bold symbols (x) and matrices by upper-case bold symbols (A). We consider labeled points z = (x, ℓ) drawn from an unknown distribution P over R^d × {−1, 1}. A similarity function is defined by K : R^d × R^d → [−1, 1]. We denote the L2-norm by ‖·‖_2 and the Frobenius norm by ‖·‖_F. Lastly, [1 − c]_+ = max(0, 1 − c) denotes the hinge loss.

2.1. Metric and Similarity Learning

Supervised metric and similarity learning aims at finding the parameters (usually a matrix) of a (dis)similarity function such that it best satisfies local constraints derived from the class labels. These constraints are typically pair-based ("examples x and x′ should be similar/dissimilar") or triplet-based ("x should be more similar to x′ than to x′′").
A great deal of work has focused on learning a (squared) Mahalanobis distance defined by d²_M(x, x′) = (x − x′)^T M (x − x′) and parameterized by the PSD matrix M ∈ R^{d×d}. A Mahalanobis distance implicitly corresponds to computing the Euclidean distance after some linear projection of the data. The PSD constraint ensures that this interpretation holds and makes d_M a (pseudo)metric, which enables k-NN speed-ups based on (for instance) the triangle inequality. The methods available in the literature mainly differ by their choices of objective/loss function and regularizer on M. For instance, Schultz & Joachims (2003) require examples to be closer to similar examples than to dissimilar ones by a certain margin. Weinberger & Saul (2009) define an objective function related to the k-NN error on the training set. Davis et al. (2007) regularize using the LogDet divergence (for its automatic enforcement of PSD) while Ying et al. (2009) use the (2,1)-norm (which favors a low-rank M). There also exist purely online methods (Shalev-Shwartz et al., 2004; Jain et al., 2008). The bottleneck of many of these approaches is to enforce the PSD constraint on M, although some manage to reduce this computational burden by developing specific solvers. There has also been some interest in learning arbitrary similarity functions with no PSD requirement (Chechik et al., 2009; Qamar, 2010).

All of the previously-mentioned learned similarities are used in the context of nearest-neighbors approaches (sometimes in clustering), which are based on local neighborhoods. In practice, they often outperform standard similarities, although no theoretical argument supports this behavior.

However, these local constraints do not seem appropriate to learn a similarity function for use in global classifiers, such as linear separators. The theory presented in the next section introduces a new, different notion of the goodness of a similarity function, and shows that such a good similarity achieves bounded error in linear classification. This opens the door to similarity learning for improving linear classifiers.

2.2. Learning with Good Similarity Functions

In recent work, Balcan et al. (2008) introduced a new theory of learning with good similarity functions, based on the following definition.

Definition 1 (Balcan et al., 2008) A similarity function K is an (ǫ,γ,τ)-good similarity function in hinge loss for a learning problem P if there exists a (random) indicator function R(x) defining a (probabilistic) set of "reasonable points" such that the following conditions hold:

1. E_{(x,ℓ)∼P} [ [1 − ℓ g(x)/γ]_+ ] ≤ ǫ, where g(x) = E_{(x′,ℓ′)∼P} [ ℓ′ K(x, x′) | R(x′) ],

2. Pr_{x′} [R(x′)] ≥ τ.

Thinking of this definition in terms of number of margin violations, we can interpret the first condition as "an ǫ proportion of examples x are on average 2γ more similar to random reasonable examples of the same class than to random reasonable examples of the opposite class" and the second condition as "at least a τ proportion of the examples should be reasonable".
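To make Definition 1 concrete, the following sketch (ours, not part of the original paper) estimates the empirical hinge-loss goodness of a fixed similarity function on a labeled sample, for a chosen margin gamma and a chosen subset of the sample playing the role of the reasonable points; all function and variable names are illustrative.

```python
import numpy as np

def empirical_goodness(K, X, y, reasonable_idx, gamma):
    """Empirical estimate of condition 1 of Definition 1: the average hinge loss
    of l * g(x) / gamma, where g(x) is the mean of l' * K(x, x') over the
    reasonable points. X is (n, d), y takes values in {-1, +1}."""
    XR, yR = X[reasonable_idx], y[reasonable_idx]
    losses = []
    for x, label in zip(X, y):
        g = np.mean([ly * K(x, xr) for xr, ly in zip(XR, yR)])   # g(x)
        losses.append(max(0.0, 1.0 - label * g / gamma))         # hinge at margin gamma
    return np.mean(losses)   # a small value means the similarity is "good" on this sample

# e.g. with the plain dot product as similarity (hypothetical data X, y):
# eps_hat = empirical_goodness(lambda a, b: float(a @ b), X, y, range(20), gamma=0.1)
```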
This definition is interesting in three respects. First, it does not impose positive semi-definiteness nor symmetry, which are requirements that may rule out the most natural similarity functions for some tasks. Second, it is based on an average over some points, which is less restrictive than pair or triplet-based settings. Third, satisfying Definition 1 is sufficient to learn well (Theorem 1).

Theorem 1 (Balcan et al., 2008) Let K be an (ǫ,γ,τ)-good similarity function in hinge loss for a learning problem P. For any ǫ_1 > 0 and 0 ≤ δ ≤ γǫ_1/4, let S = {x′_1, ..., x′_{d_land}} be a (potentially unlabeled) sample of d_land = (2/τ)(log(2/δ) + 16 log(2/δ)/(ǫ_1 γ)²) landmarks drawn from P. Consider the mapping φ^S : R^d → R^{d_land} defined as follows: φ^S_i(x) = K(x, x′_i), i ∈ {1, ..., d_land}. Then, with probability at least 1 − δ over the random sample S, the induced distribution φ^S(P) in R^{d_land} has a linear separator α of error at most ǫ + ǫ_1 at margin γ.

In other words, if we are given an (ǫ,γ,τ)-good similarity function for a learning problem P and enough points (the "landmarks"), there exists a linear separator α with error arbitrarily close to ǫ, which lies in the explicit "φ-space" (the space of the similarity scores to the landmarks). As Balcan et al. mention, using d_u (potentially unlabeled) landmark examples and d_l labeled examples, we can estimate this separator α ∈ R^{d_u} by solving the following linear program (the original formulation proposed by Balcan et al. (2008) was actually L1-constrained; we turned it into an equivalent L1-regularized one):

min_α  Σ_{i=1}^{d_l} [ 1 − Σ_{j=1}^{d_u} α_j ℓ_i K(x_i, x′_j) ]_+ + λ ‖α‖_1.    (1)

Note that Problem (1) is essentially an L1-regularized linear SVM (Zhu et al., 2003) with an empirical similarity map (Balcan et al., 2008), and can be efficiently solved. The L1-regularization induces sparsity (zero coordinates) in α, which reduces the number of training examples the classifier is based on, speeding up prediction. We can control the amount of sparsity by using the parameter λ (the larger λ, the sparser α).

To sum up, the performance of the linear classifier theoretically depends on how well the similarity function satisfies Definition 1. However, for some problems, standard similarity functions may satisfy the definition poorly, leading to weak guarantees. To deal with this limitation, Kar & Jain (2011) propose to automatically adapt the goodness criterion to the problem at hand. In this paper, we take a different approach: we see Definition 1 as a novel, theoretically well-founded objective function for similarity learning.

3. Learning Good Similarity Functions for Linear Classification

We consider K_A(x, x′) = x^T A x′, a bilinear similarity parameterized by the matrix A ∈ R^{d×d}, which is not constrained to be PSD nor symmetric. This form of similarity function was successfully used in the context of large-scale online image similarity learning (Chechik et al., 2009). K_A has the advantage of being efficiently computable when the inputs x and x′ are sparse vectors. In order to satisfy the condition K_A ∈ [−1, 1], we assume that inputs are normalized such that ‖x‖_2 ≤ 1, and we require ‖A‖_F ≤ 1.

3.1. Similarity Learning Formulation

Our goal is to optimize the (ǫ,γ,τ)-goodness of K_A on a finite-size sample. To this end, we are given a training sample of N_T labeled points T = {z_i = (x_i, ℓ_i)}_{i=1}^{N_T} and a sample of N_R labeled reasonable points R = {z_k = (x_k, ℓ_k)}_{k=1}^{N_R}. In practice, R is a subset of T with N_R = τ̂ N_T (τ̂ ∈ ]0, 1]). In the absence of background knowledge, it can be drawn randomly or according to some criterion (e.g., diversity (Kar & Jain, 2011)). Given R and a margin γ, let V(A, z_i, R) = [ 1 − ℓ_i (1/(γ N_R)) Σ_{k=1}^{N_R} ℓ_k K_A(x_i, x_k) ]_+ denote the empirical goodness of K_A with respect to a single training point z_i, and ǫ_T = (1/N_T) Σ_{i=1}^{N_T} V(A, z_i, R) the empirical goodness over T.

Now, we want to learn the matrix A that minimizes ǫ_T. This can be done by solving the following regularized problem, referred to as SLLC (Similarity Learning for Linear Classification):

min_{A ∈ R^{d×d}}  ǫ_T + β ‖A‖²_F

where β is a regularization parameter. Note that an equivalent constrained formulation can be obtained by rewriting the sum of N_T hinge losses in the objective function as N_T margin constraints and introducing N_T slack variables in the objective.
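The paper solves SLLC as a convex quadratic program with an off-the-shelf solver (Mosek, see Section 5). Purely to illustrate the objective, here is a minimal NumPy subgradient sketch of the regularized formulation; the projection onto ‖A‖_F ≤ 1, the step size and the number of iterations are our own choices, not the authors' implementation.

```python
import numpy as np

def train_sllc(X, y, R_idx, gamma, beta, lr=0.1, epochs=200):
    """Subgradient sketch of the SLLC objective
       (1/N_T) * sum_i [1 - (l_i/(gamma*N_R)) * sum_k l_k x_i^T A x_k]_+ + beta*||A||_F^2.
    Assumes ||x||_2 <= 1 and y in {-1, +1}. R_idx indexes the reasonable points."""
    n, d = X.shape
    XR, yR = X[R_idx], y[R_idx]
    m = (yR[:, None] * XR).mean(axis=0) / gamma     # (1/(gamma*N_R)) * sum_k l_k x_k
    A = np.zeros((d, d))
    for _ in range(epochs):
        margins = y * (X @ A @ m)                   # l_i * x_i^T A m for all training points
        viol = margins < 1.0                        # active hinge terms
        grad = 2.0 * beta * A
        if viol.any():
            # subgradient of the averaged hinge terms: -(1/N_T) * sum_i l_i x_i m^T
            grad -= np.outer((y[viol, None] * X[viol]).sum(axis=0), m) / n
        A -= lr * grad
        norm = np.linalg.norm(A)                    # keep ||A||_F <= 1 so K_A stays in [-1, 1]
        if norm > 1.0:
            A /= norm
    return A
```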
SLLC is radically different from classic metric and similarity learning algorithms, which are based on pair or triplet-based constraints. It learns a global similarity rather than a local one, since R is the same for each training example. Moreover, the constraints are easier to satisfy since they are defined over an average of similarity scores to the points in R instead of over a single pair or triplet. This means that one can fulfill a constraint without satisfying the margin for each point in R individually. SLLC also has a number of desirable properties: (i) it is a convex quadratic program, which can be solved efficiently using standard solvers; no costly semi-definite programming is required, as opposed to many Mahalanobis distance learning methods. (ii) In the constrained formulation, there is only one constraint per training example (instead of one for each pair or triplet), i.e., a total of N_T constraints and N_T + d² variables. (iii) The size of R does not affect the complexity of the constraints. (iv) If x_i is sparse, then the associated constraint is sparse as well (some variables of the problem do not appear).

3.2. Kernelization of SLLC

The framework presented in the previous section is theoretically well-founded with respect to Balcan et al.'s theory and has some generalization guarantees, as we will see in the next section. Moreover, it has the advantage of being very simple: we learn a global linear similarity and use it to build a global linear classifier. In order to learn more powerful similarities (and therefore classifiers), we propose to kernelize the approach by learning them in the nonlinear feature space induced by a kernel. Kernelization allows linear classifiers such as Support Vector Machines or some Mahalanobis distance learning algorithms (e.g., Shalev-Shwartz et al., 2004; Davis et al., 2007) to learn nonlinear decision boundaries or transformations. However, kernelizing a metric learning algorithm is not trivial: a new formulation of the problem has to be derived, where interface to the data is limited to inner products, and sometimes a different implementation is necessary. Moreover, when kernelization is possible, one must learn an N_T × N_T matrix. As N_T gets large, the problem becomes intractable unless dimensionality reduction is applied.

For these reasons, we instead use the KPCA trick, recently proposed by Chatpatanasiri et al. (2010). It provides a straightforward way to kernelize a metric learning algorithm while performing dimensionality reduction at no additional cost. The idea is to use Kernel Principal Component Analysis (Schölkopf et al., 1998) to project the data into a new space using a nonlinear kernel function, and to keep only a chosen number of dimensions (those that capture best the overall variance of the data). The data are then projected into this new feature space, and the (unchanged) metric learning algorithm can be used to learn a metric in that space. Chatpatanasiri et al. (2010) showed that the KPCA trick is theoretically sound for unconstrained metric and similarity learning algorithms (they proved representer theorems), which includes SLLC. Throughout the rest of this paper, we will only consider the kernelized version of SLLC.

Generally speaking, kernelizing a metric learning algorithm may cause or increase overfitting, especially when data are scarce and/or high-dimensional. However, since our entire framework is linear and global, we expect our method to be quite robust to this effect. This will be doubly confirmed in the rest of this paper: experimentally in Section 5, but also theoretically with the derivation in the following section of generalization guarantees independent from the size of the projection space.
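As a rough illustration of the KPCA trick (the paper does not provide code; scikit-learn is our choice, not the authors'), one can project the data once with Kernel PCA and then run the unchanged learner in the resulting space.

```python
import numpy as np
from sklearn.decomposition import KernelPCA

def kpca_project(X_train, X_test, n_components, sigma):
    """Map the data into a nonlinear feature space with an RBF kernel
    k(x, x') = exp(-||x - x'||^2 / (2*sigma^2)), keeping n_components axes.
    The (unchanged) similarity learning algorithm is then run in that space."""
    kpca = KernelPCA(n_components=n_components, kernel="rbf",
                     gamma=1.0 / (2.0 * sigma ** 2))
    Z_train = kpca.fit_transform(X_train)           # fit on training data only
    Z_test = kpca.transform(X_test)
    scale = np.linalg.norm(Z_train, axis=1).max()   # rescale so training points satisfy ||z||_2 <= 1
    return Z_train / scale, Z_test / scale

# Z_tr, Z_te = kpca_project(X_train, X_test, n_components=3 * X_train.shape[1], sigma=1.0)
# A = train_sllc(Z_tr, y_train, R_idx, gamma=0.01, beta=1e-3)   # SLLC in the KPCA space
```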

4. Theoretical Analysis

In this section, we present a theoretical analysis of our approach. Our main result is the derivation of a generalization bound (Theorem 3) guaranteeing the consistency of SLLC and thus the (ǫ,γ,τ)-goodness for the considered task. In our framework, the similarity is optimized according to a set R of reasonable points coming from the training sample. Therefore, these reasonable points may not follow the distribution from which the training sample has been generated. To cope with this situation, we propose to derive a generalization bound according to the framework of uniform stability (Bousquet & Elisseeff, 2002), which does not assume an i.i.d. draw at the pair level.

4.1. Uniform Stability

Roughly speaking, an algorithm is stable if its output does not change significantly under a small modification of the training sample. This variation is required to be bounded in O(1/N_T) in terms of infinite norm.

Definition 2 (Bousquet & Elisseeff, 2002) A learning algorithm has a uniform stability in κ/N_T w.r.t. a loss function L, with κ a positive constant, if

∀T, ∀i, 1 ≤ i ≤ N_T,  sup_z |L(M_T, z) − L(M_{T^i}, z)| ≤ κ/N_T,

where M_T is the model learned from the sample T, and M_{T^i} the model learned from the sample T^i. T^i is obtained from T by replacing the i-th example z_i ∈ T by another example z′_i independent from T and drawn from P. L(M, z) is the loss for an example z.

In this definition, T^i characterizes the notion of small modification of the training sample. When this definition is fulfilled, Bousquet & Elisseeff (2002) have shown that the following generalization bound holds.

Theorem 2 (Bousquet & Elisseeff, 2002) Let δ > 0 and N_T > 1. For any algorithm with uniform stability κ/N_T using a loss function bounded by 1, with probability 1 − δ over the random draw of T:

L(M_T) < L̂_T(M_T) + κ/N_T + (2κ + 1) √( ln(1/δ) / (2 N_T) ),

where L(M_T) is the expected loss and L̂_T(M_T) its empirical estimate over T.

4.2. Generalization Bound

For convenience, given a bilinear model K_A, we denote by A_R both the similarity defined by the matrix A and its associated set of reasonable points R (when it is clear from the context we may omit the subscript R). Given a similarity A_R, V(A_R, z, R) plays the role of the loss function over one example z. The loss over the sample T is defined as ǫ_T(A_R) = (1/N_T) Σ_{i=1}^{N_T} V(A_R, z_i, R), and corresponds to the empirical goodness. Lastly, the expected loss over the true distribution is given by ǫ(A_R) = E_{z=(x,ℓ)∼P} V(A_R, z, R), and corresponds to the goodness in generalization. When it is clear from the context we may simply use ǫ_T and ǫ.

In our case, to prove the uniform stability property we need to show that

∀T, ∀i,  sup_z |V(A, z, R) − V(A^i, z, R^i)| ≤ κ/N_T,

where A is learned from T, R ⊆ T, A^i is the matrix learned from T^i and R^i ⊆ T^i is the set of reasonable points associated to T^i. Note that R and R^i are of equal size and can differ in at most one example, depending on whether z_i or z′_i belong to their corresponding set of reasonable points. For the sake of simplicity, we consider V bounded by 1 (which can be easily obtained by dividing it by the constant 1 + 1/γ). To show this property, we need the following results.

Lemma 1 For any labeled examples z = (x, ℓ), z′ = (x′, ℓ′) and any models A_R, A′_{R′}, the following holds:

P1: |K_A(x, x′)| ≤ 1,

P2: |K_A(x, x′) − K_{A′}(x, x′)| ≤ ‖A − A′‖_F,

P3: |V(A, z, R) − V(A′, z, R′)| ≤ | (1/(γ N_R)) Σ_{k=1}^{N_R} ℓ_k K_A(x, x_k) − (1/(γ N_{R′})) Σ_{k=1}^{N_{R′}} ℓ_k K_{A′}(x, x_k) |   (1-admissibility property of V).

Proof P1 comes from |K_A(x, x′)| ≤ ‖x‖_2 ‖A‖_F ‖x′‖_2, the normalization on examples (‖x‖_2 ≤ 1) and the requirement on matrices (‖A‖_F ≤ 1). For P2, we observe that |K_A(x, x′) − K_{A′}(x, x′)| = |K_{A−A′}(x, x′)|, and we use the normalization ‖x‖_2 ≤ 1. P3 follows directly from |ℓ| = 1 and the 1-Lipschitz property of the hinge loss: |[X]_+ − [Y]_+| ≤ |X − Y|.

Let F_T = ǫ_T(A) + β ‖A‖²_F be the objective function of SLLC w.r.t. a sample T and a set of reasonable points R ⊆ T. The following lemma bounds the deviation between A and A^i.

Lemma 2 For any models A and A^i that are minimizers of F_T and F_{T^i} respectively, we have:

‖A − A^i‖_F ≤ 1/(β N_T γ).

Proof We follow closely the proof of Lemma 20 of (Bousquet & Elisseeff, 2002) and omit some details due to the limitation of space. Let ΔA = A − A^i, 0 ≤ t ≤ 1,

M_1 = ‖A‖²_F − ‖A + tΔA‖²_F + ‖A^i‖²_F − ‖A^i − tΔA‖²_F  and
M_2 = (1/(β N_T)) ( ǫ_T(A_R) − ǫ_T((A + tΔA)_R) + ǫ_{T^i}((A + tΔA)_R) − ǫ_{T^i}(A_R) ).

Using the fact that F_T and F_{T^i} are convex functions, that A and A^i are their respective minimizers and property P3, we have M_1 ≤ M_2. Fixing t = 1/2, we obtain M_1 = ‖A − A^i‖²_F, and using property P3 and the normalization ‖x‖_2 ≤ 1, we get:

M_2 ≤ (1/(β N_T γ)) ( ‖(1/2)ΔA‖_F + ‖−(1/2)ΔA‖_F ) = ‖A − A^i‖_F / (β N_T γ).

This leads to the inequality ‖A − A^i‖²_F ≤ ‖A − A^i‖_F / (β N_T γ), from which Lemma 2 is directly derived.

We now have all the material needed to prove the stability property of our algorithm.

Lemma 3 Let N_T and N_R be the number of training examples and reasonable points respectively, with N_R = τ̂ N_T and τ̂ ∈ ]0, 1]. SLLC has a uniform stability in κ/N_T with κ = (1/γ)(1/(βγ) + 2/τ̂) = (τ̂ + 2βγ)/(τ̂ β γ²), where β is the regularization parameter and γ the margin.

Proof For any sample T of size N_T, any 1 ≤ i ≤ N_T, and any labeled examples z = (x, ℓ) and z′_i = (x′_i, ℓ′_i) ∼ P:

|V(A, z, R) − V(A^i, z, R^i)|
≤ | (1/(γ N_R)) Σ_{k=1}^{N_R} ℓ_k K_A(x, x_k) − (1/(γ N_{R^i})) Σ_{k=1}^{N_{R^i}} ℓ_k K_{A^i}(x, x_k) |
= (1/(γ N_R)) | Σ_{k=1, k≠i}^{N_R} ℓ_k ( K_A(x, x_k) − K_{A^i}(x, x_k) ) + ℓ_i K_A(x, x_i) − ℓ′_i K_{A^i}(x, x′_i) |
≤ (1/(γ N_R)) ( Σ_{k=1, k≠i}^{N_R} |ℓ_k| ‖A − A^i‖_F + |ℓ′_i K_{A^i}(x, x′_i)| + |ℓ_i K_A(x, x_i)| )
≤ (1/(γ N_R)) ( (N_R − 1)/(β N_T γ) + 2 ) ≤ (1/(γ N_R)) ( N_R/(β N_T γ) + 2 ).

The first inequality follows from P3. The second comes from the fact that R and R^i differ in at most one element, corresponding to the example z_i in R and the example z′_i replacing z_i in R^i. The last inequalities are obtained by the use of the triangle inequality, P1, P2, Lemma 2, and the fact that the labels belong to {−1, 1}. Since N_R = τ̂ N_T, we get |V(A, z, R) − V(A^i, z, R^i)| ≤ (1/(γ N_T)) ( 1/(βγ) + 2/τ̂ ).

Applying Theorem 2 with Lemma 3 gives our main result.

Theorem 3 Let γ > 0, δ > 0 and N_T > 1. With probability at least 1 − δ, for any model A_R learned with SLLC, we have:

ǫ ≤ ǫ_T + (1/N_T) (τ̂ + 2βγ)/(τ̂ β γ²) + ( 2(τ̂ + 2βγ)/(τ̂ β γ²) + 1 ) √( ln(1/δ) / (2 N_T) ).

Theorem 3 highlights three important properties of SLLC. First, it converges in O(1/√N_T), which is a standard convergence rate for uniform stability. Second, it is independent from the dimensionality of the data. This is due to the fact that ‖A‖_F is bounded by a constant. Third, Theorem 3 bounds the goodness in generalization of the learned similarity function. By minimizing ǫ_T with SLLC, we minimize ǫ and thus the error of the resulting linear classifier, as stated by Theorem 1.

5. Experiments

We propose a comparative study of our method and two widely-used Mahalanobis distance learning algorithms: Large Margin Nearest Neighbor (LMNN) from Weinberger & Saul (2009) and Information-Theoretic Metric Learning (ITML) from Davis et al. (2007); we used the code provided by the authors. Roughly speaking, LMNN optimizes the k-NN error on the training set (with a safety margin) whereas ITML aims at best satisfying pair-based constraints while minimizing the LogDet divergence between the learned matrix M and the identity matrix. We conduct this experimental study on seven classic binary datasets of varying domain, size and difficulty, mostly taken from the UCI Machine Learning Repository. Their properties are summarized in Table 1. Some of them, such as Breast, Ionosphere or Pima, have been extensively used to evaluate metric learning methods.

5.1. Experimental Setup

We compare the following methods: (i) the cosine similarity K_I in KPCA space, as a baseline, (ii) SLLC, (iii) LMNN in the original space, (iv) LMNN in KPCA space, (v) ITML in the original space, and (vi) ITML in KPCA space (K_I, LMNN and ITML are normalized to ensure their values belong to [−1, 1]). All attributes are scaled to [−1/d; 1/d] to ensure ‖x‖_2 ≤ 1. To generate a new feature space using KPCA, we use the Gaussian kernel with parameter σ equal to the mean of all pairwise training data Euclidean distances (a standard heuristic, used for instance by Kar & Jain (2011)). Ideally, we would like to project the data to the feature space of maximum size (equal to the number of training examples), but to keep the computations tractable we only retain three times the number of features of the original data (four times for the low-dimensional datasets), as shown in Table 1; the amount of variance captured thereby was greater than 90% for all datasets. On Cod-RNA, KPCA was run on a randomly drawn subsample of 10% of the training data. Unless predefined training and test sets are available (as for Splice, Svmguide1 and Cod-RNA), we randomly generate 70/30 splits of the data, and average the results over 100 runs. Training sets are further partitioned 70/30 for validation purposes. We tune the following parameters by cross-validation: β, γ ∈ {10^−7, ..., 10^−2} for SLLC, λ_ITML ∈ {10^−4, ..., 10^4} for ITML, and λ ∈ {10^−3, ..., 10^2} for learning the linear classifiers, choosing the value offering the best accuracy. We choose R to be the entire training set, i.e., τ̂ = 1 (interestingly, cross-validation of τ̂ did not improve the results significantly). We take k = 3 and µ = 0.5 for LMNN, as done in (Weinberger & Saul, 2009). For ITML, we generate N_T random constraints for a fair comparison with SLLC.
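For concreteness, the data preparation of Section 5.1 as we read it could look as follows. This is a sketch under our own assumptions about implementation details the paper leaves open; the function names are illustrative.

```python
import numpy as np
from scipy.spatial.distance import pdist

def scale_features(X_train, X_test):
    """Rescale every attribute to [-1/d, 1/d] using the training range,
    so that each training example satisfies ||x||_2 <= 1."""
    d = X_train.shape[1]
    lo, hi = X_train.min(axis=0), X_train.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)
    to_range = lambda X: ((X - lo) / span * 2.0 - 1.0) / d
    return to_range(X_train), to_range(X_test)

def sigma_heuristic(X_train):
    """Gaussian kernel width: mean of all pairwise training Euclidean distances."""
    return pdist(X_train).mean()

# Xs_tr, Xs_te = scale_features(X_train, X_test)
# Z_tr, Z_te = kpca_project(Xs_tr, Xs_te,
#                           n_components=3 * X_train.shape[1],   # 4x for low-dimensional data
#                           sigma=sigma_heuristic(Xs_tr))
```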
5.2. Results

Classification performance We report the results obtained with the sparse linear classifier of Problem (1) suggested by Balcan et al. (2008), but also those obtained with 3-NN since LMNN and ITML are designed for k-NN use (when necessary, we take the opposite value of a similarity to get a measure of distance, and vice versa).

Table 1. Properties of the seven datasets used in the experimental study.

                      Breast   Iono.   Rings   Pima   Splice   Svmguide1   Cod-RNA
# training examples      488     245     700    537    1,000       3,089    59,535
# test examples          211     106     300    231    2,175       4,000   271,617
# dimensions               9      34       2      8       60           4         8
# dim. after KPCA         27     102       8     24      180          16        24
# runs                   100     100     100    100        1           1         1

Table 2. Average accuracy (first line for each method) and sparsity (second line) of the linear classifiers built from the studied similarity functions. For each dataset, bold font in the original indicates the most accurate method (sparsity is used to break the ties).

              Breast   Iono.   Rings   Pima   Splice   Svmguide1   Cod-RNA
K_I            96.57   89.81  100.00  75.62    83.86       96.95     95.91
               20.39   52.93   18.20  25.93      362          64       557
SLLC           96.90   93.25  100.00  75.94    87.36       96.55     94.08
                1.00    1.00    1.00   1.00        1           8         1
LMNN           96.81   90.21  100.00  75.15    85.61       95.80     88.40
                9.98   13.30   18.04  69.71      315         157        61
LMNN KPCA      96.01   86.12  100.00  74.92    86.85       96.53     95.15
                8.46    9.96    8.73  22.20      156          82       591
ITML           96.80   92.09  100.00  75.25    81.47       96.70     95.06
                9.79    9.51   17.85  56.22      377          49       164
ITML KPCA      96.23   93.05  100.00  75.25    85.29       96.55     95.14
               17.17   18.01   15.21  16.40      287          89       206

In linear classification (Table 2), SLLC achieves the highest accuracy on 5 out of 7 datasets and competitive performance on the remaining 2. At the same time, on all datasets, SLLC leads to extremely sparse classifiers. The sparsity of the classifier corresponds to the number of training examples that are involved in classifying a new example. Therefore, SLLC leads to much simpler and yet often more accurate classifiers than those built from other similarities. Furthermore, sparsity allows faster predictions, especially when data are plentiful and/or high-dimensional (e.g., Cod-RNA or Splice). Often enough, the learned linear classifier has sparsity 1, which means that classifying a new example boils down to computing its similarity score to a single training example and comparing the value with a threshold. Note that we tried large values of λ to obtain sparser classifiers from K_I, LMNN and ITML, but this yielded dramatic drops in accuracy. The extreme sparsity brought by SLLC comes from the fact that the constraints are based on an average of similarity scores over the same set of points for all training examples. Surprisingly, in 3-NN classification (Table 3), SLLC achieves the best results on 4 datasets. It is, however, outperformed by LMNN or ITML on the 3 biggest problems. For most tasks, the accuracy obtained in linear classification is better than or similar to that obtained with 3-NN (highlighting the fact that similarity learning for linear classification is of interest) while prediction is many orders of magnitude faster due to the sparsity of the linear separators. Also note that a good similarity for k-NN classification can achieve poor results in linear classification (LMNN on Cod-RNA), and vice versa (SLLC on Svmguide1).
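The sparsity values in Table 2 count the nonzero coordinates of α in the classifier of Problem (1); at prediction time only those few training points (landmarks) matter. Below is a minimal prediction sketch (ours, for illustration), assuming a learned matrix A and weight vector alpha obtained by solving Problem (1).

```python
import numpy as np

def predict_sparse(x, A, alpha, landmarks):
    """Sparse linear classifier of Problem (1): sign(sum_j alpha_j * K_A(x, x'_j)).
    Only landmarks with alpha_j != 0 contribute, so a classifier of sparsity 1
    needs a single bilinear similarity computation."""
    nz = np.flatnonzero(alpha)
    scores = (x @ A) @ landmarks[nz].T      # K_A(x, x'_j) = x^T A x'_j
    return np.sign(alpha[nz] @ scores)
```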

[Figure 1. Accuracy of the methods with respect to the dimensionality of the KPCA space on Ionosphere: classification accuracy vs. dimension, with curves for SLLC, LMNN and ITML.]

Robustness to overfitting The fact that SLLC performs well on small datasets is partly due to its robustness to overfitting, highlighted in Figure 1. As expected, LMNN and ITML, which are optimized locally and plugged in a local nonlinear classifier, tend to overfit as the dimensionality grows. On the other hand, SLLC suffers from very limited overfitting due to its global and linear setting.

Time comparison Note that SLLC is solved using the standard convex programming package Mosek while LMNN and ITML have their own specific and sophisticated solvers. Despite this fact, SLLC is several orders of magnitude faster than LMNN (see Table 4) because its number of constraints is much smaller. However, it remains slower than ITML.

Table 3. Average accuracy of 3-NN classifiers using the studied similarity functions.

           Breast   Iono.   Rings   Pima   Splice   Svmguide1   Cod-RNA
K_I         96.71   83.57  100.00  72.78    77.52       93.93     90.07
SLLC        96.90   93.25  100.00  75.94    87.36       93.82     94.08
LMNN        96.46   88.68  100.00  72.84    83.49       96.23     94.98
LMNN KPCA   96.23   87.13  100.00  73.50    87.59       95.85     94.43
ITML        92.67   88.29  100.00  72.07    77.43       95.97     95.42
ITML KPCA   96.38   87.56  100.00  72.80    84.41       96.80     95.32

Table 4. Average time per run (in seconds) required for learning the similarity.

           Breast   Iono.   Rings   Pima   Splice   Svmguide1    Cod-RNA
SLLC         4.76    5.36    0.05   4.01   158.38      185.53    2471.25
LMNN        25.99   16.27   37.95  32.14   309.36      331.28   10418.73
LMNN KPCA   41.06   34.57   84.86  48.28  1122.60      369.31   24296.41
ITML         2.09    3.09    0.19   2.96     3.41        0.83       5.98
ITML KPCA    1.68    5.77    0.20   2.74    56.14        5.30      25.25

6. Conclusion

In this paper, we presented SLLC, a novel approach to the problem of similarity learning by making use of both Balcan et al.'s theory of (ǫ,γ,τ)-good similarity functions and the KPCA trick. We derived a generalization bound based on the notion of uniform stability that is independent from the size of the input space, and thus from the number of dimensions selected by KPCA. It guarantees the goodness in generalization of the learned similarity, and therefore the accuracy of the resulting linear classifier for the considered task. We experimentally demonstrated the effectiveness of SLLC and also showed that the learned similarities induce extremely sparse classifiers. Combined with the independence from dimensionality and the robustness to overfitting, this makes the approach very efficient and suitable for high-dimensional data. Future work could include a "full" kernelization of SLLC (i.e., expressing the problem solely in terms of inner products), studying the influence of other regularizers on A (for instance, using the nuclear norm to learn low-rank matrices), developing a specific solver to match ITML's speed, and the derivation of an online algorithm.

Acknowledgments

We would like to acknowledge support from the ANR LAMPADA 09-EMER-007-02 project and the PASCAL 2 Network of Excellence.

References

Balcan, M.-F., Blum, A., and Srebro, N. Improved Guarantees for Learning via Similarity Functions. In COLT, pp. 287–298, 2008.

Bousquet, O. and Elisseeff, A. Stability and Generalization. JMLR, 2:499–526, 2002.

Chatpatanasiri, R., Korsrilabutr, T., Tangchanachaianan, P., and Kijsirikul, B. A new kernelization framework for Mahalanobis distance learning algorithms. Neurocomputing, 73:1570–1579, 2010.

Chechik, G., Shalit, U., Sharma, V., and Bengio, S. An Online Algorithm for Large Scale Image Similarity Learning. In NIPS, pp. 306–314, 2009.

Davis, J. V., Kulis, B., Jain, P., Sra, S., and Dhillon, I. S. Information-theoretic metric learning. In ICML, pp. 209–216, 2007.

Jain, P., Kulis, B., Dhillon, I. S., and Grauman, K. Online Metric Learning and Fast Similarity Search. In NIPS, pp. 761–768, 2008.

Kar, P. and Jain, P. Similarity-based Learning via Data Driven Embeddings. In NIPS, 2011.

Qamar, A. M. Generalized Cosine and Similarity Metrics: A supervised learning approach based on nearest-neighbors. PhD thesis, University of Grenoble, 2010.

Schölkopf, B., Smola, A., and Müller, K.-R. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10(5):1299–1319, 1998.

Schultz, M. and Joachims, T. Learning a Distance Metric from Relative Comparisons. In NIPS, 2003.

Shalev-Shwartz, S., Singer, Y., and Ng, A. Y. Online and batch learning of pseudo-metrics. In ICML, 2004.

Weinberger, K. Q. and Saul, L. K. Distance Metric Learning for Large Margin Nearest Neighbor Classification. JMLR, 10:207–244, 2009.

Ying, Y., Huang, K., and Campbell, C. Sparse Metric Learning via Smooth Optimization. In NIPS, pp. 2214–2222, 2009.

Zhu, J., Rosset, S., Hastie, T., and Tibshirani, R. 1-norm Support Vector Machines. In NIPS, volume 16, pp. 49–56, 2003.