
The Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18)

Matrix Variate Gaussian Mixture Distribution Steered Robust Metric Learning

Lei Luo, Heng Huang*
Electrical and Computer Engineering, University of Pittsburgh, USA
[email protected], [email protected]

* To whom all correspondence should be addressed.
Copyright © 2018, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Abstract

Mahalanobis Metric Learning (MML) has been actively studied recently in the machine learning community. Most existing MML methods aim to learn a powerful Mahalanobis distance for computing the similarity of two objects. More recently, multiple methods use matrix norm regularizers to constrain the learned distance matrix M to improve the performance. However, in real applications, the structure of the distance matrix M is complicated and cannot be characterized well by a simple matrix norm. In this paper, we propose a novel robust metric learning method that learns the structure of the distance matrix in a new and natural way. We partition M into blocks and consider each block as a random matrix variate, which is fitted by a matrix variate Gaussian mixture distribution. Different from existing methods, our model makes no assumption on M and automatically learns the structure of M from the real data, where the distance matrix M is often neither sparse nor low-rank. We design an effective algorithm to optimize the proposed model and establish the corresponding theoretical guarantee. We conduct extensive evaluations on real-world data. Experimental results show our method consistently outperforms the related state-of-the-art methods.

Introduction

Mahalanobis metric learning (MML) has been actively studied in the machine learning community and successfully applied to numerous applications (Kuznetsova et al. 2016). MML methods target to learn a good Mahalanobis metric to effectively gauge the pairwise distance between data objects. Particularly, the distance matrix M plays a crucial role in MML. A well-learned M can precisely reflect domain-specific connections and relationships. Toward this end, many metric learning algorithms under various problem settings have been proposed, such as pairwise constrained component analysis (PCCA) (Mignon and Jurie 2012), neighborhood repulsed metric learning (NRML) (Lu et al. 2014), large margin nearest neighbor (LMNN) (Weinberger and Saul 2009), logistic discriminant metric learning (LDML) (Guillaumin, Verbeek, and Schmid 2009), and Hamming distance learning (Zheng, Tang, and Shao 2016; Zheng and Shao 2016). Although these supervised algorithms have shown great success in many applications, finding a robust distance metric from real-world data remains a big challenge.

To enhance the performance of metric learning models, some recent methods, represented by low-rank or sparse metric learning (Huo, Nie, and Huang 2016; Ying, Huang, and Campbell 2009), actively integrate different regularization terms associated with the matrix M in modeling. On one hand, using these regularization terms can effectively prevent the overfitting problem, since it is equivalent to adding prior knowledge into M. On the other hand, these regularization terms can extract partial structure information of the matrix M. The structure of a variable corresponds to its spatial distribution. However, for many practical applications, the spatial distribution of M is generally irregular and complicated. Simple matrix norm regularizers cannot characterize this structure information well. For instance, L2- or L1-norm regularization is based on the hypothesis that the elements of M are independently distributed with a Gaussian or Laplace distribution (Luo et al. 2016b). Obviously, they cannot capture the spatial structure of M. The low-rank and Fantope regularizers (Law, Thome, and Cord 2014) overcome this limitation, but they lack generalization because the distance matrix M may not be low-rank for real data.

Many previous works make use of multiple matrix norms to jointly characterize a matrix variate with a complex distribution. Nevertheless, these strategies are limited to some special cases. For example, the Least Soft-threshold Squares method (Wang, Lu, and Yang 2013) is only suitable for a variate following a Gaussian-Laplace distribution, while the Nuclear-L1 joint regression model (Luo et al. 2015; 2016a) focuses on structural and sparse matrix variates. To adapt to more practical problems, some scholars carry out the variate estimation task under the framework of Gaussian Mixture Regression (GMR) (Cao et al. 2015). This originates from a basic fact: a Gaussian mixture distribution can construct a universal approximator to any continuous density function in theory (Bishop 2007). The experimental results show the advantages of this strategy, but most of these methods (related to GMR) are based on regression analysis. As we know, many practical problems cannot be formulated as regression-like models. Accordingly, it is expected to extend GMR to other formulations. Additionally, these GMR-based approaches assume that the elements in a matrix variate are generated independently for convenience, which overlooks the latent relationships between elements.

In this paper, we propose a novel metric learning model that utilizes the Gaussian Mixture Distribution (GMD) to automatically learn the structure information of the distance matrix M from real data. To the best of our knowledge, this is the first work to depict a matrix variate using GMD in a metric learning model. To fully exploit the structure of M, we partition M into blocks and view each block as a random matrix variate which is automatically fitted by GMD. Since each block represents the local structure of M, our method integrates the structure information of different regions of M in modeling. On the basis of GMD, we construct a convex regularization with regard to M and use it to derive a robust metric learning model with triplet constraints. A new effective algorithm is introduced to solve the proposed model. Due to the promising generalization of GMD, our model does not rely on any assumption on M, regardless of whether it holds the low-rank (or sparse) attribute or not. Therefore, compared with existing metric learning methods using matrix norm regularizers, our method can effectively learn the structure information for a general distance matrix M. Moreover, as the theoretical contribution of this paper, we analyze the convexity and generalization ability of the proposed model. A series of experiments on image classification and face verification demonstrates the effectiveness of our new method.

Figure 1: Partition a matrix M = (m_{ij})_{d×d} into p × q non-overlapping blocks: (a) the original matrix M; (b) the partitioned result, where each M_{uv} ∈ R^{d_u×d_v} (u = 1, ···, p, v = 1, ···, q).

Learning A Robust Distance Metric Using Gaussian Mixture Distribution

In this section, we will first propose a robust objective for distance metric learning using a grouped Gaussian mixture distribution. After that, we design an effective optimization algorithm to solve the proposed model.

Notation. Throughout this paper, we write matrices as bold uppercase characters and vectors as bold lowercase characters. Let y = {y_1, y_2, ···, y_n} be the label set of input (or training) samples X = {x_1, x_2, ···, x_n}, where each x_i ∈ R^d (i = 1, 2, ···, n). For example, the label of sample x_i is y_i. Let r(y_i, y_l) = 1 if y_i ≠ y_l and r(y_i, y_l) = 0 otherwise. Suppose input samples X and labels y are contained in an input space X and a label space Y, respectively. Meanwhile, we assume z := {z_i = (x_i, y_i) : x_i ∈ X, y_i ∈ Y, i ∈ N_n}, where N_n = {1, 2, ···, n}. For any x ∈ R, the function f(x) = [x]_+ is equal to x if x > 0 and zero otherwise. R_+^{d×d} defines the set of positive definite matrices on R^{d×d}. ||M||_F, M^T and Tr(M) denote the Frobenius norm, transpose and trace of the matrix M, respectively.

Problem statement. Given any two data points x_i and x_j, a Mahalanobis distance between them can be calculated as follows:

d_M(x_i, x_j) = \sqrt{(x_i - x_j)^T M (x_i - x_j)}.   (1)

The core task of metric learning is to learn an optimal positive semi-definite matrix M such that the distance between similar samples is relatively smaller than that between dissimilar samples. Meanwhile, a desired M should provide robustness to noise. To this end, the label information can be fully exploited, which leads to different weakly-supervised constraints, including pairwise, triplet and quadruplet constraints. However, the metric learning models with different constraints can be unified in the following form:

M_z = \arg\min_{M \in \mathcal{M}} \big( \varepsilon_z(M) + \lambda\,\Omega(M) \big),   (2)

where ε_z(·) is the loss function, Ω(M) is a regularization term, the balance parameter λ > 0, and M (calligraphic) denotes the domain of M. In general, \mathcal{M} ⊆ R_+^{d×d}.

The loss function ε_z(·) is generally induced by different constraints. Therefore, minimizing ε_z(·) results in minimizing the distances between data points with similar constraints and maximizing the distances between data points with dissimilar constraints. Recently, the metric learning model large margin nearest neighbor (LMNN) (Weinberger and Saul 2009) has attracted wide attention. It uses triplet constraints on training examples and the corresponding loss function can be expressed as:

\varepsilon_z^{lmnn}(M) = (1 - \mu) \sum_{i,\, j \rightsquigarrow i} D_M(x_i, x_j) + \mu \sum_{i,\, j \rightsquigarrow i,\, l} r(y_i, y_l)\,[1 + D_M(x_i, x_j) - D_M(x_i, x_l)]_+,   (3)

where D_M(x_i, x_j) = d_M(x_i, x_j)^2 and the notation j ⇝ i indicates that input x_j is a target neighbor (i.e., a same-labeled input) of input x_i. When r(y_i, y_l) = 1, x_l is an impostor neighbor (i.e., a differently labeled input) of input x_i. The parameter μ ∈ (0, 1) defines a trade-off between the above two objectives.

Although LMNN significantly improves the performance of traditional kNN classification, it is often confronted with the overfitting problem. Hence, some regularized LMNN models (Li, Tian, and Tao 2016; Lim, McFee, and Lanckriet 2013) have emerged, which enhance the generalization and robustness of LMNN. It should be noted that these methods only utilize some special matrix norm regularizers (e.g., the L1-norm or nuclear norm) to constrain M. In many practical problems, however, the distance matrix M of real data may be neither sparse nor low-rank.
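To make Eqs. (1) and (3) concrete, the following minimal Python sketch evaluates the Mahalanobis distance and an LMNN-style triplet loss for a given positive semi-definite M. It is an illustration rather than the authors' implementation: the function names, the default μ, and the assumption that target-neighbor pairs and impostor triples are precomputed are ours.

    import numpy as np

    def mahalanobis_dist(M, xi, xj):
        # Eq. (1): d_M(xi, xj) = sqrt((xi - xj)^T M (xi - xj))
        d = xi - xj
        return np.sqrt(d @ M @ d)

    def lmnn_loss(M, X, targets, impostors, mu=0.5):
        # Eq. (3): pull term over target neighbors plus hinged push term over impostor triples.
        # targets:   list of (i, j) pairs with x_j a target neighbor of x_i
        # impostors: list of (i, j, l) triples with x_l differently labeled from x_i
        D = lambda i, j: (X[i] - X[j]) @ M @ (X[i] - X[j])   # squared Mahalanobis distance
        pull = sum(D(i, j) for i, j in targets)
        push = sum(max(0.0, 1.0 + D(i, j) - D(i, l)) for i, j, l in impostors)
        return (1.0 - mu) * pull + mu * push

Here the indicator r(y_i, y_l) is handled implicitly by enumerating only triples (i, j, l) whose third index is differently labeled from x_i.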

The robust metric learning model induced by GMD. In the following, we provide a general regularization to characterize the distance matrix M so as to adapt to more practical applications. This regularization is induced by the matrix variate Gaussian mixture distribution.

First, as shown in Fig. 1, we partition the matrix M (∈ R^{d×d}) into p × q blocks, where any two blocks do not contain the same elements. For each block M_{uv} (∈ R^{d_u×d_v}), we preserve its matrix form. Each block M_{uv} can be regarded as the local structure of M, so our strategy actually merges different regions of local structures of M. Next, we use the matrix variate Gaussian mixture distribution to fit each block. For convenience, it is assumed that all blocks possess the same size. As a result, the Probability Density Function (PDF) of each block M_{uv} has the following form:

P(M_{uv}) = \sum_{k=1}^{R} \pi_k\, \mathcal{N}(M_{uv} \mid 0, \Sigma_k),   (4)

where R is the number of Gaussian components, π_k denotes the mixing weight with \pi_k > 0 and \sum_{k=1}^{R} \pi_k = 1, and N(M_{uv} | 0, Σ_k) is the zero-mean matrix variate Gaussian distribution with Σ_k denoting the covariance matrix, i.e.,

\mathcal{N}(M_{uv} \mid 0, \Sigma_k) = \frac{1}{(2\pi)^{\frac{d_u d_v}{2}} |\Sigma_k|^{\frac{d_v}{2}}} \exp\!\Big(-\frac{1}{2}\,\mathrm{tr}\big(M_{uv}^T \Sigma_k^{-1} M_{uv}\big)\Big).   (5)

Finally, we assume all random matrix variates {M_{uv}: u = 1, 2, ···, p, v = 1, 2, ···, q} are independently and identically distributed. The PDF of M can be written as:

P(M) = \prod_{u=1}^{p} \prod_{v=1}^{q} P(M_{uv}).   (6)

The main task of traditional GMR is to search for a group of mixing weights Π = {π_1, π_2, ···, π_R} and a group of covariance matrices Σ = {Σ_1, Σ_2, ···, Σ_R} such that the PDF of the regression error is maximized. In our problem, this means that the negative log-likelihood function associated with the distance matrix M,

-\ln P(M) = -\sum_{u=1}^{p} \sum_{v=1}^{q} \ln \Big( \sum_{k=1}^{R} \pi_k\, \mathcal{N}(M_{uv} \mid 0, \Sigma_k) \Big),   (7)

is minimized.

The above function not only considers the two-dimensional group structure of M, but also absorbs the advantages of GMD. If we set R = p = q = 1, π_1 = 1 and Σ_1 = I_{d×d}, where I_{d×d} denotes a d × d identity matrix, then −ln P(M) becomes the squared Frobenius norm (up to an additive constant and a scaling factor). As for other norms, e.g., the L1-norm, group sparsity norm and nuclear norm (Li et al. 2016), there exist corresponding segmentation strategies on M and parameters Π and Σ such that −ln P(·) can well approximate them (Maz'ya and Schmidt 1996). Therefore, −ln P(·) is more general as compared to these matrix norm regularizers.

Considering this merit of −ln P(·), we choose Ω(M) = −ln P(M) and ε_z(·) = ε_z^{lmnn}(·) in (2), which leads to the following metric learning model:

(M, \Pi, \Sigma) = \arg\min\ \varepsilon_z^{lmnn}(M) - \lambda \ln P(M)
\quad \text{s.t. } M \in \mathcal{M},\ \pi_k > 0,\ \Sigma_k \in \mathbb{R}_+^{d \times d},\ k = 1, 2, \cdots, R,\ \text{and } \sum_{k=1}^{R} \pi_k = 1.   (8)

Optimization algorithm for solving model (8). Compared to other metric learning models with matrix norm regularizers, solving model (8) is very challenging, because it needs to learn more model parameters and −ln P(·) is not separable for Σ and Π. Here, we regard the objective (8) as a Q-function (Zheng, Liu, and Ni 2014) in Bayesian learning and propose a simple yet effective Expectation-Maximization (EM) algorithm (Bishop 2007) to optimize it. This is divided into three steps as follows:

(1) Initialization. We need to initialize the distance matrix M, the mixing coefficients set Π and the covariance matrices set Σ.

(2) E-step. In this step, we compute the conditional expectation of the latent variate z_{uv,k} given M_{uv} by Bayes' rule, i.e.,

z_{uv,k} = \frac{\pi_k\, \mathcal{N}(M_{uv} \mid 0, \Sigma_k)}{\sum_{j=1}^{R} \pi_j\, \mathcal{N}(M_{uv} \mid 0, \Sigma_j)}.   (9)

(3) M-step. We find the optimal parameters Σ, Π and M according to the current posterior probabilities.

Recalling our model (8), each optimal Σ_k can be obtained by solving the following problem:

\min_{\Sigma_k \in \mathbb{R}_+^{d \times d}} \ -\sum_{u=1}^{p} \sum_{v=1}^{q} \ln \Big( \sum_{k=1}^{R} \pi_k\, \mathcal{N}(M_{uv} \mid 0, \Sigma_k) \Big).   (10)

Calculating the derivative of objective (10) with respect to Σ_k and setting it to 0, we have

\Sigma_k = \frac{1}{\sum_{u=1}^{p} \sum_{v=1}^{q} z_{uv,k}} \Big( \sum_{u=1}^{p} \sum_{v=1}^{q} z_{uv,k}\, M_{uv} M_{uv}^T + \xi I_{d_u \times d_v} \Big),   (11)

where ξ I_{d_u×d_v} (ξ > 0) is a perturbation term ensuring Σ_k is invertible.

To achieve the optimal π_k, k = 1, 2, ···, R, we consider the optimization problem:

\min_{\pi_k > 0} \ -\sum_{u=1}^{p} \sum_{v=1}^{q} \ln \Big( \sum_{k=1}^{R} \pi_k\, \mathcal{N}(M_{uv} \mid 0, \Sigma_k) \Big) + \alpha \Big( \sum_{k=1}^{R} \pi_k - 1 \Big),   (12)

where α > 0 is a Lagrangian multiplier. Therefore,

\pi_k = \frac{1}{pq} \sum_{u=1}^{p} \sum_{v=1}^{q} z_{uv,k}.   (13)

For the variate M, we consider the following problem:

M = \arg\min_{M \in \mathcal{M}} \ \varepsilon_z^{lmnn}(M) - \lambda \ln P(M).   (14)
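Before turning to the optimization, here is a hedged Python sketch of the regularizer itself: it partitions M into p × q equal-sized blocks (as assumed above) and evaluates −ln P(M) from Eqs. (4)–(7) for given mixing weights π_k and covariances Σ_k. All helper names are illustrative, and a log-sum-exp is used for numerical stability rather than the literal product in Eq. (6).

    import numpy as np

    def partition_blocks(M, p, q):
        # Split the d x d matrix M into p*q equal-sized, non-overlapping blocks (Fig. 1).
        rows = np.array_split(M, p, axis=0)
        return [blk for r in rows for blk in np.array_split(r, q, axis=1)]

    def log_matrix_gaussian(Muv, Sigma):
        # log N(Muv | 0, Sigma) from Eq. (5); Sigma is the d_u x d_u covariance of a block.
        du, dv = Muv.shape
        _, logdet = np.linalg.slogdet(Sigma)
        quad = np.trace(Muv.T @ np.linalg.solve(Sigma, Muv))
        return -0.5 * (du * dv * np.log(2 * np.pi) + dv * logdet + quad)

    def neg_log_gmd(M, p, q, pis, Sigmas):
        # -ln P(M) from Eqs. (6)-(7): sum over blocks of -ln sum_k pi_k N(Muv | 0, Sigma_k).
        total = 0.0
        for Muv in partition_blocks(M, p, q):
            logs = [np.log(pi) + log_matrix_gaussian(Muv, S)
                    for pi, S in zip(pis, Sigmas)]
            total -= np.logaddexp.reduce(logs)   # numerically stable log-sum-exp
        return total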

It is difficult to find a closed-form solution of problem (14). However, we can compute a sub-gradient of objective (14) with regard to M:

\partial_M = (1 - \mu) \sum_{i,\, j \rightsquigarrow i} C_{ij} + \mu \sum_{(i,j,l) \in \mathcal{H}} (C_{ij} - C_{il}) + \frac{d(-\lambda \ln P(M))}{dM},   (15)

where C_{ij} = (x_i - x_j)(x_i - x_j)^T and the set of triples H is defined as: (i, j, l) ∈ H if and only if the indices (i, j, l) trigger the hinge loss in the second part of Eq. (3). In addition, each block of d(−λ ln P(M))/dM can be expressed as:

\frac{d(-\lambda \ln P(M))}{dM_{uv}} = \lambda\, \frac{\sum_{k=1}^{R} GD_{uv,k}\, \Sigma_k^{-1} M_{uv}}{\sum_{k=1}^{R} GD_{uv,k}},   (16)

where GD_{uv,k} = \frac{\pi_k}{(2\pi)^{\frac{d_u d_v}{2}} |\Sigma_k|^{\frac{d_v}{2}}} \exp\!\big(-\frac{1}{2}\, \mathrm{tr}(M_{uv}^T \Sigma_k^{-1} M_{uv})\big).

Then, when Σ and Π are fixed, the optimal M can be found by solving model (14) using the subgradient method (Boyd, Xiao, and Mutapcic 2003):

M \Leftarrow P_{\mathbb{R}_+^{d \times d}} (M - \gamma\, \partial_M),   (17)

where P_{\mathbb{R}_+^{d \times d}}(·) denotes the projection operator and γ > 0 is a step size. The whole iterative procedure for solving problem (8) is summarized in Algorithm 1.

Algorithm 1 Solving Model (8) via EM
Input: training samples X, parameter λ, number of components R and threshold value ξ.
Initialization: covariance matrices set Σ = {Σ_1, Σ_2, ···, Σ_R}, coefficients Π = {π_1, π_2, ···, π_R} and the initial distance matrix M.
Output: the final parameters Σ, Π and distance matrix M.
repeat
  1. (E-step for z_{uv,k}): Update the posterior probability z_{uv,k} by
     z_{uv,k}^{new} \Leftarrow \frac{\pi_k^{old}\, \mathcal{N}(M_{uv}^{old} \mid 0, \Sigma_k^{old})}{\sum_{j=1}^{R} \pi_j^{old}\, \mathcal{N}(M_{uv}^{old} \mid 0, \Sigma_j^{old})}.
  2. (M-step for Σ_k): Update each Σ_k by
     \Sigma_k^{new} \Leftarrow \frac{1}{\sum_{u=1}^{p} \sum_{v=1}^{q} z_{uv,k}^{old}} (W + \xi I_{d_u \times d_v}),
     where W = \sum_{u=1}^{p} \sum_{v=1}^{q} z_{uv,k}^{old}\, M_{uv}^{old} M_{uv}^{old\,T}.
  3. (M-step for π_k): Update each π_k by
     \pi_k^{new} \Leftarrow \frac{1}{pq} \sum_{u=1}^{p} \sum_{v=1}^{q} z_{uv,k}^{old}.
  4. (M-step for M): Update M by using the subgradient method (17) to solve problem (14).
until convergence.

Remark 1. Computing the sub-gradient (15) is extremely expensive since the set of triples H is potentially very large. To circumvent this problem, we can use a similar technique as in (Weinberger and Saul 2009) to reduce the computation complexity of (17).

There have been many results on the convergence of the EM algorithm; here we refer the readers to (Gupta, Chen, and others 2011). In Algorithm 1, we use the following convergence conditions:

\|\Sigma_k^{new} - \Sigma_k^{old}\|_F \le \theta \quad \text{and} \quad |\pi_k^{new} - \pi_k^{old}| \le \theta,   (18)

where θ is a sufficiently small positive number.

Once the final distance matrix M is obtained by Algorithm 1, then by Eq. (1) we can construct the Mahalanobis distance to compute the similarity of two samples in the experiments.
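The following Python sketch outlines one pass of Algorithm 1 under the same assumptions as the previous snippets: the responsibility update follows Eq. (9), the covariance and mixing-weight updates follow Eqs. (11) and (13), and the metric update is a single projected subgradient step in the spirit of Eq. (17). For brevity the −λ ln P(M) part of the subgradient (Eqs. (15)–(16)) is omitted and the LMNN subgradient is passed in as a user-supplied function; none of this is the authors' code.

    import numpy as np

    def em_iteration(M, blocks_idx, pis, Sigmas, lmnn_subgrad, xi=1e-5, step=1e-2):
        # One E/M pass (sketch). blocks_idx lists the (row_slice, col_slice) of the p*q
        # equal-sized blocks of M; lmnn_subgrad(M) returns a subgradient of eps_z^lmnn.
        blocks = [M[r, c] for r, c in blocks_idx]
        R = len(pis)

        def dens(B, S):
            # Matrix variate Gaussian density of Eq. (5).
            du, dv = B.shape
            _, logdet = np.linalg.slogdet(S)
            quad = np.trace(B.T @ np.linalg.solve(S, B))
            return np.exp(-0.5 * (du * dv * np.log(2 * np.pi) + dv * logdet + quad))

        # E-step, Eq. (9): responsibility z[b, k] of component k for block b.
        z = np.array([[pis[k] * dens(B, Sigmas[k]) for k in range(R)] for B in blocks])
        z /= z.sum(axis=1, keepdims=True)

        # M-step, Eq. (11): covariance of each component (perturbed to stay invertible).
        du = blocks[0].shape[0]
        Sigmas = [(sum(z[b, k] * blocks[b] @ blocks[b].T for b in range(len(blocks)))
                   + xi * np.eye(du)) / z[:, k].sum() for k in range(R)]

        # M-step, Eq. (13): mixing weights.
        pis = z.mean(axis=0)

        # M-step for M, Eq. (17): one subgradient step (regularizer term omitted here),
        # followed by projection onto the positive semi-definite cone.
        M = M - step * lmnn_subgrad(M)
        w, V = np.linalg.eigh((M + M.T) / 2)
        M = V @ np.diag(np.clip(w, 0, None)) @ V.T
        return M, pis, Sigmas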

Theoretical Analysis

In this section, we investigate the convexity and the generalization ability of model (14).

Theorem 1. The optimization problem (14) is convex.

Proof. By (Parameswaran and Weinberger 2010), we know that ε_z^{lmnn}(M) is convex. Thus, it suffices to show that −ln P(M) is convex. Since each f_k(M) = −tr(M^T Σ_k M) is a concave matrix function and g(x) = e^x is monotonically increasing and convex, each h_k(M) = c_k e^{−tr(M^T Σ_k M)} is a concave matrix function. Therefore, h_1(M) + h_2(M) + ··· + h_R(M) is still concave. Considering that −ln(·) is a monotonically decreasing function, it is known that the regularization −ln P(M) is convex. □

Remark 2. Connecting (Boyd, Xiao, and Mutapcic 2003) and Theorem 1, it is known that iteration (17) is convergent.

Before stating the generalization guarantee of model (14), some definitions and lemmas need to be introduced. We assume that the instance space X is compact and convex with respect to the L2-norm, i.e., there exists a constant τ > 0 such that ∀ x ∈ X, ||x||_2 ≤ τ. Given a training sample T = {z_i = (x_i, y_i)}_{i=1}^n drawn i.i.d. from an unknown joint distribution P over the space Z, we denote by P_T the set of all possible pairs built from T:

P_T = \{(z_1, z_1), \cdots, (z_1, z_n), \cdots, (z_n, z_n)\}.   (19)

For convenience, we simplify the triplet constraints in (14) as pairwise constraints, and consider the more general metric learning model:

M = \arg\min_{M \in \mathcal{M}} \Big( \frac{1}{|P_T|} \sum_{(z_i, z_j) \in P_T} \omega(M, z_i, z_j) - \lambda \ln P(M) \Big),   (20)

where |P_T| denotes the number of elements of P_T and the loss function ω(M, z_i, z_j) depends on the input examples and their labels.

We call optimization problem (20) the metric learning algorithm A, which takes as input a finite set of pairs from (Z × Z)^n and outputs a metric.

Definition 1. (Bellet 2013) The model (20) (or A) is (S, ζ(·))-robust for S ∈ N and ζ(·): (Z × Z)^n → R if Z can be partitioned into S disjoint sets, denoted by {C_i}_{i=1}^S, such that the following holds for all T ∈ Z^n:

Databases   1NN      SVM      LMNN     LMNN Trace   LMNN Fantope   LMNN Cap   Our method
FGNET       10.412   42.647   47.653   54.711       55.286         57.817     59.812
VINIR       33.474   68.850   69.750   69.832       73.600         73.731     74.428
OSR         65.474   75.850   75.315   76.021       76.471         76.511     77.493
PubFig      67.556   83.797   86.353   86.411       86.576         87.125     87.883

Table 1: The classification accuracies (%) of all methods on the FGNET Aging, VINIR, OSR, and PubFig databases.

∀ (z_1, z_2) ∈ P_T, ∀ z', z'' ∈ Z and i, j ∈ [S]: if z_1, z' ∈ C_i and z_2, z'' ∈ C_j, then

|\omega(M_{P_T}, z_1, z_2) - \omega(M_{P_T}, z', z'')| \le \zeta(P_T),   (21)

where M_{P_T} is learned by model (20) and ζ(P_T) > 0.

We further assume that ω(M, z_i, z_j) = g(y_i y_j [1 − D_M(x_i, x_j)]), where g(·) is nonnegative and Lipschitz continuous with Lipschitz constant L.

Lemma 1. If σ_k is the minimum eigenvalue of Σ_k, then tr(M^T Σ_k M) ≥ σ_k tr(M^T M).

Proof. Let the eigenvalue decomposition of Σ_k be U_k^T Δ_k U_k; it can be seen that tr[M M^T U_k^T (Δ_k − σ_k I_{d×d}) U_k] ≥ 0. This implies that tr(M^T Σ_k M) ≥ σ_k tr(M^T M). □

In this paper, the loss function ω(M, z_i, z_j) is assumed to be nonnegative and uniformly bounded by a constant B_ω (B_ω > 0). Denote G_{P_T}^{\omega} = \frac{1}{|P_T|} \sum_{(z_i, z_j) \in P_T} \omega(M, z_i, z_j), \tau' = \frac{8 L \varepsilon_0 \tau \gamma}{\sqrt{\lambda}}, K = |\mathcal{Y}|\, \mathcal{N}(\gamma/2, \mathcal{X}, \|\cdot\|_2), and G^{\omega} = E_{z, z' \sim P}[\omega(M, z_i, z_j)], where E(·) denotes the expectation.

Theorem 2. Model (20) is \big(|\mathcal{Y}|\, \mathcal{N}(\gamma/2, \mathcal{X}, \|\cdot\|_2),\ \frac{8 L \varepsilon_0 \tau \gamma}{\sqrt{\lambda}}\big)-robust, where \mathcal{N}(\gamma/2, \mathcal{X}, \|\cdot\|_2) is the γ/2-covering number of X (Bellet 2013), ε_0 = ε_z(0) − λ ln P(0) and γ > 0.

Proof. Let ε(M) = \frac{1}{|P_T|} \sum_{(z_i, z_j) \in P_T} \omega(M, z_i, z_j) and M* be the optimal solution of model (20); then we have

\varepsilon_z(M^*) - \lambda \ln P(M^*) \le \varepsilon_z(0) - \lambda \ln P(0) = \varepsilon_0,   (22)

which leads to −ln P(M*) ≤ ε_0 / λ.

We partition Z into |\mathcal{Y}|\, \mathcal{N}(\gamma/2, \mathcal{X}, \|\cdot\|_2) sets such that if z and z' belong to the same set then y = y' and ||x − x'|| ≤ γ. Now, for z_1, z_2, z_1', z_2' ∈ Z, if y_1 = y_1' and y_2 = y_2', then ||x_1 − x_1'|| ≤ γ and ||x_2 − x_2'|| ≤ γ. Thus,

|g(y_1 y_2 [1 - D_{M^*}(x_1, x_2)]) - g(y_1' y_2' [1 - D_{M^*}(x_1', x_2')])|
\le L \big( \|x_1 - x_2\|_2 \|M\|_F \|x_1 - x_1'\|_2 + \|x_1' - x_2\|_2 \|M\|_F \|x_2 - x_2'\|_2 + \|x_1 - x_1'\|_2 \|M\|_F \|x_1' - x_2'\|_2 + \|x_2 - x_2'\|_2 \|M\|_F \|x_1 - x_2'\|_2 \big).   (23)

On the other hand, denoting σ = min{σ_1, σ_2, ···, σ_R} and π = min{π_1, π_2, ···, π_R} and considering Lemma 1, we have

-\ln P(M^*) \ge -\ln\big( R\, \pi\, e^{-\sigma\, \mathrm{tr}(M^T M)} \big) = -\ln(R\pi) + \sigma\, \mathrm{tr}(M^T M).   (24)

Noticing −ln(Rπ) ≥ 0, we further get

\frac{1}{\sigma}\big(-\ln P(M^*)\big) \ge \mathrm{tr}(M^T M) = \|M\|_F^2.   (25)

Combining (23) and (25), we have

|g(y_1 y_2 [1 - D_{M^*}(x_1, x_2)]) - g(y_1' y_2' [1 - D_{M^*}(x_1', x_2')])| \le \frac{8 L \varepsilon_0 \tau \gamma}{\sqrt{\lambda}}.   (26)

According to Definition 1, we can complete the proof of Theorem 2. □

Using Theorem 2 and Theorem 7.3 in (Bellet 2013), we can easily obtain the generalization bound of model (20), i.e., for any δ > 0, with probability at least 1 − δ we have:

|G^{\omega}(M_{P_T}) - G_{P_T}^{\omega}(M_{P_T})| \le \tau' + 2 B_{\omega} \sqrt{\frac{2 K \ln 2 + 2 \ln(1/\delta)}{n}},   (27)

where n denotes the number of training samples.

Experiments

In this section, five standard databases, including the FGNET Aging database (Lanitis, Taylor, and Cootes 2002), the Visible (VI) and Near-Infrared (NIR) Face Database (Shen et al. 2011), the OSR database (Parikh and Grauman 2011), the PubFig database (Kumar et al. 2009) and the LFW database (Huang et al. 2007), are selected to evaluate the effectiveness of our method. Specifically, we implement image classification experiments on the FGNET Aging, VI and NIR Face, OSR and PubFig databases, and compare our method with kNN, SVM (Fan et al. 2008), LMNN (Weinberger and Saul 2009), LMNN Trace (Huo, Nie, and Huang 2016), LMNN Fantope (Law, Thome, and Cord 2014) and LMNN Cap (Huo, Nie, and Huang 2016). Meanwhile, we carry out face verification experiments on the PubFig and LFW databases, and compare our method with LMNN Cap, LMNN Fantope, KISSME (Koestinger et al. 2012), ITML (Davis et al. 2007), LDML (Guillaumin, Verbeek, and Schmid 2009), IDENTITY (Huo, Nie, and Huang 2016) and MAHAL (Huo, Nie, and Huang 2016). For our method, we set λ = 0.001, ξ = 10^{-5} and R = 5. For other compared methods, we follow the authors' suggestions to choose the optimal parameters.

Figure 2: Results of 5 nearest neighbors when we query an image. A green line means this neighbor is in the same class as the query image, and a red line denotes that they are different. (a) Results of 5 nearest neighbors when we query an image on the OSR dataset; the first row shows the results of LMNN Cap, and the second row shows the results of our method. (b) Results of 5 nearest neighbors when we query an image on the Pubfig dataset; the first row shows the results of LMNN Cap, and the second row shows the results of our method.

Experiments on the FGNET Aging Database

We implement an experiment on the FGNET Aging Database. There are 1,002 face images from 82 subjects in this database. Each subject has 6-18 face images at different ages. Each image is labeled by its chronological age. The ages are distributed in a wide range from 0 to 69. Besides age variation, most of the age-progressive image sequences display other types of facial variations, such as significant changes in pose, illumination, expression, etc. To adapt to some metric learning methods, a subset of the FGNET database is chosen. It includes 680 face images from 68 subjects. Each subject has 10 face images. We manually cropped the face portion of each image and then normalized it to 32 × 32 pixels. Then, we randomly choose five face images from each person as training samples, while the remaining five images are utilized as test samples. This experiment is repeated ten times. We calculate the average recognition rates (%) and standard deviations of all methods as shown in the second line of Table 1. For kNN, we set k = 1 as in (Huo, Nie, and Huang 2016). From Table 1, we can find that some metric learning methods with regularization terms, such as LMNN Trace (54.711%), LMNN Fantope (55.29%) and LMNN Cap (57.817%), perform better than the other methods. Since the FGNET Aging database not only includes age variance but also face pose changes, it is very challenging to correctly characterize the distance matrix M. However, our method (59.812%) still has some advantages in handling these variances. As a result, compared to simple matrix norm regularizers, the proposed regularization (7) can fit the distance matrix M better.

Experiments on the VI and NIR Face Database

An experiment is performed on the Visible (VI) and Near-Infrared (NIR) Face Database (Shen et al. 2011). The NIR and VI database is collected using a developed dual camera system from 215 subjects. Subjects' faces were captured in four different settings, i.e., expression variations, pose variations and illumination/time variations. In our experiment, 800 face images from 80 subjects of this database are adopted. Each subject has 10 face images, which include five visible and five near-infrared face images. We manually cropped the face portion of each image and then normalized it to 32 × 32 pixels. Then, we randomly choose five face images from each person as training samples, while the remaining five images are utilized as test samples. This experiment is repeated ten times. We calculate the average recognition rates (%) of all methods as shown in the third line of Table 1. From this table, it is observed that the results of all methods are similar. Particularly, the result of SVM (68.850%) is encouraging. The classification accuracies of the other LMNN based methods are less than 74%, while the proposed method achieves the highest accuracy: 74.428%. This means that our method can effectively handle face images from two sensor types.

Experiments on the OSR Database

In this experiment, we utilize the Outdoor Scene Recognition (OSR) dataset. It includes 2688 images from 8 scene categories, which are described by high-level attribute features. The same experimental setting as (Huo, Nie, and Huang 2016) is adopted: 30 images for each category are chosen as training data, and the other images are used as testing data. To verify the robustness of our method, the 30 training images per category are selected randomly, and this procedure is repeated 5 times. We compute the average accuracies of the compared methods, including kNN, SVM, LMNN, LMNN Trace norm, LMNN Fantope, LMNN capped norm and our method, which are listed in the fourth line of Table 1. It is observed that the performance of 1NN is poor since it only involves a Euclidean metric. The results of the other methods associated with LMNN are comparative, because they take advantage of the Mahalanobis distance, which admits arbitrary linear scalings and rotations of the feature space.
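As a rough illustration of how such classification accuracies can be produced with a learned metric, the sketch below runs a Mahalanobis 1-NN evaluation over repeated random splits with a fixed number of training images per class. It is not the authors' protocol or code; in practice M would be re-learned on each training split, and all names and defaults here are assumptions.

    import numpy as np

    def mahalanobis_1nn_accuracy(M, X_train, y_train, X_test, y_test):
        # Classify each test sample by its nearest training sample under d_M (Eq. (1)).
        preds = []
        for x in X_test:
            diffs = X_train - x                                 # (n_train, d)
            dists = np.einsum('nd,de,ne->n', diffs, M, diffs)   # squared Mahalanobis distances
            preds.append(y_train[np.argmin(dists)])
        return np.mean(np.array(preds) == y_test)

    def repeated_split_eval(M, X, y, per_class=5, repeats=10, seed=0):
        # Average accuracy over random splits with `per_class` training images per subject.
        # (Here M is fixed for simplicity; it would normally be learned per split.)
        rng = np.random.default_rng(seed)
        accs = []
        for _ in range(repeats):
            train_idx = []
            for c in np.unique(y):
                idx = rng.permutation(np.where(y == c)[0])
                train_idx.extend(idx[:per_class])
            train_idx = np.array(train_idx)
            test_idx = np.setdiff1d(np.arange(len(y)), train_idx)
            accs.append(mahalanobis_1nn_accuracy(M, X[train_idx], y[train_idx],
                                                 X[test_idx], y[test_idx]))
        return float(np.mean(accs))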

[Figure 3 image content: three ROC curves plotting True Positive Rate (TPR) against False Positive Rate (FPR) for (a) the Pubfig dataset, (b) Attribute Feature (LFW dataset) and (c) SIFT Feature (LFW dataset). The two recovered legends, matching the LFW panels, list 1-EER values: CAP 0.845/0.817, FANTOPE 0.841/0.814, KISSME 0.844/0.806, ITML 0.833/0.794, LDML 0.835/0.797, IDENTITY 0.783/0.675, MAHAL 0.817/0.748, OUR METHOD 0.848/0.821.]
Figure 3: ROC curves of face verification on the Pubfig and LFW datasets.

Some regularized LMNN methods, e.g., LMNN Trace norm, LMNN Fantope and LMNN capped norm, are superior to LMNN. However, our method (77.493%) performs better than the other methods. Accordingly, using the Gaussian mixture distribution induced regularization to characterize the distance matrix M is more appropriate than matrix norm regularizers in real applications.

Experiments on the PubFig Database

We design two experiments on the PubFig Face Dataset. In the first experiment, we choose a subset of the Pubfig database, which includes 771 images from 8 face categories. The experimental setting is similar to (Law, Thome, and Cord 2014). But here, a 512-dimensional DSIFT (Cheung and Hamarneh 2009) descriptor is used for better performance. This experiment is run 5 times, where 30 images per person in the training data are selected randomly each time, and the average classification accuracies are utilized as the evaluation criterion. We compare our method with some related methods, such as 1NN, SVM, LMNN, LMNN Trace, LMNN Fantope and LMNN Cap. The results of all methods are exhibited in the fifth line of Table 1. It can be seen that our method achieves the best result: 88.083%. Meanwhile, some methods based on LMNN, including LMNN Trace (86.411%), LMNN Fantope (86.576%) and LMNN Cap (87.125%), also give good performance. The above results show that our method is effective for face images under an uncontrolled environment.

The second experiment focuses on the face verification task using the face verification benchmark dataset, which consists of 20,000 pairs of images of 140 people from the PubFig Face Dataset. The data is divided into 10 folds with mutually disjoint sets of 14 people, and each fold contains 1,000 intra- and 1,000 extra-personal pairs. Our method is compared with some recent methods and the Equal Error Rates (EER) of all methods are computed. The ROC curves of all methods are shown in Fig. 3(a). It is evident that our method has a leading performance. For example, the 1-EER values of LMNN Cap (Huo, Nie, and Huang 2016), Fantope, KISSME (Koestinger et al. 2012), ITML (Davis et al. 2007), LDML (Guillaumin, Verbeek, and Schmid 2009), IDENTITY (Huo, Nie, and Huang 2016), MAHAL (Huo, Nie, and Huang 2016) and our method are 0.782, 0.781, 0.766, 0.725, 0.719 and 0.784, respectively. Thus, how to fit the distance M has an important effect on the face verification performance.

Experiments on the LFW Database

In this subsection, the face verification experiments are carried out on the Labeled Faces in the Wild (LFW) dataset. It contains 13,233 unconstrained face images of 5,749 individuals, and 1,680 of these pictured people appear in two or more distinct photos. We adopt two different feature representations, i.e., the LFW Attribute feature dataset and the LFW SIFT feature dataset. The experiment setting is similar to (Huo, Nie, and Huang 2016). We plot the ROC curves of all methods, including LMNN Cap, Fantope, KISSME, ITML, LDML, Identity and MAHAL, in Figure 3(b) and 3(c). Meanwhile, the Equal Error Rate for each method is calculated and the 1-EER value is used to evaluate the performance of these methods. Figure 3(b) shows the results on the LFW Attribute feature dataset. It is seen that Mahalanobis distance based methods perform better than the Euclidean distance. Compared with the Identity and Mahalanobis methods, the advantage of KISSME is obvious. The performance of the Mahalanobis distance with structural regularizers is competitive. For example, LMNN Cap and Fantope reach 84.5% and 84.1%, respectively. For the SIFT Feature dataset, a similar phenomenon can be found (see Figure 3(c)). In a word, our method consistently outperforms the other methods.
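For reference, the Equal Error Rate (and hence the 1-EER score reported above) can be estimated from verification scores and same/different labels roughly as in the Python sketch below; with a learned M, a natural pair score is the negative Mahalanobis distance −d_M(x_i, x_j). The helper is hypothetical and not part of the paper.

    import numpy as np

    def equal_error_rate(scores, labels):
        # scores: higher = more likely "same person"; labels: 1 for intra-, 0 for extra-personal pairs.
        # Sweep thresholds and return the point where FPR and FNR (1 - TPR) are closest.
        scores, labels = np.asarray(scores), np.asarray(labels)
        order = np.argsort(-scores)
        labels = labels[order]
        n_pos = labels.sum()
        n_neg = len(labels) - n_pos
        tpr = np.cumsum(labels) / n_pos          # true positive rate as the threshold decreases
        fpr = np.cumsum(1 - labels) / n_neg      # false positive rate at the same thresholds
        i = np.argmin(np.abs(fpr - (1 - tpr)))
        return (fpr[i] + (1 - tpr[i])) / 2       # report 1 - EER as in Figure 3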
Conclusions

In this paper, we propose a robust metric learning model with triplet constraints. Our method partitions the distance matrix M into several blocks and uses the matrix variate Gaussian mixture distribution to fit each block. Due to the outstanding generalization of the Gaussian mixture distribution, the proposed method can deal with more practical cases, where the distribution of M is complex and irregular. The proposed model is solved via an EM algorithm. Additionally, we provide a theoretical analysis for our method. Empirical experiments on several real-world datasets demonstrate the robustness of the proposed model.

Acknowledgement

This work was partially supported by the following grants: NSF-IIS 1302675, NSF-IIS 1344152, NSF-DBI 1356628, NSF-IIS 1619308, NSF-IIS 1633753, NIH R01 AG049371.

References

Bellet, A. 2013. Supervised metric learning with generalization guarantees. arXiv preprint arXiv:1307.4514.
Bishop, C. 2007. Pattern Recognition and Machine Learning (Information Science and Statistics), 1st edn. 2006, corr. 2nd printing edn. Springer, New York.
Boyd, S.; Xiao, L.; and Mutapcic, A. 2003. Subgradient methods. Lecture notes of EE392o, Stanford University, Autumn Quarter 2004.
Cao, X.; Chen, Y.; Zhao, Q.; Meng, D.; Wang, Y.; Wang, D.; and Xu, Z. 2015. Low-rank matrix factorization under general mixture noise distributions. In Proceedings of the IEEE International Conference on Computer Vision, 1493–1501.
Cheung, W., and Hamarneh, G. 2009. n-SIFT: n-dimensional scale invariant feature transform. IEEE Transactions on Image Processing 18(9):2012–2021.
Davis, J. V.; Kulis, B.; Jain, P.; Sra, S.; and Dhillon, I. S. 2007. Information-theoretic metric learning. In Proceedings of the 24th International Conference on Machine Learning, 209–216. ACM.
Fan, R.-E.; Chang, K.-W.; Hsieh, C.-J.; Wang, X.-R.; and Lin, C.-J. 2008. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research 9(Aug):1871–1874.
Guillaumin, M.; Verbeek, J.; and Schmid, C. 2009. Is that you? Metric learning approaches for face identification. In Computer Vision, 2009 IEEE 12th International Conference on, 498–505. IEEE.
Gupta, M. R.; Chen, Y.; et al. 2011. Theory and use of the EM algorithm. Foundations and Trends in Signal Processing 4(3):223–296.
Huang, G. B.; Ramesh, M.; Berg, T.; and Learned-Miller, E. 2007. Labeled faces in the wild: A database for studying face recognition in unconstrained environments. Technical Report 07-49, University of Massachusetts, Amherst.
Huo, Z.; Nie, F.; and Huang, H. 2016. Robust and effective metric learning using capped trace norm: Metric learning via capped trace norm. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1605–1614. ACM.
Koestinger, M.; Hirzer, M.; Wohlhart, P.; Roth, P. M.; and Bischof, H. 2012. Large scale metric learning from equivalence constraints. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, 2288–2295. IEEE.
Kumar, N.; Berg, A. C.; Belhumeur, P. N.; and Nayar, S. K. 2009. Attribute and simile classifiers for face verification. In Computer Vision, 2009 IEEE 12th International Conference on, 365–372.
Kuznetsova, A.; Hwang, S. J.; Rosenhahn, B.; and Sigal, L. 2016. Exploiting view-specific appearance similarities across classes for zero-shot pose prediction: A metric learning approach. In AAAI, 3523–3529.
Lanitis, A.; Taylor, C. J.; and Cootes, T. F. 2002. Toward automatic simulation of aging effects on face images. IEEE Transactions on Pattern Analysis and Machine Intelligence 24(4):442–455.
Law, M. T.; Thome, N.; and Cord, M. 2014. Fantope regularization in metric learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1051–1058.
Li, J.; Kong, Y.; Zhao, H.; Yang, J.; and Fu, Y. 2016. Learning fast low-rank projection for image classification. IEEE Transactions on Image Processing 25(10):4803–4814.
Li, Y.; Tian, X.; and Tao, D. 2016. Regularized large margin distance metric learning. In Data Mining (ICDM), 2016 IEEE 16th International Conference on, 1015–1022. IEEE.
Lim, D.; McFee, B.; and Lanckriet, G. R. 2013. Robust structural metric learning. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), 615–623.
Lu, J.; Zhou, X.; Tan, Y.-P.; Shang, Y.; and Zhou, J. 2014. Neighborhood repulsed metric learning for kinship verification. IEEE Transactions on Pattern Analysis and Machine Intelligence 36(2):331–345.
Luo, L.; Yang, J.; Qian, J.; and Tai, Y. 2015. Nuclear-L1 norm joint regression for face reconstruction and recognition with mixed noise. Pattern Recognition 48(12):3811–3824.
Luo, L.; Chen, L.; Yang, J.; Qian, J.; and Zhang, B. 2016a. Tree-structured nuclear norm approximation with applications to robust face recognition. IEEE Transactions on Image Processing 25(12):5757–5767.
Luo, L.; Yang, J.; Qian, J.; Tai, Y.; and Lu, G.-F. 2016b. Robust image regression based on the extended matrix variate power exponential distribution of dependent noise. IEEE Transactions on Neural Networks and Learning Systems.
Maz'ya, V., and Schmidt, G. 1996. On approximate approximations using Gaussian kernels. IMA Journal of Numerical Analysis 16(1):13–29.
Mignon, A., and Jurie, F. 2012. PCCA: A new approach for distance learning from sparse pairwise constraints. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, 2666–2672. IEEE.
Parameswaran, S., and Weinberger, K. Q. 2010. Large margin multi-task metric learning. In Advances in Neural Information Processing Systems, 1867–1875.
Parikh, D., and Grauman, K. 2011. Relative attributes. In Computer Vision (ICCV), 2011 IEEE International Conference on, 503–510. IEEE.
Shen, L.; He, J.; Wu, S.; and Zheng, S. 2011. Face recognition from visible and near-infrared images using boosted directional binary code. In International Conference on Intelligent Computing, 404–411. Springer.
Wang, D.; Lu, H.; and Yang, M.-H. 2013. Least soft-threshold squares tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2371–2378.
Weinberger, K. Q., and Saul, L. K. 2009. Distance metric learning for large margin nearest neighbor classification. Journal of Machine Learning Research 10(Feb):207–244.
Ying, Y.; Huang, K.; and Campbell, C. 2009. Sparse metric learning via smooth optimization. In Advances in Neural Information Processing Systems, 2214–2222.
Zheng, F., and Shao, L. 2016. Learning cross-view binary identities for fast person re-identification. In IJCAI, 2399–2406.
Zheng, J.; Liu, S.; and Ni, L. M. 2014. Robust Bayesian inverse reinforcement learning with sparse behavior noise. In AAAI, 2198–2205.
Zheng, F.; Tang, Y.; and Shao, L. 2016. Hetero-manifold regularisation for cross-modal hashing. IEEE Transactions on Pattern Analysis and Machine Intelligence.
