Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence (IJCAI-17)

Learning Mahalanobis Distance: Considering Instance Disturbance Helps*

Han-Jia Ye, De-Chuan Zhan, Xue-Min Si and Yuan Jiang
National Key Laboratory for Novel Software Technology
Collaborative Innovation Center of Novel Software Technology and Industrialization
Nanjing University, Nanjing, 210023, China
{yehj, zhandc, sixm, jiangy}@lamda.nju.edu.cn

Abstract

Mahalanobis distance metric takes feature weights and correlations into account in the distance computation, which can improve the performance of many similarity/dissimilarity based methods, such as kNN. Most existing distance metric learning methods obtain the metric based on the raw features and side information but neglect their reliability. Noises or disturbances on instances change the relationships between instances and therefore affect the learned metric. In this paper, we claim that considering the disturbance of instances helps a metric learning approach obtain a robust metric, and we propose the Distance metRIc learning Facilitated by disTurbances (DRIFT) approach. In DRIFT, the noise or disturbance of each instance is learned. Therefore, the distance between each pair of (noisy) instances can be better estimated, which facilitates side information utilization and metric learning. Experiments on prediction and visualization clearly indicate the effectiveness of DRIFT.

*This work was supported by NSFC (61632004, 61673201).

1 Introduction

Similarity and dissimilarity are widely used in machine learning, e.g., for classification [Bian and Tao, 2011; Luo et al., 2016], clustering [Xing et al., 2003; Xiang et al., 2008; Law et al., 2016b] and retrieval [McFee and Lanckriet, 2010]. The goal of Distance Metric Learning (DML) is to find a distance computation that performs better than the Euclidean one. Given a positive semi-definite matrix M, the (squared) Mahalanobis distance between two instances x_i and x_j is defined as

dist²_M(x_i, x_j) = (x_i − x_j)^⊤ M (x_i − x_j) .

Since it considers the relationships between different types of features [Lim et al., 2013; Ye et al., 2016b], its advantages have been discovered and validated from various perspectives [Kulis, 2012; Bellet et al., 2015].

To train a Mahalanobis distance metric, various types of side information [Law et al., 2016a] should be collected to provide a direction for relative distance comparisons. After searching for a metric that decreases the violation of these constraints, similar instances become close to each other while dissimilar ones are far away. Although ground-truth side information leads to a well-learned Mahalanobis metric [Verma and Branson, 2015; Cao et al., 2016], it is in fact unknown during the training process. Therefore, side information is often generated based on reliable raw features from various sources, for example, random choice [Davis et al., 2007], Euclidean nearest neighbor selection [Weinberger et al., 2006], and all-pair enumeration [Xing et al., 2003; Mao et al., 2016].

To reduce the uncertainty in side information, [Huang et al., 2010] and [Wang et al., 2012] try a selection strategy among all target neighbors. It is more reasonable, however, to assume that there are inaccuracies in the collected feature values, since such inaccuracies or noises damage the neighborhood structure and consequently affect the reliability of the side information. From the aspect of this generative process, we tackle the unreliability in metric learning and propose the Distance metRIc learning Facilitated by disTurbances (DRIFT) approach, with which a robust distance metric is achieved based on the explicit consideration of instance disturbances.

In DRIFT, all possible variations of disturbances on instances are involved in an expected distance, so as to form different side information constraints as well as to assign reasonable weights to them. Specifically, when a pair of noisy instances meets the requirement of the provided side information, the learned metric of DRIFT should tolerate perturbations by enlarging the similarity region. As a result, the robustness of the metric increases and the generalization ability can be improved. On the contrary, if the side information is hard to satisfy for the concerned pair, assigning obvious perturbations can be risky. Hence, the tolerance level of disturbances on instance pairs reflects the reliability of the side information to some extent. Moreover, modeling noises with perturbation distributions gives DRIFT the ability to represent instance distributions quantitatively [Van Der Maaten et al., 2013], and helps reduce the effects of incorrect guidance in training. Therefore, DRIFT is expected to provide a robust distance metric with better discriminative ability.

DRIFT learns the metric and the disturbances of instances jointly. Benefiting from a metric decomposition, we obtain a simplified objective and accelerated variants whose sub-problems further reduce to optimization over groups of scalars. Our empirical investigations provide visualizations demonstrating the interpretability of DRIFT. Real-world tasks validate DRIFT's superiority in generalization and robustness, especially in the case of unreliable instances/side information.
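As a concrete reference for the definition above, the following minimal NumPy sketch (our own illustration, not code from the paper) evaluates the squared Mahalanobis distance for a PSD matrix M built from a random factor.

```python
import numpy as np

def mahalanobis_sq(x_i, x_j, M):
    """Squared Mahalanobis distance (x_i - x_j)^T M (x_i - x_j)."""
    diff = x_i - x_j
    return float(diff @ M @ diff)

rng = np.random.default_rng(0)
L = rng.standard_normal((3, 3))
M = L @ L.T                          # positive semi-definite by construction
x, y = rng.standard_normal(3), rng.standard_normal(3)
# Equals the squared Euclidean distance between the points projected by L^T.
print(mahalanobis_sq(x, y, M), np.sum((L.T @ (x - y)) ** 2))
```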


The rest of this paper starts with a discussion of related methods. Then the DRIFT approach is presented in detail. Experiments and the conclusion come last.

2 Related Work

Mahalanobis distance is widely studied in distance metric learning. It was originally used in the clustering task [Xing et al., 2003], considering all pairwise comparisons. ITML [Davis et al., 2007] utilizes randomly chosen pairwise side information and an information-based regularizer, while triplet constraints are considered in LMNN [Weinberger et al., 2006] to form a large margin objective. To find a better description of the side information, a multi-stage strategy is proposed in [Weinberger and Saul, 2009; Zhan et al., 2009], where the metric learned in the previous stage is used to find nearest neighbors in the current one. [Huang et al., 2010] and [Wang et al., 2012] traverse all target neighbors to find the best candidates. In DRIFT, we propose a new perspective on the refinement of side information by considering the disturbances over instances. A survey of metric learning methods and the ways they use side information can be found in [Bellet et al., 2015].

Perturbation modeling can be regarded as a type of regularization [Wager et al., 2013] to train a robust model [Chen et al., 2014; Wangni and Chen, 2016] or to obtain better feature representations [Van Der Maaten et al., 2013; Chen et al., 2015; Li et al., 2016]. [Qian et al., 2014] first consider noises in metric learning, but only a fixed covariance perturbation is used to obtain a low-rank solution. In DRIFT, we learn the perturbation distribution to directly model the noises for a robust metric.

The disturbance distribution is also closely related to the instance distributions, and consequently to the instance generation mechanism. Different from [Ye et al., 2016a], where distributions are used to model multiple metrics and to indirectly infer the metric for an unseen instance, DRIFT explicitly models the distribution related to instances and side information. Mao et al. [2016] study robust manifold learning; however, they focus on an instance distribution that preserves Euclidean distances. On the contrary, the DRIFT approach considers the disturbance distribution directly for better discriminative ability.

3 Learning Distance Metric Considering Instance Disturbance

The Distance metRIc learning Facilitated by disTurbances (DRIFT) approach learns instance disturbances and the distance metric jointly. In this section, we first introduce notation, then describe the distribution-perturbed distance computation. After that, the detailed DRIFT formulation and its optimization strategy are presented. Acceleration strategies are described at last.

3.1 Notations

Given a training set D = {x_i, y_i}_{i=1}^N, each instance x_i ∈ R^d has a label y_i ∈ {1, 2, ..., C}. We focus on input side information in the form of T triplets.¹ In the t-th triplet {x_i^t, x_j^t, x_k^t}, x_j^t is the target neighbor of instance x_i^t and the two should be close to each other under the learned distance, while x_k^t is the imposter, i.e., an instance of a different class that needs to be pushed away. The learned Mahalanobis metric M lies in the set S_d^+ of positive semi-definite matrices. ∥M∥²_F = ⟨M, M⟩ = Tr(M^⊤ M) is the (squared) Frobenius norm of a matrix. I is the identity matrix. [·]_+ is a scalar function which preserves only the non-negative part of its input. We use P to denote the set of valid probability distributions (non-negative and integrating to one over the random variable space). Denote by p_i(ϵ) ∈ P the perturbation distribution for instance x_i and by p = {p_i(ϵ)}_{i=1}^N the set of all these distributions. For a random variable ϵ ∈ R^d, the KL-divergence provides a non-negative inconsistency measure between two distributions p(ϵ) and p_0(ϵ), defined as KL(p ∥ p_0) = ∫ p(ϵ) log (p(ϵ)/p_0(ϵ)) dϵ.

¹ Learning with the perturbed distance in the pairwise form can be formulated in a similar way.

3.2 Instance Disturbances in Metric Learning

An instance disturbance affects the neighborhood structure and induces unreliability in training, which can be exploited to facilitate the utilization of side information. When perturbations are taken into account in the distance computation, variants of the instances should be used to explain the guidance of the side information. In DRIFT, we focus on the expected Mahalanobis distance with metric M between two instances x_i and x_j, which is equivalent to covering all instances x̂_i and x̂_j sampled from the instance distributions p(x_i) and p(x_j), respectively [Li et al., 2016; Mao et al., 2016]:

E_{x̂_i, x̂_j}[dist²_M(x̂_i, x̂_j)] = E_{x̂_i, x̂_j}[(x̂_i − x̂_j)^⊤ M (x̂_i − x̂_j)]
  = ∫∫ (x̂_i^⊤ M x̂_i + x̂_j^⊤ M x̂_j − 2 x̂_j^⊤ M x̂_i) p(x̂_i) p(x̂_j) dx̂_i dx̂_j
  = E_{x̂_i}[x̂_i^⊤ M x̂_i] + E_{x̂_j}[x̂_j^⊤ M x̂_j] − 2 E_{x̂_i}[x̂_i]^⊤ M E_{x̂_j}[x̂_j] .   (1)

The last step in Eq. 1 follows from the independence assumption between instances x_i and x_j. Since it is a general assumption that the disturbances on instances are centered, i.e., E_{x̂_i}[x̂_i] = x_i, the expected distance can be further transformed into

E_{x̂_i, x̂_j}[dist²_M(x̂_i, x̂_j)] = dist²_M(x_i, x_j) + ⟨M, Cov[x_i] + Cov[x_j]⟩ ,   (2)

where Cov[x_i] ∈ S_d^+ is the covariance matrix of the distribution p(x_i). Hence, the expected Mahalanobis distance between two instances is the plain distance plus a covariance term. Moreover, based on Eq. 2 we can model the disturbance of an instance by introducing an unbiased random perturbation ϵ ∈ R^d, sampled from a distribution p(ϵ), into the expected distance computation. The difference of two perturbed instances sampled from p(x_i) and p(x_j) can then be written as x̂_i − x̂_j = x_i − x_j + ϵ. Thus the disturbance over the distance also covers the variations over instances, as revealed by the distance transformation

E_{x̂, ŷ}[dist²_M(x̂, ŷ)] = dist²_M(x, y) + E_ϵ[ϵ^⊤ M ϵ] .   (3)

By the PSD property of the metric M, the last term in Eq. 3 is a quadratic form which is non-negative no matter what value ϵ takes. So the expected distance in Eq. 3 has an expansion property: it enlarges the original Mahalanobis distance value. In our DRIFT method, we consider the expected distance and learn the distribution over ϵ.
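As an illustration of Eqs. 2 and 3 (our own sketch, not the authors' code), the expected squared distance is the plain distance plus the trace term ⟨M, Cov[ϵ]⟩; a Monte-Carlo estimate over sampled perturbations should agree with the closed form.

```python
import numpy as np

def expected_dist_sq(x_i, x_j, M, cov_eps):
    """Eq. 3: expected squared Mahalanobis distance under a zero-mean
    perturbation eps with covariance cov_eps: dist^2 + <M, Cov[eps]>."""
    diff = x_i - x_j
    return float(diff @ M @ diff + np.trace(M @ cov_eps))

rng = np.random.default_rng(1)
d = 4
A = rng.standard_normal((d, d))
M = A @ A.T                          # a PSD metric
x_i, x_j = rng.standard_normal(d), rng.standard_normal(d)
cov = 0.1 * np.eye(d)                # covariance of the disturbance eps

closed_form = expected_dist_sq(x_i, x_j, M, cov)
# Monte-Carlo check: perturb the pair difference by eps ~ N(0, cov).
eps = rng.multivariate_normal(np.zeros(d), cov, size=100000)
diff = (x_i - x_j) + eps
mc = np.mean(np.einsum('nd,de,ne->n', diff, M, diff))
print(closed_form, mc)               # the two values should be close
```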


Given a triplet {x_i^t, x_j^t, x_k^t}, the metric M should make the distance between x_i^t and the imposter x_k^t larger than the distance between x_i^t and the target neighbor x_j^t by at least a margin. Due to the expansion property, there is no additional benefit in using the expected distance for the imposter comparison. Together with the fact that the target neighbor relationship possesses more uncertainty [Wang et al., 2012], it is better to invoke the expected distance only when measuring the target neighbor pair. Therefore, we formulate our Distance metRIc learning Facilitated by disTurbances (DRIFT) approach as follows:

min_{M, p}  (1/2)∥M∥²_F + λ_1 Σ_{i=1}^N KL(p_i(ϵ) ∥ p_0(ϵ)) + λ_2 Σ_{t=1}^T ξ_t ,
s.t.  ∀t:  dist²_M(x_i^t, x_k^t) − E[dist²_M(x_i^t, x̂_j^t)] ≥ 1 − ξ_t ,  ξ_t ≥ 0 ,
      M ∈ S_d^+ ,  ∀i:  p_i(ϵ) ∈ P ,   (4)

where the first part of the objective is a Frobenius norm regularizer on the metric M. The second term is a distribution regularizer: the KL-divergence keeps the learned perturbation distribution p_i(ϵ) close to a specified prior p_0(ϵ). As in the general setting [Mao et al., 2016], we choose the prior to be a zero-mean multivariate normal distribution, p_0 ∼ N(0, Σ_0), for all instances, which satisfies the unbiasedness requirement. It is notable that Eq. 4 does not constrain the form of p_i(ϵ); it only requires it to be a valid distribution. The different perturbations of instances give rise to different impacts when computing distances with others, which captures local properties of the instances. The third term minimizes the large margin violation: for each instance, the distance to an imposter should be larger (by a margin) than the expected distance to its target neighbor.

Figure 1: Illustration of the DRIFT approach. The left plot shows the large margin requirement used to optimize the metric: the distance between an instance and its imposter should be larger than the distance to its target neighbor by a margin. The right plot shows the scenario where the distribution/perturbation of a target neighbor is considered, which expands the similar range if needed. Hollow blue squares are the perturbed target neighbor samples.

In our solution, we optimize over the target neighbor x_j^t around its neighborhood to find the best disturbance, i.e., only the target neighbor x̂_j^t = x_j^t + ϵ, ϵ ∼ p_j(ϵ), is perturbed. This simplification gives the same result as placing the distribution on the perturbation of the instance differences. Since most existing distance metric learning methods use Euclidean nearest neighbors as target neighbors [Weinberger and Saul, 2009], perturbing the target neighbors relieves the problem of initial target selection and also lets the target neighbor take the right distance to the other instances. In addition, learning the parameters of the noise distribution can be regarded as measuring the tolerance of perturbation on the target neighbors. Pairs that satisfy their constraints easily can tolerate larger perturbations and expand the similar range w.r.t. a center instance, as shown in Fig. 1. On the other hand, such pairs attract more weight in training, and thus a robust metric is expected to be obtained.

3.3 Optimization for DRIFT

The objective of DRIFT can be transformed to

min_{M ∈ S_d^+, p_i ∈ P}  (1/2)∥M∥²_F + λ_1 Σ_{i=1}^N KL(p_i(ϵ) ∥ p_0(ϵ)) + λ_2 Σ_{t=1}^T ℓ( dist²_M(x_i^t, x_k^t) − E[dist²_M(x_i^t, x̂_j^t)] ) ,

where ℓ(x) = [1 − x]_+ is the hinge loss. The Mahalanobis metric and the instance disturbances are learned in an alternating manner. When M is fixed, the third part of the problem is a linear optimization over the distributions p_i, which tunes the perturbations under the guidance of the current metric. When the perturbation distributions p are fixed, the objective accounts for the influence of the target neighbors through the expected distance and finds a global distance metric that pushes imposters farther away than target neighbors. A skeleton of this alternating scheme is sketched below.
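The following skeleton is our own illustrative sketch of the alternation, not the authors' code; solve_perturbations and solve_metric are hypothetical callables standing for the two sub-problem solvers derived next.

```python
import numpy as np

def drift_alternating(triplets, X, lam1, lam2, solve_perturbations, solve_metric,
                      n_iters=20):
    """Sketch of DRIFT's alternating optimization (illustrative only).

    triplets : list of (i, j, k) index triples (anchor i, target neighbor j, imposter k)
    X        : (N, d) data matrix
    """
    d = X.shape[1]
    M = np.eye(d)                      # metric initialized to the identity
    alpha = np.zeros(len(triplets))    # dual variables of the perturbation step
    for _ in range(n_iters):
        # Step 1: fix M and update the perturbation distributions p_i,
        # i.e., the dual variables alpha and the covariances Sigma_i (Eqs. 5-9).
        alpha, covariances = solve_perturbations(M, triplets, X, lam1, lam2)
        # Step 2: fix the perturbations and update M by minimizing the
        # (smoothed) hinge loss over expected distances (Eqs. 10-11).
        M = solve_metric(covariances, triplets, X, lam2)
        # Keep M a valid metric: project onto the PSD cone.
        eigval, eigvec = np.linalg.eigh(M)
        M = (eigvec * np.clip(eigval, 0.0, None)) @ eigvec.T
    return M, alpha
```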

Fix metric M and solve distribution p: We can write this sub-problem in constrained form:

min_{p_i(ϵ) ∈ P}  λ_1 Σ_{i=1}^N KL(p_i(ϵ) ∥ p_0(ϵ)) + λ_2 Σ_{t=1}^T ξ_t ,   (5)
s.t.  ∀t:  dist²_M(x_i^t, x_k^t) − E[dist²_M(x_i^t, x̂_j^t)] ≥ 1 − ξ_t ,  ξ_t ≥ 0 .

By the convexity of the KL-divergence, we can solve Eq. 5 through its dual. With non-negative multipliers α = {α_t}_{t=1}^T and β = {β_t}_{t=1}^T, the dual problem is

max_{α, β} min_{p_i, ξ_t}  λ_1 Σ_{i=1}^N KL(p_i(ϵ) ∥ p_0(ϵ)) + λ_2 Σ_{t=1}^T ξ_t − Σ_{t=1}^T β_t ξ_t − Σ_{t=1}^T α_t ( c_t − E_{p_j^t}[ϵ^⊤ M ϵ] − 1 + ξ_t ) ,
s.t.  ∀i:  p_i(ϵ) ∈ P ,  ∀t:  α_t ≥ 0 ,  β_t ≥ 0 ,   (6)

where c_t = dist²_M(x_i^t, x_k^t) − dist²_M(x_i^t, x_j^t) = ⟨M, A_t⟩ is the difference of Mahalanobis distances (with metric M) between the imposter and the target neighbor, with A_t = (x_i^t − x_k^t)(x_i^t − x_k^t)^⊤ − (x_i^t − x_j^t)(x_i^t − x_j^t)^⊤. The expectation E_{p_j^t}[·] in Eq. 6 is taken over the distribution of the disturbance on x_j^t. Applying the stationarity property of the KKT conditions [Boyd and Vandenberghe, 2004] to Eq. 6 gives λ_2 − α_t − β_t = 0, and thus 0 ≤ α_t ≤ λ_2. Taking the derivative w.r.t. p_i(ϵ), we obtain

p_i(ϵ) ∝ exp( −(1/2) ϵ^⊤ ( Σ_0^{−1} + (2/λ_1) Σ_{t=1}^T I_j^t α_t M ) ϵ ) .   (7)

Here I_j^t = I_j^t(x_i) indicates whether the perturbation distribution of the target neighbor j in the t-th triplet belongs to x_i. Since p_i is a valid distribution, its normalization constant follows from the exponential form, which yields a multivariate normal distribution. By defining Σ_i^{−1} = Σ_0^{−1} + (2/λ_1) Σ_{t=1}^T I_j^t α_t M, we have p_i(ϵ) ∼ N(0, Σ_i). Because M is PSD, the updated covariance matrix is also PSD, meeting the requirement of a normal distribution. It is notable that the perturbation distributions of different instances differ in the way they combine the dual variables. Due to complementary slackness, α_t is zero if a large margin is preserved even with the expected distance, and the disturbance then stays close to the prior; otherwise, the distribution adapts to the current measurement so as to change the weights on the different constraints.

Substituting the distribution, we can simplify the dual problem to an optimization of f_1 over α:

max_α  f_1(α) = (λ_1/2) Σ_{i=1}^N log det(Σ_i^{−1}) + Σ_{t=1}^T α_t (1 − c_t) ,  s.t.  0 ≤ α_t ≤ λ_2 ,   (8)

where det(·) is the determinant of a matrix. From the smooth concavity of the log det(·) term, we can optimize this disturbance sub-problem with the accelerated projected gradient method [Nesterov, 2004; Li et al., 2014]. The gradient w.r.t. α_t is

∂f_1/∂α_t = (2/λ_1) Σ_{i=1}^N Σ_{t=1}^T I_j^t Tr( (Σ_0^{−1} + I_j^t α_t M)^{−1} M ) + (1 − c_t) .   (9)

Although Eq. 9 contains a matrix inverse, it can be further simplified, as shown in the next sub-section.
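As a small numerical illustration of Eq. 7 (our own sketch, using the naive matrix inverse that the next sub-section avoids), the covariance of an instance's perturbation distribution and the associated trace term can be computed as follows; q_i here denotes the accumulated dual weight Σ_t I_j^t α_t for one instance.

```python
import numpy as np

def perturbation_covariance(Sigma0_inv, M, q_i, lam1):
    """Sigma_i from Eq. 7: the inverse of (Sigma_0^{-1} + (2/lam1) * q_i * M),
    where q_i = sum_t I_j^t * alpha_t accumulates the dual variables of the
    triplets whose perturbed target neighbor belongs to instance i.
    Naive version with an explicit d x d inverse."""
    return np.linalg.inv(Sigma0_inv + (2.0 / lam1) * q_i * M)

rng = np.random.default_rng(2)
d, lam1 = 5, 1.0
A = rng.standard_normal((d, d))
M = A @ A.T                           # current PSD metric
Sigma0_inv = np.eye(d) / 0.5          # prior covariance Sigma_0 = 0.5 * I
q_i = 0.8                             # accumulated dual weight of instance i
Sigma_i = perturbation_covariance(Sigma0_inv, M, q_i, lam1)
trace_term = np.trace(Sigma_i @ M)    # trace term of the kind used in the dual gradient
print(trace_term)
```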
Fix distribution p and solve metric M: The sub-problem for the distance metric M is

min_{M ∈ S_d^+}  (1/2)∥M∥²_F + λ_2 Σ_{t=1}^T ℓ( dist²_M(x_i^t, x_k^t) − E[dist²_M(x_i^t, x̂_j^t)] ) .   (10)

Since the hinge loss is non-smooth, directly optimizing Eq. 10 with sub-gradient descent has a slow convergence rate [Beck and Teboulle, 2009]. We therefore use a smooth approximation of the hinge loss to accelerate the training of M:

ℓ_s(x) = (1/L) log( 1 + exp(−L(x − 1)) ) .   (11)

The larger the parameter L in Eq. 11, the closer ℓ_s is to the hinge loss [Zhang et al., 2003; Qian et al., 2015a]. With this smoothed loss, the sub-problem over the metric M becomes convex and smooth, and it can also be optimized with the accelerated projected gradient method.

Given the learned perturbation distribution p_i in Eq. 7, the expectation of the quadratic form in Eq. 10 can be computed analytically. For a triplet {x_i^t, x_j^t, x_k^t}, the expected term is E_{p_j^t}[ϵ^⊤ M ϵ] = ⟨E_{p_j^t}[ϵ ϵ^⊤], M⟩ = ⟨Σ_j^t, M⟩, where the covariance matrix Σ_j^t corresponds to the target neighbor j in triplet t and can be computed from the learned α. If we denote the objective over M with the smoothed loss by f_2, its gradient w.r.t. the metric M is

∂f_2/∂M = M + λ_2 Σ_{t=1}^T Σ_{i=1}^N ℓ_s′(a_t) ( A_t − I_j^t Σ_i ) ,

where a_t = dist²_M(x_i^t, x_k^t) − dist²_M(x_i^t, x_j^t) − E_{p_j^t}[ϵ^⊤ M ϵ] is the input distance value and ℓ_s′(a_t) = 1/(1 + exp(−L(a_t − 1))) − 1 is the derivative of the smoothed hinge loss.

3.4 Acceleration for DRIFT

Since M has to be projected onto the PSD cone to remain a valid metric, we can use its eigen-decomposition M = U D U^⊤ to further accelerate the optimization over α. By the symmetry of M, its eigenvector matrix U is orthogonal, and D = diag{D_1, D_2, ..., D_d} is a diagonal matrix containing its eigenvalues. In the following discussion, we set the prior covariance Σ_0 = λI.

To evaluate the objective of the dual problem in Eq. 8 when solving for the perturbation distributions, we need to compute

O_1 = Σ_{t=1}^T log det( Σ_0^{−1} + (2/λ_1) I_j^t α_t M ) .   (12)

Since the determinant of a matrix equals the product of its eigenvalues, we get O_1 = Σ_{d=1}^D log(1/λ + q_t D_d), with q_t = (2/λ_1) Σ_{t=1}^T I_j^t α_t the accumulated coefficient for each instance. Therefore, the matrix computation in Eq. 8 degrades to a computation over a group of scalars.

To compute the gradient w.r.t. α_t, we need O_2 = Tr( (Σ_0^{−1} + q_t M)^{−1} M ). Computing this trace directly requires the inverse and multiplication of d × d matrices. However, we can transform O_2 as

O_2 = Tr( (Σ_0^{−1} + q_t M)^{−1} M ) = (1/q_t) Tr( ( (1/q_t) Σ_0^{−1} + M )^{−1} M ) = (1/q_t) Tr( ( I + (1/q_t) Σ_0^{−1} M^{−1} )^{−1} ) = Σ_{d=1}^D 1 / ( q_t + 1/(λ D_d) ) .

The last equality comes from the fact that the trace of a matrix equals the sum of its eigenvalues. In summary, we transform the distribution optimization over α into a problem consisting only of computations over groups of scalars, with little computational cost.

After the distributions p are known, we need Σ_i = (Σ_0^{−1} + q_t M)^{−1} to complete the gradient computation when optimizing the metric M. We can rewrite the covariance computation as

Σ_i = U ( diag(1/λ) + q_t D )^{−1} U^⊤ = U diag( λ / (1 + q_t λ D_d) ) U^⊤ ,

which avoids the explicit inverse computation. The operator diag(·) arranges the variables indexed by d into a diagonal matrix.

The number of triplets increases on large-scale datasets, and it becomes difficult to enumerate all triplets in a single gradient computation for M. Stochastic gradient descent can be a rescue [Qian et al., 2015a] and reduces the computational burden of the metric sub-problem. In this case, we consider an upper bound of the loss over the expected distance, in which the disturbance of instances can be seamlessly embedded in the stochastic gradient of the metric. For the t-th triplet, using Jensen's inequality, we have ℓ( dist²_M(x_i^t, x_k^t) − E[dist²_M(x_i^t, x̂_j^t)] ) = [1 − dist²_M(x_i^t, x_k^t) + E[dist²_M(x_i^t, x̂_j^t)]]_+ ≤ E( [1 − dist²_M(x_i^t, x_k^t) + dist²_M(x_i^t, x̂_j^t)]_+ ). Therefore, we can optimize the following upper bound of the objective for the metric M:

min_M  (1/2)∥M∥²_F + (λ_2/T) Σ_{t=1}^T E_{p_j^t}[ 1 − dist²_M(x_i^t, x_k^t) + dist²_M(x_i^t, x̂_j^t) ]_+ ,

where an unbiased gradient can be computed by randomly choosing a triplet and disturbing its target neighbor according to the learned perturbation distribution.
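The scalar-group shortcut above can be checked numerically. The sketch below is our own illustration (assuming the prior Σ_0 = λI as in the text): with the eigen-decomposition M = U D U^⊤, the log-det, trace, and covariance terms reduce to element-wise operations on the eigenvalues, and the results agree with the naive formulas that use an explicit inverse.

```python
import numpy as np

rng = np.random.default_rng(3)
d, lam = 6, 0.5                       # dimension and prior scale (Sigma_0 = lam * I)
A = rng.standard_normal((d, d))
M = A @ A.T                           # current PSD metric
q = 0.7                               # accumulated coefficient q_t for one instance

# Eigen-decomposition M = U diag(D) U^T (done once per metric update).
D, U = np.linalg.eigh(M)

# O1-style log-det term: sum_d log(1/lam + q * D_d).
O1 = np.sum(np.log(1.0 / lam + q * D))
# O2-style trace term: sum_d 1 / (q + 1 / (lam * D_d)).
O2 = np.sum(1.0 / (q + 1.0 / (lam * D)))
# Covariance without an explicit inverse: Sigma_i = U diag(lam / (1 + q*lam*D_d)) U^T.
Sigma_i = (U * (lam / (1.0 + q * lam * D))) @ U.T

# Sanity check against the naive formulas.
naive = np.linalg.inv(np.eye(d) / lam + q * M)
print(np.allclose(Sigma_i, naive),
      np.isclose(O2, np.trace(naive @ M)),
      np.isclose(O1, np.linalg.slogdet(np.eye(d) / lam + q * M)[1]))
```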


4 Experiments

In this section, empirical investigations are conducted to validate the effectiveness of DRIFT. We first show the interpretability of the DRIFT process on synthetic data, then compare DRIFT with state-of-the-art methods on real datasets, and at last demonstrate the robustness of DRIFT given perturbed side information and instances.

4.1 Visualization on Synthetic Set

We first demonstrate the property of DRIFT on a 2D synthetic dataset. There are 600 instances in total from 2 classes, and class 1 is distributed in two separate areas, as shown in plot (a) of Fig. 2. We set the prior of DRIFT to 0.01 I. Using only a single metric, DRIFT clusters instances of the same cluster together (plot (b)). Plot (c) shows the instances in the original space whose dual variables sum to zero. As in Eq. 7, when the sum of dual variables related to a particular instance is larger than zero, a difficult constraint is identified and the perturbation covariance of that instance will be compressed. Therefore, the dual variables α_t reflect the reliability of the side information to some extent and can be used to select the skeleton of the data. We also visualize the learned distributions. The sizes of the ellipsoids in plot (d) are proportional to the covariances: the larger an ellipsoid, the wider the range of positions over which the corresponding instance can drift. Instances in areas "B" and "C" have a large neighborhood range, which satisfies the side information and enlarges the class boundary simultaneously, while instances near the class boundary (area "A") are hard to deal with. Compared with the former instances, they have smaller expected distances to others when they are selected as target neighbors, and therefore impose smaller weights on their related constraints.

Figure 2: Visualization of DRIFT's property on synthetic data. Plots (a)-(d) show the original instances, the projected instances, the selected skeleton and the learned distribution (for the left-bottom instances), respectively.

4.2 Comparisons on Real-World Benchmarks

To test the classification ability of the learned metric, we compare the proposed DRIFT with state-of-the-art metric learning methods on 15 real datasets over 30 random trials. In each trial, 70% of the data is randomly selected for training and the rest is used for testing. The parameters of each method are tuned over {10^-2, 10^-1, ..., 10^2}. We compare with three groups of methods. The first group consists of state-of-the-art metric learning methods, namely LMNN [Weinberger et al., 2006], DNE [Zhang et al., 2007], ITML [Davis et al., 2007], GMML [Zadeh et al., 2016] and RVML [Perrot and Habrard, 2015]. The second group includes methods that weight the side information during training, i.e., MSLMNN [Weinberger and Saul, 2009], LNML [Wang et al., 2012] and MSML [Qian et al., 2015b]. The last two methods consider noise/distributions in the distance computation: SGDD [Qian et al., 2014] and MPME [Mao et al., 2016]. Each learned metric is validated using 3NN (a sketch of this protocol is given at the end of this subsection); the results with the Euclidean distance are denoted as EUCLID. For our DRIFT approach, we test both the batch and the stochastic solver, denoted DRIFTB and DRIFTS respectively. In the implementation, we initialize the metric M = I and α as the zero vector. Triplets are initialized in the same way as in LMNN.

The average test errors of all methods are listed in Table 1. From the results, the classification performance of kNN can be improved with learned metrics, which shows the necessity and effectiveness of metric learning. In addition, the methods that consider the reliability of the provided side information give better results; for example, the triplet selection method LNML obtains better results than its non-selection counterpart LMNN. Although MPME considers the instance distribution during training, it only uses the Euclidean distance as a learning guidance and therefore cannot perform well when the Euclidean distance is unsuitable. Our DRIFT approach performs best on 9 of the 15 datasets. Since it considers the instance disturbance, it identifies and takes advantage of the useful side information constraints during training; compared with LNML, it gives even better results. The effectiveness of DRIFT is also validated by its t-test comparisons with the other methods.
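For reference, the following is a minimal sketch (our own, assuming integer-coded labels and a learned PSD metric M) of the 3NN evaluation protocol used above: project the data with a square root of M and run a plain k-nearest-neighbor vote in the projected space.

```python
import numpy as np

def metric_projection(M):
    """Return L with M = L L^T, so Mahalanobis distances under M equal
    Euclidean distances between the projected points x -> x @ L."""
    eigval, eigvec = np.linalg.eigh(M)
    return eigvec * np.sqrt(np.clip(eigval, 0.0, None))

def knn_test_error(M, X_train, y_train, X_test, y_test, k=3):
    """kNN test error under the learned Mahalanobis metric M (labels are ints)."""
    L = metric_projection(M)
    Ztr, Zte = X_train @ L, X_test @ L
    d2 = ((Zte[:, None, :] - Ztr[None, :, :]) ** 2).sum(-1)   # pairwise squared distances
    nn = np.argsort(d2, axis=1)[:, :k]                        # k nearest training indices
    pred = np.array([np.bincount(y_train[row]).argmax() for row in nn])
    return float(np.mean(pred != y_test))
```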


Table 1: Comparisons of classification performance (test errors, mean ± std.) based on 3NN. DRIFTB and DRIFTS are compared. The best performance on each dataset is in bold. Last two rows list the Win/Tie/Lose counts of DRIFTB/S against other methods on all datasets with t-test at significance level 95%.

Name DRIFTB DRIFTS LMNN DNE ITML GMML RVML LNML MSLMNN MSML MPME SGDD EUCLID
australia .150±.022 .174±.028 .174±.020 .217±.026 .175±.021 .203±.048 .157±.020 .155±.023 .173±.028 .162±.020 .249±.025 .233±.073 .217±.026
autompg .239±.032 .255±.035 .259±.037 .272±.033 .266±.032 .259±.034 .294±.027 .262±.040 .243±.033 .334±.058 .295±.028 .276±.052 .260±.036
balance .068±.021 .095±.028 .146±.028 .199±.019 .093±.022 .181±.018 .106±.021 .099±.016 .075±.017 .469±.105 .201±.016 .139±.026 .188±.022
credita .160±.022 .181±.027 .184±.023 .232±.021 .178±.024 .212±.041 .162±.031 .159±.019 .179±.022 .167±.022 .251±.021 .205±.042 .232±.021
german .278±.026 .281±.021 .292±.021 .296±.020 .295±.021 .284±.020 .280±.020 .284±.020 .297±.019 .275±.018 .317±.018 .299±.000 .296±.021
haberma .293±.031 .292±.034 .300±.030 .292±.032 .311±.033 .313±.028 .316±.029 .316±.032 .296±.033 .316±.030 .314±.035 .608±.128 .304±.029
hayes-r .270±.051 .278±.049 .314±.072 .411±.041 .315±.063 .385±.074 .330±.048 .275±.046 .278±.058 .397±.059 .373±.079 .421±.088 .398±.046
heart .191±.026 .194±.027 .200±.031 .190±.034 .187±.032 .191±.034 .193±.036 .194±.042 .199±.036 .230±.045 .202±.031 .209±.046 .190±.034
heart-s .184±.030 .190±.032 .195±.026 .188±.030 .187±.030 .191±.036 .193±.033 .184±.036 .207±.039 .233±.057 .219±.035 .212±.042 .188±.030
house-v .057±.017 .065±.019 .060±.017 .083±.025 .058±.019 .078±.024 .069±.016 .057±.020 .066±.018 .058±.022 .063±.018 .064±.022 .083±.025
Live-di .370±.042 .373±.038 .373±.045 .384±.040 .391±.052 .398±.041 .386±.040 .371±.043 .382±.044 .424±.043 .455±.048 .368±.049 .384±.040
promote .106±.057 .136±.077 .105±.037 .249±.063 .147±.063 .394±.093 .121±.043 .107±.047 .122±.041 .107±.046 .375±.044 .169±.073 .249±.063
segment .032±.007 .035±.007 .039±.006 .053±.008 .035±.006 .054±.008 .035±.006 .032±.006 .033±.007 .051±.011 .106±.008 .103±.056 .050±.007
sick .030±.003 .031±.003 .031±.003 .038±.004 .038±.004 .056±.006 .050±.005 .029±.005 .029±.004 .033±.004 .048±.004 .083±.045 .038±.004
sonar .141±.035 .137±.042 .145±.032 .168±.036 .170±.035 .210±.040 .236±.056 .160±.038 .203±.045 .200±.050 .183±.047 .162±.056 .168±.036

W/T/L DRIFTB vs. others: 8/7/0 12/3/0 12/3/0 12/3/0 8/7/0 4/11/0 6/9/0 11/4/0 14/1/0 14/1/0 13/2/0
W/T/L DRIFTS vs. others: 4/9/2 11/4/0 6/8/1 11/4/0 5/8/2 2/8/5 3/10/2 10/1/4 13/2/0 12/3/0 11/4/0

4.3 Investigations on Robustness

To test the robustness of the DRIFT approach when dealing with noisy side information, we test DRIFT on the above datasets with perturbed triplet constraints. The same partition as in the last subsection is used and the parameters of all methods are fixed before training. For a triplet set {x_i^t, x_j^t, x_k^t}_{t=1}^T generated with 3 target neighbors and 10 imposters based on Euclidean nearest neighbors, we construct a noisy version by sampling 20% of the triplets and exchanging the positions of x_j^t and x_k^t. We compare our DRIFT method (batch solver) with LMNN, MSLMNN and LNML, since they obtain a metric from given fixed triplets. For the multi-stage method MSLMNN, we also corrupt its newly generated side information. The results of the compared methods are listed in Fig. 3; due to the page limit, only 4 datasets are shown. The Euclidean distance results are also compared. Because the corrupted information does not influence kNN, the Euclidean baseline can often get better results than the others here. It can be found clearly in Fig. 3 that the corrupted side information has a huge impact on the metric learning process. Both LMNN and MSLMNN get worse results than the Euclidean baseline, i.e., they learn a poor metric from corrupted constraints. LNML can relieve the negative variation, but DRIFT is almost unaffected by this corrupted side information and can even train a good metric. A possible reason is that in DRIFT the perturbations of instances take the different types of side information into consideration, which improves its robustness on average.

Figure 3: Investigation of corrupted side information on (a) australia, (b) credit, (c) sick and (d) sonar (y-axis: average test error). The results on noise-free datasets are filled with color, and the increases of the error rates when training with the noisy counterparts are denoted by shadow. Error bars represent the std. over 30 trials on the corrupted datasets.

Performance under changing noise levels is also investigated. The maximum absolute values of each feature construct the basic noise vector, and multiples of it are added to the original datasets. Averaged test errors at different noise levels are recorded in Fig. 4. From the results, it is notable that DRIFT performs better than the others under various noisy environments. In summary, these two comparisons validate the robustness of DRIFT, which strengthens the advantage of DRIFT in an unknown scenario.

Figure 4: Averaged test errors on australia and credit when different multiples of the basic noise are added (x-axis: noise level from 0x to 8x), where the numerical value before "x" represents the multiple of the basic noise vector added.

5 Conclusion

We claim that one prominent source of side information noise comes from inaccuracies in feature values, which damage the neighbor structure and seriously degenerate the robustness of metric learning approaches. Aiming at this noisy instance issue, the Distance metRIc learning Facilitated by disTurbances (DRIFT) approach is proposed in this paper, which considers the perturbations on instance target pairs to learn a robust metric. It is notable that the expected distance for noisy instances is not only used for modeling types of feature value perturbations but also takes the constraint weights into account. Acceleration strategies for DRIFT are also provided. Experiments on real datasets validate the effectiveness of DRIFT on classification performance. Results under noisy environments also highlight DRIFT's superiorities. DRIFT could also be used to reveal instance relationships for graph construction, which is an interesting direction for future work.


References

[Beck and Teboulle, 2009] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183-202, 2009.
[Bellet et al., 2015] A. Bellet, A. Habrard, and M. Sebban. Metric Learning. Synthesis Lectures on Artificial Intelligence and Machine Learning. Morgan & Claypool Publishers, 2015.
[Bian and Tao, 2011] W. Bian and D. Tao. Learning a distance metric by empirical loss minimization. In IJCAI, pages 1186-1191, Barcelona, Spain, 2011.
[Boyd and Vandenberghe, 2004] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
[Cao et al., 2016] Q. Cao, Z.-C. Guo, and Y. Ying. Generalization bounds for metric and similarity learning. MLJ, 102(1):115-132, 2016.
[Chen et al., 2014] N. Chen, J. Zhu, J. Chen, and B. Zhang. Dropout training for support vector machines. In AAAI, pages 1752-1759, Quebec, Canada, 2014.
[Chen et al., 2015] M. Chen, K. Q. Weinberger, Z. E. Xu, and F. Sha. Marginalizing stacked linear denoising autoencoders. JMLR, 16:3849-3875, 2015.
[Davis et al., 2007] J. V. Davis, B. Kulis, P. Jain, S. Sra, and I. S. Dhillon. Information-theoretic metric learning. In ICML, pages 209-216, Corvalis, OR, 2007.
[Huang et al., 2010] K. Huang, R. Jin, Z. Xu, and C.-L. Liu. Robust metric learning by smooth optimization. In UAI, pages 244-251, Catalina Island, CA, 2010.
[Kulis, 2012] B. Kulis. Metric learning: A survey. Foundations and Trends in Machine Learning, 5(4):287-364, 2012.
[Law et al., 2016a] M. T. Law, N. Thome, and M. Cord. Learning a distance metric from relative comparisons between quadruplets of images. International Journal of Computer Vision, pages 1-30, 2016.
[Law et al., 2016b] M. T. Law, Y. Yu, M. Cord, and E. P. Xing. Closed-form training of Mahalanobis distance for supervised clustering. In CVPR, pages 3909-3917, Las Vegas, NV, 2016.
[Li et al., 2014] N. Li, R. Jin, and Z.-H. Zhou. Top rank optimization in linear time. In NIPS, pages 1502-1510. Cambridge, MA: MIT Press, 2014.
[Li et al., 2016] Y. Li, M. Yang, Z. Xu, and Z. Zhang. Learning with marginalized corrupted features and labels together. In AAAI, pages 1251-1257, Phoenix, AZ, 2016.
[Lim et al., 2013] D. Lim, G. Lanckriet, and B. McFee. Robust structural metric learning. In ICML, pages 615-623, Atlanta, GA, 2013.
[Luo et al., 2016] Y. Luo, Y. Wen, and D. Tao. On combining side information and unlabeled data for heterogeneous multi-task metric learning. In IJCAI, pages 1809-1815, New York, NY, 2016.
[Mao et al., 2016] Q. Mao, L. Wang, and I. W. Tsang. A unified probabilistic framework for robust manifold learning and embedding. MLJ, pages 1-24, 2016.
[McFee and Lanckriet, 2010] B. McFee and G. R. Lanckriet. Metric learning to rank. In ICML, pages 775-782, Haifa, Israel, 2010.
[Nesterov, 2004] Y. Nesterov. Introductory Lectures on Convex Optimization, volume 87. Springer Science & Business Media, 2004.
[Perrot and Habrard, 2015] M. Perrot and A. Habrard. Regressive virtual metric learning. In NIPS, pages 1810-1818. Cambridge, MA: MIT Press, 2015.
[Qian et al., 2014] Q. Qian, J. Hu, R. Jin, J. Pei, and S. Zhu. Distance metric learning using dropout: a structured regularization approach. In ACM SIGKDD, pages 323-332, New York, NY, 2014.
[Qian et al., 2015a] Q. Qian, R. Jin, J. Yi, L. Zhang, and S. Zhu. Efficient distance metric learning by adaptive sampling and mini-batch stochastic gradient descent (SGD). MLJ, 99(3):353-372, 2015.
[Qian et al., 2015b] Q. Qian, R. Jin, S. Zhu, and Y. Lin. Fine-grained visual categorization via multi-stage metric learning. In CVPR, pages 3716-3724, Boston, MA, 2015.
[Van Der Maaten et al., 2013] L. Van Der Maaten, M. Chen, S. Tyree, and K. Q. Weinberger. Learning with marginalized corrupted features. In ICML, pages 410-418, Atlanta, GA, 2013.
[Verma and Branson, 2015] N. Verma and K. Branson. Sample complexity of learning Mahalanobis distance metrics. In NIPS, pages 2584-2592. Cambridge, MA: MIT Press, 2015.
[Wager et al., 2013] S. Wager, S. Wang, and P. S. Liang. Dropout training as adaptive regularization. In NIPS, pages 351-359. Cambridge, MA: MIT Press, 2013.
[Wang et al., 2012] J. Wang, A. Woznica, and A. Kalousis. Learning neighborhoods for metric learning. In ECML/PKDD, pages 223-236, Bristol, UK, 2012.
[Wangni and Chen, 2016] J. Wangni and N. Chen. Nonlinear feature extraction with max-margin data shifting. In AAAI, pages 2208-2214, Phoenix, AZ, 2016.
[Weinberger and Saul, 2009] K. Q. Weinberger and L. K. Saul. Distance metric learning for large margin nearest neighbor classification. JMLR, 10:207-244, 2009.
[Weinberger et al., 2006] K. Q. Weinberger, J. Blitzer, and L. K. Saul. Distance metric learning for large margin nearest neighbor classification. In NIPS, pages 1473-1480. Cambridge, MA: MIT Press, 2006.
[Xiang et al., 2008] S. Xiang, F. Nie, and C. Zhang. Learning a Mahalanobis distance metric for data clustering and classification. Pattern Recognition, 41(12):3600-3612, 2008.
[Xing et al., 2003] E. P. Xing, A. Y. Ng, M. I. Jordan, and S. Russell. Distance metric learning with application to clustering with side-information. In NIPS, pages 505-512. Cambridge, MA: MIT Press, 2003.
[Ye et al., 2016a] H.-J. Ye, D.-C. Zhan, and Y. Jiang. Instance specific metric subspace learning: A Bayesian approach. In AAAI, pages 2272-2278, Phoenix, AZ, 2016.
[Ye et al., 2016b] H.-J. Ye, D.-C. Zhan, X.-M. Si, and Y. Jiang. Learning feature aware metric. In ACML, pages 286-301, Hamilton, New Zealand, 2016.
[Zadeh et al., 2016] P. H. Zadeh, R. Hosseini, and S. Sra. Geometric mean metric learning. In ICML, pages 2464-2471, New York, NY, 2016.
[Zhan et al., 2009] D.-C. Zhan, M. Li, Y.-F. Li, and Z.-H. Zhou. Learning instance specific distances using metric propagation. In ICML, pages 1225-1232, Montreal, Canada, 2009.
[Zhang et al., 2003] J. Zhang, R. Jin, Y. Yang, and A. G. Hauptmann. Modified logistic regression: An approximation to SVM and its applications in large-scale text categorization. In ICML, pages 888-895, Washington, D.C., 2003.
[Zhang et al., 2007] W. Zhang, X. Xue, Z. Sun, Y.-F. Guo, and H. Lu. Optimal dimensionality of metric space for classification. In ICML, pages 1135-1142, Corvallis, OR, 2007.
