RELIEF Algorithm and Similarity Learning for K-NN
International Journal of Computer Information Systems and Industrial Management Applications.
ISSN 2150-7988 Volume 4 (2012) pp. 445-458
© MIR Labs, www.mirlabs.net/ijcisim/index.html

Ali Mustafa Qamar¹ and Eric Gaussier²

¹ Assistant Professor, Department of Computing,
School of Electrical Engineering and Computer Science (SEECS),
National University of Sciences and Technology (NUST), Islamabad, Pakistan
[email protected]

² Laboratoire d'Informatique de Grenoble,
Université de Grenoble, France
[email protected]

Abstract: In this paper, we study the links between RELIEF, a well-known feature re-weighting algorithm, and SiLA, a similarity learning algorithm. On the one hand, SiLA directly reduces the leave-one-out error, or 0-1 loss, by reducing the number of mistakes made on unseen examples. On the other hand, it has been shown that RELIEF can be seen as a distance learning algorithm in which a linear utility function is optimized with maximum margin. We first propose here a version of this algorithm for similarity learning, called RBS (RELIEF-Based Similarity learning). Like RELIEF, and unlike SiLA, RBS does not try to optimize the leave-one-out error or 0-1 loss, and it does not perform very well in practice, as we illustrate on several UCI collections. We thus introduce a stricter version of RBS, called sRBS, which relies on a cost function closer to the 0-1 loss. Moreover, we also develop positive semi-definite (PSD) versions of the RBS and sRBS algorithms, in which the learned similarity matrix is projected onto the set of PSD matrices. Experiments conducted on several datasets illustrate the different behaviors of these algorithms when learning similarities for kNN classification. The results indicate in particular that the 0-1 loss is a more appropriate cost function than the one implicitly used by RELIEF. Furthermore, the projection onto the set of PSD matrices improves the results of the RELIEF algorithm only.

Keywords: similarity learning, RELIEF algorithm, positive semi-definite (PSD) matrices, SiLA algorithm, kNN classification, machine learning

I. Introduction

The k nearest neighbor (kNN) algorithm is a simple yet efficient classification algorithm: to classify an example x, it finds the k nearest neighbors of x, according to a distance or similarity metric, among a set of already classified examples, and assigns x to the most represented class among these neighbors. Many people have improved the performance of the kNN algorithm by learning the underlying geometry of the space containing the data, e.g. by learning a Mahalanobis distance instead of the standard Euclidean one. This has paved the way for a new research theme termed metric learning. Most of the people working in this area are interested in learning a distance metric (see e.g. [1, 2, 3, 4]) rather than a similarity. However, as argued by several researchers, similarities should be preferred over distances on some data sets. Similarity is usually preferred over a distance metric when dealing with text, where the cosine similarity has been deemed more appropriate than the various distance metrics. Furthermore, the studies reported in [5, 6, 7, 8], in which the cosine similarity was compared with the Euclidean distance on 15 different datasets, showed that the cosine should be preferred over the Euclidean distance on non-textual data sets as well. Umugwaneza and Zou [9] have combined the cosine similarity and the Euclidean distance for trademark retrieval, fine-tuning the proportion of each of the two measures. Similarly, Porwik et al. [10] have compared many different similarity and distance measures, such as the Euclidean distance, the Soergel distance, the cosine similarity, and the Jaccard and Dice coefficients.

RELIEF (originally proposed by Kira and Rendell [11]) is an online feature re-weighting algorithm that has been successfully employed in various settings. It learns a vector of weights for the different features, or attributes, describing their importance. It has been proved by Sun and Wu [12] that RELIEF implicitly aims at maximizing the margin of a linear utility function.

SiLA [8] is a similarity metric learning algorithm for nearest neighbor classification. It aims at moving the nearest neighbors belonging to the same class as the input example (termed target neighbors) closer to it, while pushing away the nearest examples belonging to different classes (termed impostors). The similarity function between two examples x and y can be written as:

    s_A(x, y) = (x^t A y) / N(x, y)        (1)

where t denotes the transpose, A is a (p × p) similarity matrix, and N(x, y) is a normalization function which depends on x and y (this normalization is typically used to map the similarity function to a particular interval, such as [0, 1]). Equation 1 generalizes several standard similarity functions; for instance, the cosine measure, widely used in text retrieval, is obtained by setting A to the identity matrix I and N(x, y) to the product of the L2 norms of x and y. The aim here is to reduce the 0-1 loss, which depends on the number of mistakes made during the classification phase. In the remainder of the paper, a matrix is sometimes also represented as a vector (e.g. a p × p matrix can be represented by a vector with p² elements).
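To make Equation 1 concrete, the following Python sketch (our own illustration, not code from the paper; the names bilinear_similarity and knn_predict are ours) computes s_A(x, y) with N(x, y) taken as the product of L2 norms, so that A = I recovers the cosine measure, and uses it in a simple kNN rule. Algorithms such as SiLA, RBS and sRBS differ in how they learn the matrix A that is plugged into such a rule.

```python
import numpy as np

def bilinear_similarity(x, y, A):
    """s_A(x, y) = (x^t A y) / N(x, y), with N(x, y) = ||x||_2 * ||y||_2.
    With A equal to the identity matrix this is the cosine similarity."""
    return (x @ A @ y) / (np.linalg.norm(x) * np.linalg.norm(y))

def knn_predict(x, train_X, train_c, A, k=3):
    """Assign x to the majority class among its k most similar training examples."""
    sims = np.array([bilinear_similarity(x, z, A) for z in train_X])
    neighbors = np.argsort(-sims)[:k]                 # indices of the k largest similarities
    classes, counts = np.unique(train_c[neighbors], return_counts=True)
    return classes[np.argmax(counts)]

# Toy usage: with A = I this is plain cosine-based kNN classification.
rng = np.random.default_rng(0)
train_X = rng.normal(size=(20, 5))
train_c = rng.integers(0, 2, size=20)
print(knn_predict(rng.normal(size=5), train_X, train_c, np.eye(5), k=3))
```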
The rest of the paper is organized as follows: Section 2 describes the SiLA algorithm. This is followed by a short introduction to the RELIEF algorithm, along with its mathematical interpretation, its comparison with SiLA, and a RELIEF-Based Similarity learning algorithm (RBS), in Section 3. Section 4 introduces a strict version of RBS, followed by the experimental results and the conclusion.

II. SiLA - A Similarity Learning Algorithm

The SiLA algorithm is described in detail here. It is a similarity learning algorithm and a variant of the voted perceptron algorithm of Freund and Schapire [13], later used in Collins [14].

SiLA - Training (k=1)
Input: training set ((x^(1), c^(1)), ..., (x^(n), c^(n))) of n vectors in R^p, number of epochs J; A_ml denotes the element of A at row m and column l
Output: list of weighted (p × p) matrices ((A_1, w_1), ..., (A_q, w_q))
Initialization: t = 1, A^(1) = 0 (null matrix), w_1 = 0
Repeat J times (epochs)
  1. for i = 1, ..., n
  2.   if s_A(x^(i), y) - s_A(x^(i), z) ≤ 0
  3.     ∀(m, l), 1 ≤ m, l ≤ p: A_ml^(t+1) = A_ml^(t) + f_ml(x^(i), y) - f_ml(x^(i), z)
  4.     w_{t+1} = 1
  5.     t = t + 1
  6.   else
  7.     w_t = w_t + 1

Whenever an example x^(i) is not separated from the differently labeled examples, the current matrix A is updated by the difference between the coordinates of the target neighbors (denoted by y) and the impostors (denoted by z), as described in line 3 of the algorithm. This corresponds to the standard perceptron update. Conversely, when the current A correctly classifies the example under focus, its weight is increased by 1 (line 7). The function f_ml(x, y) depends on the form of the matrix considered: for a diagonal matrix, f_ml(x, y) = δ(m, l) x_m y_l / N(x, y) (where δ is the Kronecker symbol); for a symmetric matrix, f_ml(x, y) = (x_m y_l + x_l y_m) / N(x, y); and for a square matrix (and hence, potentially, an asymmetric similarity), f_ml(x, y) = x_m y_l / N(x, y).
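As an illustration of this training loop, here is a short Python sketch (again our own, not the authors' code) of the k = 1 updates for the square-matrix form of f_ml. The normalization N(x, y) as a product of L2 norms, the way the nearest target neighbor y and nearest impostor z are selected under the current matrix A, and the assumption that every class has at least two training examples are choices made to keep the example runnable; the returned list corresponds to the output ((A_1, w_1), ..., (A_q, w_q)) of the pseudocode.

```python
import numpy as np

def f_matrix(x, y):
    """Square-matrix form of f_ml: f_ml(x, y) = x_m y_l / N(x, y),
    i.e. the outer product of x and y divided by the normalization."""
    return np.outer(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

def sim(x, y, A):
    """Bilinear similarity of Equation 1 with N(x, y) = ||x||_2 * ||y||_2."""
    return (x @ A @ y) / (np.linalg.norm(x) * np.linalg.norm(y))

def sila_train(X, c, n_epochs=5):
    """SiLA training (k = 1) following the pseudocode above; X is an (n, p) array
    and c a length-n array of class labels."""
    n, p = X.shape
    A, w, matrices = np.zeros((p, p)), 0, []
    for _ in range(n_epochs):
        for i in range(n):
            x = X[i]
            same = [j for j in range(n) if j != i and c[j] == c[i]]
            other = [j for j in range(n) if c[j] != c[i]]
            y = X[max(same, key=lambda j: sim(x, X[j], A))]   # nearest target neighbor
            z = X[max(other, key=lambda j: sim(x, X[j], A))]  # nearest impostor
            if sim(x, y, A) - sim(x, z, A) <= 0:              # mistake: perceptron update (line 3)
                matrices.append((A.copy(), w))
                A = A + f_matrix(x, y) - f_matrix(x, z)
                w = 1                                         # the new matrix starts with weight 1
            else:
                w += 1                                        # correct: reinforce the current matrix
    matrices.append((A.copy(), w))
    return matrices
```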
III. RELIEF and its mathematical interpretation

Sun and Wu [12] have shown that the RELIEF algorithm solves a convex optimization problem, maximizing a margin-based objective function that relies on the kNN algorithm. It learns a vector of weights for the features, based on the nearest hit (the nearest example belonging to the class under consideration, also known as the nearest target neighbor) and the nearest miss (the nearest example belonging to another class, also known as the nearest impostor). In its original setting, RELIEF only learns a diagonal matrix. However, Sun and Wu [12] have learned a full distance metric matrix and have also proved that RELIEF is basically an online algorithm.

In order to describe the RELIEF algorithm, we suppose that x^(i) is a vector in R^p with y^(i) as the corresponding class label, taking values in {+1, -1}. Furthermore, let A be a vector of weights initialized with 0; this weight vector, learned on a set of training examples, captures the quality of the various attributes. Suppose an example x^(i) is randomly selected. Two nearest neighbors of x^(i) are then found: one from the same class (termed the nearest hit, H) and one from a class other than that of x^(i) (termed the nearest miss, M). Unlike in SiLA, the update rule of RELIEF does not depend on any condition.

The RELIEF algorithm is presented next:

RELIEF (k=1)
Input: training set ((x^(1), c^(1)), ..., (x^(n), c^(n))) of n vectors in R^p, number of epochs J
Output: the vector A of estimations of the qualities of the attributes
Initialization: 1 ≤ m ≤ p, A_m = 0
Repeat J times (epochs)
  1. randomly select an instance x^(i)
  2. find the nearest hit H and the nearest miss M
  3. for l = 1, ..., p
  4.   A_l = A_l - diff(l, x^(i), H) / J + diff(l, x^(i), M) / J

where J represents the number of times RELIEF is executed, while diff computes the difference between the values of attribute l for the current example x^(i) and the nearest hit H or the nearest miss M.
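To make the update in line 4 concrete, the following Python sketch (our own, not the authors' code) runs the loop above; the use of the Euclidean distance to find H and M and of the absolute per-attribute difference for diff are assumptions of this sketch, since the excerpt does not fix them.

```python
import numpy as np

def relief(X, c, n_iterations):
    """RELIEF (k = 1): learn one weight per attribute from the nearest hit H and
    the nearest miss M of randomly drawn examples. X is an (n, p) array and c a
    length-n numpy array of class labels."""
    rng = np.random.default_rng(0)
    n, p = X.shape
    A = np.zeros(p)                                   # one weight per attribute
    for _ in range(n_iterations):
        i = rng.integers(n)                           # 1. randomly select an instance
        x = X[i]
        dists = np.linalg.norm(X - x, axis=1)         # assumption: Euclidean distances
        dists[i] = np.inf                             # exclude the example itself
        hits = np.where(c == c[i])[0]
        misses = np.where(c != c[i])[0]
        H = X[hits[np.argmin(dists[hits])]]           # 2. nearest hit
        M = X[misses[np.argmin(dists[misses])]]       #    and nearest miss
        diff_H = np.abs(x - H)                        # assumption: diff(l, x, H) = |x_l - H_l|
        diff_M = np.abs(x - M)
        A += -diff_H / n_iterations + diff_M / n_iterations   # 4. weight update
    return A
```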