Experimental Analysis of Crisp Similarity and Distance Measures
International Conference of Soft Computing and Pattern Recognition

Leila Baccour
REGIM-Lab.: REsearch Groups in Intelligent Machines, University of Sfax, ENIS, BP 1173, Sfax, 3038, Tunisia.
Email: [email protected]

Robert I. John
Automated Scheduling, Optimisation and Planning (ASAP) group, Computer Science, Jubilee Campus, Wollaton Road, Nottingham, NG8 1BB, UK
Email: [email protected]

Abstract—Distance measures are a requirement in many classification systems. Euclidean distance is the most commonly used in these systems. However, there exists a huge number of distance and similarity measures. In this paper we present an experimental analysis of crisp distance and similarity measures applied to shapes classification and Arabic sentences recognition.

Keywords—Similarity measures, distance measures, crisp sets, classification of shapes, classification of Arabic sentences

I. INTRODUCTION

In some applications such as recognition, classification and clustering, it is necessary to compare two objects described with vectors of features. This operation is based on a distance or a similarity measure. There are many such measures, deployed in fields as diverse as psychology, sociology (e.g. comparing test results), medicine (e.g. comparing parameters of patients), economics (e.g. comparing balance sheet ratios), etc. The characteristics of the data can be very different. Essentially, data can be discrete (i.e. binary or real) or continuous, nominal or numeric. The application of a measure depends on the type of the objects to be compared and the data describing them. In this paper we are interested in similarity and distance measures used in classification and clustering of objects in general. Our intention is to show the importance of the choice of distance or similarity measures in classification systems. Thus, measures from the literature are applied to the classification of two data sets using the KNN classifier. The obtained results are analyzed and conclusions are given.

In the next section, we detail the properties of distance and similarity measures between crisp data. In section three, we present measures between discrete data. In sections four and five we apply, respectively, similarity and distance measures to shapes classification and Arabic sentences recognition.

II. PROPERTIES OF CRISP DISTANCE AND SIMILARITY MEASURES

A distance measure between two crisp vectors x and y, in a discourse universe E, is defined as a function $d : E^2 \rightarrow \mathbb{R}^+$. For all $x, y, z \in E$, this function is required to satisfy the following conditions:

1) minimality: $\forall x \in E, \; d(x, x) = 0$
2) identity: $\forall x, y \in E, \; d(x, y) = 0 \Rightarrow x = y$
3) symmetry: $\forall x, y \in E, \; d(x, y) = d(y, x)$
4) triangular inequality: $\forall x, y, z \in E, \; d(x, y) \leq d(x, z) + d(z, y)$

This function is called an index of dissimilarity, an index of distance (semi-metric distance), a distance, or an ultra-metric distance according to one of the following cases:

• index of dissimilarity: the function d satisfies properties 1 and 2.
• index of distance or semi-metric distance: the function d satisfies properties 1, 2 and 3.
• distance: the function d satisfies properties 1, 2, 3 and 4.
• ultra-metric distance: the function d verifies the property $\forall x, y, z \in E, \; d(x, y) \leq \max(d(x, z), d(z, y))$.

Two patterns x and y are close if the distance d(x, y) tends to 0; d(x, y) = 0 implies a complete similarity between x and y. Conversely, for a similarity measure, x and y are close if the similarity is high, and S(x, y) = 0 implies a complete dissimilarity between x and y. The choice of distance or similarity measures depends on the application type and on the description of the objects.
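To make these axioms concrete, the following Python sketch numerically spot-checks conditions 1-4 for a candidate measure on a small set of sample vectors. This is our illustration, not code from the paper; the function name, the tolerance, and the use of random test vectors are assumptions, and passing the checks on a sample is evidence rather than a proof that a measure is a true distance.

```python
import itertools
import random

def check_distance_properties(d, vectors, tol=1e-9):
    """Spot-check conditions 1-4 of Section II for a candidate measure d."""
    for x in vectors:
        assert d(x, x) <= tol                        # 1) minimality
    for x, y in itertools.permutations(vectors, 2):
        if d(x, y) <= tol:                           # 2) identity (one-sided check)
            assert x == y
        assert abs(d(x, y) - d(y, x)) <= tol         # 3) symmetry
    for x, y, z in itertools.permutations(vectors, 3):
        assert d(x, y) <= d(x, z) + d(z, y) + tol    # 4) triangular inequality

# Example: the Manhattan distance (equation (2) below) satisfies all four conditions.
manhattan = lambda x, y: sum(abs(a - b) for a, b in zip(x, y))
samples = [[random.random() for _ in range(5)] for _ in range(8)]
check_distance_properties(manhattan, samples)
```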
III. SIMILARITY AND DISTANCE MEASURES FOR DISCRETE DATA

In the following we give the mathematical definitions of distances from the literature that measure the closeness between two samples $x(x_1 \ldots x_n)$ and $y(y_1 \ldots y_n) \in E$ having n numeric attributes.

1) Minkowski Distances: The Minkowski distance [1] is a generalized metric, also called the $L_p$ norm, between two samples x and y. It is defined by the following formula:

d_{MK}(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}    (1)

where $p > 0$ is the order of the Minkowski metric. This distance is the generalization of the following distances (2, 3) [2]. According to the value of the parameter p, we have one of the following distances:

• $p = 1$: Manhattan distance, also called the $L_1$ norm, Taxicab norm, rectilinear distance or City-Block distance. The Manhattan distance between x and y is defined as:

d_{Ma}(x, y) = \sum_{i=1}^{n} |x_i - y_i|    (2)

For binary vectors, this distance degenerates to the Hamming distance [3].

• $p = 2$: Euclidean distance, $L_2$ norm or Ruler distance, is the "ordinary" distance between two points and has the following formula:

d_{E}(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}    (3)

If the square root is omitted, the obtained formula is that of the squared Euclidean distance.

• $p = \infty$: Chebyshev distance [4], [5], also called maximum value distance, chessboard distance in 2D, or the minimax approximation:

d_{Ch}(x, y) = \max_{i} |x_i - y_i|    (4)

2) Spearman Distance: The Spearman distance is the square of the Euclidean distance between two rank vectors, defined as:

d_{S}(x, y) = \sum_{i=1}^{n} (x_i - y_i)^2    (5)

3) Hamming Distance: The Hamming distance [6] is a metric expressing the distance between two samples by the number of mismatches of their pairs of attributes. It is mainly used for nominal data, string and bitwise analyses, but can also be useful for numerical variables. It can be computed by applying the XOR operator to the corresponding bits, or equivalently:

d_{HA}(x, y) = \sum_{i=1}^{n} [x_i \neq y_i]    (6)

The Hamming distance is an important measurement for the detection of errors in information transmission [6].

4) L1 Distances:

• The Bray-Curtis or Sorensen distance: The non-metric Bray-Curtis dissimilarity [7] is one of the most commonly applied measurements to express relationships in ecology, environmental sciences and related fields. Bray-Curtis is a modified Manhattan measurement, where the summed differences between the attribute values of the samples x and y are standardized by their summed attribute values. The general equation of the Bray-Curtis dissimilarity is:

d_{BC}(x, y) = \frac{\sum_{i=1}^{n} |x_i - y_i|}{\sum_{i=1}^{n} (x_i + y_i)}    (7)

or, equivalently:

d_{BC}(x, y) = \frac{\sum_{i=1}^{n} |x_i - y_i|}{\sum_{i=1}^{n} x_i + \sum_{i=1}^{n} y_i}    (8)

The Bray-Curtis similarity $S_{BC}$ is a slightly modified equation. It can be directly calculated from the dissimilarity value: $S_{BC} = 1 - d_{BC}$. The result is undefined when the attribute values of the two samples x and y are entirely 0, since the denominator becomes 0. In this case [8] suggest using a zero-adjusted Bray-Curtis coefficient and propose this modified formula:

d_{BC_M}(x, y) = \frac{\sum_{i=1}^{n} |x_i - y_i|}{2 + \sum_{i=1}^{n} (x_i + y_i)}    (9)

• Lorentzian distance, defined as [9]:

d_{Lo}(x, y) = \sum_{i=1}^{n} \ln(1 + |x_i - y_i|)    (10)

1 is added to guarantee the non-negativity property and to avoid the log of zero.

• Canberra distance: it was introduced in 1966 [10] and has the following formula:

d_{C}(x, y) = \sum_{i=1}^{n} \frac{|x_i - y_i|}{|x_i + y_i|}    (11)

$d_C$ (11) is slightly modified by the following formula, suggested in [11]:

d_{C}(x, y) = \sum_{i=1}^{n} \frac{|x_i - y_i|}{|x_i| + |y_i|}    (12)

The Canberra metric is mainly used for positive values. This metric is very sensitive for values close to 0, where it is more sensitive to proportional than to absolute differences [11]. This characteristic becomes more apparent in higher dimensional spaces, i.e. with an increasing number of variables.
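As a concrete companion to equations (1)-(12), here is a minimal Python sketch of several of these measures. It is our own illustration rather than the authors' code; in particular, skipping zero-denominator terms in the Canberra sum is our assumption, since the text only states that the metric is mainly used for positive values.

```python
import math

def minkowski(x, y, p):        # eq. (1); p = 1 gives (2), p = 2 gives (3)
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

def chebyshev(x, y):           # eq. (4), the limit p -> infinity
    return max(abs(a - b) for a, b in zip(x, y))

def hamming(x, y):             # eq. (6): number of mismatched attributes
    return sum(a != b for a, b in zip(x, y))

def bray_curtis(x, y):         # eq. (7); undefined when both samples are all zero
    return sum(abs(a - b) for a, b in zip(x, y)) / sum(a + b for a, b in zip(x, y))

def lorentzian(x, y):          # eq. (10); the +1 avoids the log of zero
    return sum(math.log(1.0 + abs(a - b)) for a, b in zip(x, y))

def canberra(x, y):            # eq. (12); zero-denominator terms skipped (our assumption)
    return sum(abs(a - b) / (abs(a) + abs(b))
               for a, b in zip(x, y) if abs(a) + abs(b) > 0)

x, y = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]
print(minkowski(x, y, 1))      # 6.0     (Manhattan, eq. 2)
print(minkowski(x, y, 2))      # ~3.742  (Euclidean, eq. 3)
print(chebyshev(x, y))         # 3.0
print(bray_curtis(x, y))       # ~0.333
print(canberra(x, y))          # 1.0
```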
5) Inner Product Similarities:

• Inner product similarity:

S_{IP} = X \cdot Y = \sum_{i=1}^{n} x_i y_i    (13)

where $X \cdot Y$ is the scalar product between X and Y. The inner product is sometimes called the scalar product or dot product [12]. If it is used for binary vectors, it is called the number of matches or the overlap.

• Harmonic mean similarity:

S_{HM} = 2 \sum_{i=1}^{n} \frac{x_i y_i}{x_i + y_i}    (14)

• Cosine similarity is the normalized inner product, called the cosine coefficient because it measures the angle between two vectors. It is sometimes called the angular metric [9]. Other names for the cosine coefficient include Ochiai [9], [13] and Carbo [13].

S_{Cos} = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2} \sqrt{\sum_{i=1}^{n} y_i^2}}    (15)

• Jaccard similarity: The Jaccard index, also known as the Jaccard similarity coefficient [14], is widely used in various fields such as ecology and biology. It is defined as:

S_{Jac} = \frac{\sum_{i=1}^{n} x_i y_i}{\sum_{i=1}^{n} x_i^2 + \sum_{i=1}^{n} y_i^2 - \sum_{i=1}^{n} x_i y_i}    (16)

The Jaccard distance, computed as $1 - S_{Jac}$, has the following formula:

d_{Jac} = \frac{\sum_{i=1}^{n} (x_i - y_i)^2}{\sum_{i=1}^{n} x_i^2 + \sum_{i=1}^{n} y_i^2 - \sum_{i=1}^{n} x_i y_i}    (17)

Jaccard similarity is the same as Kumar-Hassebrook similarity.

• Dice similarity is defined in [15] as:

S_{Dice} = \frac{2 \sum_{i=1}^{n} x_i y_i}{\sum_{i=1}^{n} x_i^2 + \sum_{i=1}^{n} y_i^2}    (18)

also known as the Dice coefficient [16]. The Dice distance, computed as $1 - S_{Dice}$, does not satisfy the triangle inequality and has the following formula:

d_{Dice} = \frac{\sum_{i=1}^{n} (x_i - y_i)^2}{\sum_{i=1}^{n} x_i^2 + \sum_{i=1}^{n} y_i^2}    (19)

The formula for the modified Hausdorff distance concerns the distance h (22), which is substituted by a distance g as follows:

g(x, y) = \frac{1}{|x|} \sum_{x_i \in x} \min_{y_i \in y} d_1(x_i, y_i)    (24)

where $d_1$ is any distance and $|x|$ is the cardinality of the set x. Thus, the modified Hausdorff distance becomes:

d_{MHd}(x, y) = \max(g(x, y), g(y, x))    (25)

9) Bhattacharyya Distance: The Bhattacharyya distance [24] is defined by the following formula:

d_{Bh}(x, y) = -\ln \sum_{i=1}^{n} \sqrt{x_i y_i}    (26)

The Bhattacharyya distance has a value between 0 and 1.
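A short sketch of how the cosine (15), Jaccard (16) and Dice (18) coefficients and the modified Hausdorff distance (24)-(25) could be implemented follows. Again this is our illustration, not the paper's code; the choice of Euclidean distance as the base distance $d_1$ and the example point sets are assumptions.

```python
import math

def cosine_sim(x, y):                 # eq. (15): normalized inner product
    ip = sum(a * b for a, b in zip(x, y))
    return ip / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))

def jaccard_sim(x, y):                # eq. (16)
    ip = sum(a * b for a, b in zip(x, y))
    return ip / (sum(a * a for a in x) + sum(b * b for b in y) - ip)

def dice_sim(x, y):                   # eq. (18)
    ip = sum(a * b for a, b in zip(x, y))
    return 2 * ip / (sum(a * a for a in x) + sum(b * b for b in y))

def modified_hausdorff(X, Y, d1):     # eqs. (24) and (25)
    def g(A, B):                      # eq. (24): mean nearest-point distance
        return sum(min(d1(a, b) for b in B) for a in A) / len(A)
    return max(g(X, Y), g(Y, X))

# Point sets compared with Euclidean distance as the base distance d1.
X = [(0.0, 0.0), (1.0, 0.0)]
Y = [(0.0, 1.0), (1.0, 1.0), (2.0, 1.0)]
print(modified_hausdorff(X, Y, math.dist))   # ~1.138
```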