Distance Measures

CHAPTER 6

Background

The first step of most multivariate analyses is to calculate a matrix of distances or similarities among a set of items in a multidimensional space. This is analogous to constructing the triangular "mileage chart" provided with many road maps. But in our case, we need to build a matrix of distances in hyperspace, rather than the two-dimensional map space. Fortunately, it is just as easy to calculate distances in a multidimensional space as it is in a two-dimensional space.

This first step is extremely important. If information is ignored in this step, then it cannot be expressed in the results. Likewise, if noise or outliers are exaggerated by the distance measure, then these unwanted features of our data will have undue influence on the results, perhaps obscuring meaningful patterns.

Distance concepts

Distance measures are flexible:

• Resemblance can be measured either as a distance (dissimilarity) or a similarity.
• Most distance measures can readily be converted into similarities and vice versa.
• All of the distance measures described below can be applied to either binary (presence-absence) or quantitative data.
• One can calculate distances among either the rows of your data matrix or the columns of your data matrix. With community data this means you can calculate distances among your sample units (SUs) in species space or among your species in sample space.

Figure 6.1 shows two species as points in sample space, corresponding to the tiny data set below (Table 6.1). We can also represent sample units as points in species space, as on the right side of Figure 6.1, using the same data set.

There are many distance measures. A selection of the most commonly used and most effective measures is described below. It is important to know the domain of acceptable data values for each distance measure (Table 6.2). Many distance measures are not compatible with negative numbers. Other distance measures assume that the data are proportions ranging between zero and one, inclusive.

Table 6.1. Example data set. Abundance of two species in two sample units.

| Sample unit | Species 1 | Species 2 |
|---|---|---|
| A | 1 | 4 |
| B | 5 | 2 |

Figure 6.1. Graphical representation of the data set in Table 6.1. The left-hand graph shows species as points in sample space. The right-hand graph shows sample units as points in species space.

Table 6.2. Reasonable and acceptable domains of input data, x, and ranges of distance measures, d = f(x).

| Name (synonyms) | Domain of x | Range of d = f(x) | Comments |
|---|---|---|---|
| Sørensen (Bray & Curtis; Czekanowski) | x ≥ 0 | 0 ≤ d ≤ 1 (or 0 ≤ d ≤ 100%) | proportion coefficient in city-block space; semimetric |
| Relative Sørensen (Kulczyński; Quantitative Symmetric) | x ≥ 0 | 0 ≤ d ≤ 1 (or 0 ≤ d ≤ 100%) | proportion coefficient in city-block space; same as Sørensen but data points relativized by sample unit totals; semimetric |
| Jaccard | x ≥ 0 | 0 ≤ d ≤ 1 (or 0 ≤ d ≤ 100%) | proportion coefficient in city-block space; metric |
| Euclidean (Pythagorean) | all | non-negative | metric |
| Relative Euclidean (Chord distance, standardized Euclidean) | all | 0 ≤ d ≤ √2 for quarter hypersphere; 0 ≤ d ≤ 2 for full hypersphere | Euclidean distance between points on unit hypersphere; metric |
| Correlation distance | all | 0 ≤ d ≤ 1 | converted from correlation to distance; proportional to arc distance between points on unit hypersphere; cosine of angle from centroid to points; metric |
| Chi-square | x ≥ 0 | d ≥ 0 | Euclidean but doubly weighted by variable and sample unit totals; metric |
| Squared Euclidean | all | d ≥ 0 | metric |
| Mahalanobis | all | d ≥ 0 | distance between groups weighted by within-group dispersion; metric |

Distance measures can be categorized as metric, semimetric, or nonmetric. A metric distance measure must satisfy the following rules:

1. The minimum value is zero when two items are identical.
2. When two items differ, the distance is positive (negative distances are not allowed).
3. Symmetry: the distance from object A to object B is the same as the distance from B to A.
4. Triangle inequality axiom: With three objects, the distance between two of these objects cannot be larger than the sum of the two other distances.

Semimetrics can violate the triangle inequality axiom. Examples include the Sørensen (Bray-Curtis) and Kulczyński distances. Semimetrics are extremely useful in community ecology but obey a non-Euclidean geometry. Nonmetrics violate one or more of the other rules and are seldom used in ecology.

Distance measures

The equations use the following conventions: Our data matrix A has q rows, which are sample units, and p columns, which are species. Each element of the matrix, $a_{i,j}$, is the abundance of species j in sample unit i. Most of the following distance measures can also be used on binary data (1 or 0 for presence or absence). In each of the following equations, we are calculating the distance between sample units i and h.

Euclidean distance

$$ED_{i,h} = \sqrt{\sum_{j=1}^{p} (a_{i,j} - a_{h,j})^2}$$

This formula is simply the Pythagorean theorem applied to p dimensions rather than the usual two dimensions (Fig. 6.2).

City-block distance (= Manhattan distance)

$$CB_{i,h} = \sum_{j=1}^{p} |a_{i,j} - a_{h,j}|$$

In city-block space you can only move along one dimension of the space at a time (Fig. 6.2). By analogy, in a city of rectangular blocks, you cannot cut diagonally through a block, but must walk along either of the two dimensions of the block. In the mathematical space, size of the blocks does not affect distances in the space. Note also that many equal-length paths exist between two points in city-block space.

Figure 6.2. Geometric representations of basic distance measures between two sample units (A and B) in species space. In the upper two graphs the axes meet at the origin; in the lowest graph, at the centroid.

Euclidean distance and city-block distance are special cases (different values of k) of the Minkowski metric. In two dimensions:

$$\text{Distance} = \sqrt[k]{x^k + y^k}$$

where x and y are distances in each of two dimensions. Generalizing this to p dimensions, and using the form of the equation for ED:

$$\text{Distance}_{i,h} = \sqrt[k]{\sum_{j=1}^{p} |a_{i,j} - a_{h,j}|^k}$$

Note that k = 1 gives city-block distance, k = 2 gives Euclidean distance. As k increases, increasing emphasis is given to large differences in individual dimensions.

Correlation

The correlation coefficient (r) is cosine α (third panel in Fig. 6.2) where the origin of the coordinate system is the mean species composition of a sample unit in species space (the "centroid"; see Fig. 6.2). If the data have not been transformed and the origin is at (0,0), then this is a noncentered correlation coefficient.

For example, if two sample units lie at 180° from each other relative to the centroid, then r = -1 = cos(180°). Two sample units lying on the same radius from the centroid have r = 1 = cos(0°). If two sample units form a right angle from the centroid then r = 0 = cos(90°).

The correlation coefficient can be rescaled to a distance measure of range 0-1 by

$$r_{\text{distance}} = (1 - r)/2$$

Proportion coefficients

Proportion coefficients are city-block distance measures expressed as proportions of the maximum distance possible. The Sørensen, Jaccard, and QSK coefficients described below are all proportion coefficients. One can represent proportion coefficients as the overlap between the area under curves. This is easiest to visualize with two curves of species abundance on an environmental gradient (Fig. 6.3). If A is the area under one curve, B is the area under the other, and w is the overlap (intersection) of the two areas, then the Sørensen coefficient is 2w/(A + B). The Jaccard coefficient is w/(A + B - w).

Written in set notation:

$$\text{Sørensen similarity} = \frac{2(A \cap B)}{(A \cup B) + (A \cap B)}$$

$$\text{Jaccard similarity} = \frac{A \cap B}{A \cup B}$$

Proportion coefficients as distance measures are foreign to classical statistics, which is based on squared Euclidean distances. Ecologists latched onto proportion coefficients for their simple, intuitive appeal despite their falling outside of mainstream statistics. Nevertheless, Roberts (1986) showed how proportion coefficients can be derived from the mathematics of fuzzy sets, an increasingly important branch of mathematics. For example, when applied to quantitative data, Sørensen similarity is the intersection between two fuzzy sets.

Figure 6.3. Overlap between two species abundances along an environmental gradient. The abundance shared between species A and B is shown by w.

Sørensen distance = BC

Sørensen similarity (also known as "BC" for Bray-Curtis coefficient) is thus shared abundance divided by total abundance. It is frequently known as "2w/(A+B)" for short. This logical-seeming measure was also proposed by Czekanowski (1913).

Written as a dissimilarity (distance) between sample units i and h:

$$D_{i,h} = \frac{\sum_{j=1}^{p} |a_{i,j} - a_{h,j}|}{\sum_{j=1}^{p} a_{i,j} + \sum_{j=1}^{p} a_{h,j}}$$

Another way of writing this, where MIN is the smaller of two values, is:

$$D_{i,h} = 1 - \frac{2 \sum_{j=1}^{p} \text{MIN}(a_{i,j}, a_{h,j})}{\sum_{j=1}^{p} a_{i,j} + \sum_{j=1}^{p} a_{h,j}}$$

One can convert this dissimilarity (or any of the following proportion coefficients) to a percentage dissimilarity (PD):

$$PD_{i,h} = 100 \, D_{i,h}$$

Jaccard dissimilarity is the proportion of the combined abundance that is not shared, or 1 - w/(A + B - w) (Jaccard 1901):

$$JD_{i,h} = \frac{2 \sum_{j=1}^{p} |a_{i,j} - a_{h,j}|}{\sum_{j=1}^{p} a_{i,j} + \sum_{j=1}^{p} a_{h,j} + \sum_{j=1}^{p} |a_{i,j} - a_{h,j}|}$$

Quantitative symmetric dissimilarity is also known as the Kulczyński or QSK coefficient (see Faith et al. 1987).
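As a rough check on the equations in this chapter, the following sketch computes several of the distance measures between sample units A and B of Table 6.1. It is a minimal illustration in plain Python, not code from the chapter: the function names (`minkowski`, `sorensen`, `jaccard`, and so on) are hypothetical, and the functions assume two equal-length abundance vectors that are non-negative, not both all zero, and (for the correlation distance) not constant across species.

```python
import math

def minkowski(a, b, k):
    """Minkowski distance of order k between two abundance vectors."""
    return sum(abs(x - y) ** k for x, y in zip(a, b)) ** (1.0 / k)

def city_block(a, b):
    """City-block (Manhattan) distance: the Minkowski metric with k = 1."""
    return minkowski(a, b, 1)

def euclidean(a, b):
    """Euclidean distance: the Minkowski metric with k = 2."""
    return minkowski(a, b, 2)

def sorensen(a, b):
    """Sorensen (Bray-Curtis) dissimilarity: city-block distance / total abundance."""
    return city_block(a, b) / (sum(a) + sum(b))

def sorensen_min(a, b):
    """Same dissimilarity written with MIN: 1 - 2w / (A + B)."""
    shared = sum(min(x, y) for x, y in zip(a, b))  # w, the shared abundance
    return 1.0 - 2.0 * shared / (sum(a) + sum(b))

def jaccard(a, b):
    """Jaccard dissimilarity: proportion of combined abundance not shared."""
    cb = city_block(a, b)
    return 2.0 * cb / (sum(a) + sum(b) + cb)

def correlation_distance(a, b):
    """Correlation coefficient r rescaled to the range 0-1 as (1 - r) / 2."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    dev_a = [x - mean_a for x in a]
    dev_b = [y - mean_b for y in b]
    cross = sum(x * y for x, y in zip(dev_a, dev_b))
    ss_a = sum(x * x for x in dev_a)
    ss_b = sum(y * y for y in dev_b)
    r = cross / math.sqrt(ss_a * ss_b)  # fails if either vector is constant
    return (1.0 - r) / 2.0

# Table 6.1: abundances of species 1 and 2 in sample units A and B.
su_a = [1, 4]
su_b = [5, 2]

print(city_block(su_a, su_b))                       # 6.0
print(round(euclidean(su_a, su_b), 4))              # 4.4721
print(sorensen(su_a, su_b))                         # 0.5
print(round(jaccard(su_a, su_b), 4))                # 0.6667
print(round(correlation_distance(su_a, su_b), 4))   # 1.0
print(100 * sorensen(su_a, su_b))                   # percentage dissimilarity: 50.0
```

For this data set the shared abundance is w = 1 + 2 = 3 and the totals are A = 5 and B = 7, so the Sørensen dissimilarity is 1 - 2(3)/12 = 0.5 and the Jaccard dissimilarity is 1 - 3/9 ≈ 0.667. With only two species, the correlation between the two sample units is necessarily +1 or -1, so the correlation distance of 1 here reflects the tiny data set rather than a typical result.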
