Measures of Dispersion for Multidimensional Data
European Journal of Operational Research 251 (2016) 930–937

Computational Intelligence and Information Management

Adam Kołacz a, Przemysław Grzegorzewski a,b,∗

a Faculty of Mathematics and Computer Science, Warsaw University of Technology, Koszykowa 75, Warsaw 00–662, Poland
b Systems Research Institute, Polish Academy of Sciences, Newelska 6, Warsaw 01–447, Poland
∗ Corresponding author at: Systems Research Institute, Polish Academy of Sciences, Newelska 6, 01–447 Warsaw, Poland. Tel.: +48 223810207; fax: +48 223810105.

Article history: Received 22 February 2015; Accepted 4 January 2016; Available online 11 January 2016

Keywords: Descriptive statistics; Dispersion; Interquartile range; Multidistance; Spread

http://dx.doi.org/10.1016/j.ejor.2016.01.011
0377-2217/© 2016 Elsevier B.V. All rights reserved.

Abstract

We propose an axiomatic definition of a dispersion measure that can be applied to any finite sample of k-dimensional real observations. Next we introduce a taxonomy of dispersion measures based on the possible behavior of these measures with respect to new upcoming observations. This way we get two classes: unstable and absorptive dispersion measures. We examine their properties and illustrate them by examples. We also consider a relationship between multidimensional dispersion measures and multidistances. Moreover, we examine new interesting properties of some well-known dispersion measures for one-dimensional data, like the interquartile range and the sample variance.

1. Introduction

Various summary statistics are applied wherever decisions are based on sample data. The main goal of those characteristics is to deliver synthetic information on the basic features of the data set under study. It seems that the most commonly used summary statistics are central tendency measures (like the mean, median, mode, etc.), which indicate the typical behavior of the examined variable. However, no measure of central tendency can reveal the whole picture of a variable. Indeed, two or more samples may have the same mean (or other central tendency) although they differ significantly. Therefore, besides central tendency, the dispersion of observations in a sample is also of interest. Moreover, in many cases we have to monitor variability as carefully as the location parameters. As a typical example let us consider Statistical Process Control, where the absence of an alarm signal on the $\bar{X}$-chart cannot be automatically interpreted as the process being under control until the S-chart (or R-chart) confirms that there is no alarm caused by an increase of variability.

Many tools have been proposed to characterize dispersion, like the range, interquartile range, sample variance, standard deviation and so on. They differ in construction, properties and the situations in which they are intended to be used. It is also worth mentioning that several terms are used in the literature for dispersion measures, like measures of variability, scatter, spread or scale. Some authors reserve the notion of a dispersion measure only for those cases when variability is considered relative to a given fixed point (like the sample variance, which averages the squared deviations of the data points from their mean) and then use the term spread as a more general one (see Bickel & Lehmann (1976, 1979); Wilcox (2005)). However, such a distinction in terminology is neither consistent nor commonly accepted. Thus in our paper we do not attach importance to such distinctions.

Some of the considered tools measure the absolute spread (like those mentioned before), while others indicate the relative scatter (e.g. the coefficient of variation or the Gini coefficient). Most of them are dedicated to quantitative data (ratio scale), but one can also find a few that might be used to characterize qualitative observations (nominal scale).

What is interesting is that almost all well-known measures of dispersion can be used only for one-dimensional data. This is rather inconvenient, especially since most contemporary data sets available and processed in practice are multidimensional. Of course, having such a multidimensional data set one may apply univariate dispersion measures to each variable separately, but this way we lose information on possible relations between variables. Then, as a possible remedy, one may consider e.g. a covariance matrix, which delivers both the variances of all single variables and the covariances for all pairs of variables. Hence, having a data set of k-dimensional observations we get a matrix of $k^2$ numbers instead of a single real value of a desired measure of dispersion characterizing somehow the whole multidimensional sample.

Keeping in mind all the remarks mentioned above, we propose a general definition of a dispersion measure that can be applied to any finite sample of k-dimensional real observations, i.e. $x_1,\ldots,x_n \in \mathbb{R}^k$. Next we examine basic properties of the measures so defined and illustrate them by examples. We also consider the relationship between multidimensional dispersion measures and multidistances introduced by Martín and Mayor (2009, 2011).

Recently, Gagolewski (2015) considered dispersion measures from the aggregation theory point of view. He showed that although aggregation theory mainly focuses on central tendency measures (see Beliakov, Pradera, & Calvo (2007); Calvo, Mayor, and Mesiar (2002); Grabisch, Marichal, Mesiar, and Pap (2009)), it may deliver an interesting insight into measures of spread of one-dimensional quantitative data. In our case we show that some considerations on general multidimensional dispersion measures may also lead to some interesting conclusions for one-dimensional data sets.

The paper is organized as follows: In Section 2 we present the desired requirements each measure of dispersion should satisfy. Next, we distinguish two basic types of dispersion measures: unstable and absorptive dispersion measures (Section 3 and Section 4, respectively). Section 5 is devoted to some interesting properties of the interquartile range that appear in practice when we try to estimate it from data. In Section 6 we prove a theorem showing a relation between unstable and absorptive dispersion measures. Finally, in Section 7 we examine the relationship between dispersion measures and multidistances.

2. Dispersion measures

Consider a sample of n observations from the k-dimensional real space, i.e. $x_1,\ldots,x_n \in \mathbb{R}^k$. Descriptive statistics, also called summary statistics, provide various measures describing different aspects of the underlying data. Besides central tendency measures, the next group of the most useful summary statistics is formed by measures of dispersion. Although each person has some intuition about measures of dispersion, it seems that a formal definition would be desirable.

Definition 2.1. A function $\Delta : \bigcup_{n=1}^{\infty} (\mathbb{R}^k)^n \to [0, \infty)$ is called a measure of dispersion if $\Delta$ is not identically zero and satisfies the following axioms for any $x_1,\ldots,x_n \in \mathbb{R}^k$:

(A1) $\Delta(x,\ldots,x) = 0$,
(A2) $\Delta$ is symmetric, i.e. $\Delta(x_{\pi(1)},\ldots,x_{\pi(n)}) = \Delta(x_1,\ldots,x_n)$ for any permutation $\pi : \{1,\ldots,n\} \to \{1,\ldots,n\}$,
(A3) $\Delta$ is translation invariant, i.e. $\Delta(x_1 + a,\ldots,x_n + a) = \Delta(x_1,\ldots,x_n)$ for any $a \in \mathbb{R}^k$,
(A4) $\Delta$ is rotation invariant, i.e. …

Usually, when adding another observation to a data set under study, we expect the value of the dispersion measure to change, no matter where the new point is located. However, there also exists a class of measures for which adding new observations does not change the scatter of the data set (provided those observations belong to some area). To clarify the situation, in further sections we indicate two important subfamilies of dispersion measures.

3. Unstable dispersion measures

Definition 3.1. A measure of dispersion $\Delta : \bigcup_{n=1}^{\infty} (\mathbb{R}^k)^n \to [0, \infty)$ is called unstable if

$$\Delta(x_1,\ldots,x_n,x_{n+1}) \neq \Delta(x_1,\ldots,x_n) \qquad (1)$$

for almost all $x_{n+1} \in \mathbb{R}^k$.

In other words, for any unstable dispersion measure and any data set, the points whose addition to the data set does not change the value of the dispersion obtained for the initial data set form a set of k-dimensional Lebesgue measure zero. Let us now discuss some examples and basic properties of unstable dispersion measures.

Example 3.2. A one-dimensional sample, i.e. $x_1,\ldots,x_n \in \mathbb{R}$, provides many examples of well-known unstable dispersion measures, like the different sample variances $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$ and $s_b^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2$, or the corresponding sample standard deviations.

Example 3.3. Having a sample $x_1,\ldots,x_n \in \mathbb{R}^k$, let us define the following function

$$G_e(x_1,\ldots,x_n) = \sum_{i=1}^{n}\sum_{j=1}^{n} d_e^2(x_i, x_j), \qquad (2)$$

where $d_e(x_i, x_j)$ denotes the Euclidean distance in $\mathbb{R}^k$. It can be shown that (2) is an unstable dispersion measure. To prove it, let us first set $A_m = \sum_{i=1}^{n}(x_i^m - \bar{x}^m)^2$, where $x_i^m$ denotes the $m$th component of $x_i$ and $\bar{x}^m = \frac{1}{n}\sum_{i=1}^{n} x_i^m$. Then for any $j$ we get

$$A_m = \sum_{i=1}^{n}\left(x_i^m - x_j^m + x_j^m - \bar{x}^m\right)^2$$
$$= \sum_{i=1}^{n}\left(x_i^m - x_j^m\right)^2 + 2\sum_{i=1}^{n}\left(x_i^m - x_j^m\right)\left(x_j^m - \bar{x}^m\right) + n\left(x_j^m - \bar{x}^m\right)^2$$
$$= \sum_{i=1}^{n}\left(x_i^m - x_j^m\right)^2 - n\left(x_j^m - \bar{x}^m\right)^2,$$

where the last equality uses $\sum_{i=1}^{n}(x_i^m - x_j^m) = -n(x_j^m - \bar{x}^m)$. Summing up both sides over $j$ we get $nA_m = \sum_{i,j=1}^{n}(x_i^m - x_j^m)^2 - nA_m$, which implies that

$$A_m = \frac{1}{2n}\sum_{i=1}^{n}\sum_{j=1}^{n}\left(x_i^m - x_j^m\right)^2 = \frac{1}{n}\sum_{1 \le i < j \le n}\left(x_i^m - x_j^m\right)^2.$$