Submitted to the Annals of Statistics
THE DISTANCE STANDARD DEVIATION
By Dominic Edelmann†, Donald Richards, and Daniel Vogel
German Cancer Research Center, Pennsylvania State University, and University of Aberdeen
The distance standard deviation, which arises in distance correlation analysis of multivariate data, is studied as a measure of spread. The asymptotic distribution of the empirical distance standard deviation is derived under the assumption of finite second moments. Applications are provided to hypothesis testing on a data set from materials science and to multivariate statistical quality control. The distance standard deviation is compared to classical scale measures for inference on the spread of heavy-tailed distributions. Inequalities for the distance variance are derived, proving that the distance standard deviation is bounded above by the classical standard deviation and by Gini's mean difference. New expressions for the distance standard deviation are obtained in terms of Gini's mean difference and the moments of spacings of order statistics. It is also shown that the distance standard deviation satisfies the axiomatic properties of a measure of spread.
1. Introduction. In recent years, the topic of distance correlation has been prominent in statistical analyses of dependence between multivariate data sets. The concept of distance correlation was defined by Székely, Rizzo, and Bakirov [17] and Székely and Rizzo [41], and they applied distance correlation methods to testing independence and measuring association between collections of random vectors. Since the appearance of the papers [17, 41], enormous interest in the theory and applications of distance correlation has arisen. We refer to the articles [36, 18, 43] on statistical inference, [15, 16, 24, 52] on time series, [2, 11, 3] on affinely invariant distance correlation and connections with singular integrals, [28] on metric spaces, and [37] on machine learning. Distance correlation methods have also been applied to assessing associations between familial relationships, lifestyle factors, diseases, and mortality [27], and to detecting associations in large astrophysical databases [29, 35].
†Corresponding author.
MSC 2010 subject classifications: Primary 60E15, 62H20; secondary 60E05, 60E10.
Keywords and phrases: asymptotic efficiency, characteristic function, dispersive ordering, distance correlation coefficient, distance variance, Gini's mean difference, measure of spread, order statistic, sample spacing, stochastic ordering, U-statistic.
For $z \in \mathbb{C}$, denote by $|z|$ the modulus of $z$. For a positive integer $p$ and $s, x \in \mathbb{R}^p$, denote by $\langle s, x \rangle$ the Euclidean inner product on $\mathbb{R}^p$ and by $\|s\| = \langle s, s \rangle^{1/2}$ the corresponding Euclidean norm. We also define the constant
$$c_p = \frac{\pi^{(p+1)/2}}{\Gamma\big((p+1)/2\big)}.$$
For random vectors $X \in \mathbb{R}^p$ and $Y \in \mathbb{R}^q$, let
$$f_{X,Y}(s,t) = E \exp\big\{\sqrt{-1}\,(\langle s, X \rangle + \langle t, Y \rangle)\big\},$$
where $s \in \mathbb{R}^p$, $t \in \mathbb{R}^q$, be the joint characteristic function of $(X,Y)$, and let $f_X(s) = f_{X,Y}(s,0)$ and $f_Y(t) = f_{X,Y}(0,t)$ be the corresponding marginal characteristic functions. The distance covariance between $X$ and $Y$ is defined as the nonnegative square root of
$$\mathcal{V}^2(X,Y) = \frac{1}{c_p c_q} \int_{\mathbb{R}^{p+q}} \big|f_{X,Y}(s,t) - f_X(s)\,f_Y(t)\big|^2 \, \frac{ds \, dt}{\|s\|^{p+1} \|t\|^{q+1}}, \tag{1.1}$$
the distance variance is defined as
$$\mathcal{V}^2(X) = \mathcal{V}^2(X,X) = \frac{1}{c_p^2} \int_{\mathbb{R}^{2p}} \big|f_X(s+t) - f_X(s)\,f_X(t)\big|^2 \, \frac{ds \, dt}{\|s\|^{p+1} \|t\|^{p+1}}, \tag{1.2}$$
and the distance standard deviation, $\mathcal{V}(X)$, is defined as the nonnegative square root of $\mathcal{V}^2(X)$. (We note that this terminology differs from that of Székely, et al. [41, 17], who refer to $\mathcal{V}(X)$ as the distance variance; we will refer to $\mathcal{V}(X)$ instead as the distance standard deviation, which is justified by the fact that $\mathcal{V}(X)$ satisfies an equivariance property that is given below in (1.4).) Also, the distance correlation coefficient is defined as
$$\mathcal{R}(X,Y) = \frac{\mathcal{V}(X,Y)}{\sqrt{\mathcal{V}(X)\,\mathcal{V}(Y)}} \tag{1.3}$$
as long as $\mathcal{V}(X), \mathcal{V}(Y) \neq 0$, and zero otherwise. We remark that the weighted $L^2$-norm in (1.1) was studied in the univariate setting by Feuerverger [14].

The distance correlation coefficient, unlike the Pearson correlation coefficient, characterizes independence: $\mathcal{R}(X,Y) = 0$ if and only if $X$ and $Y$ are mutually independent. Moreover, $0 \leq \mathcal{R}(X,Y) \leq 1$ and, for one-dimensional random variables $X, Y \in \mathbb{R}$, we have $\mathcal{R}(X,Y) = 1$ if and only if $Y$ is a linear function of $X$, almost surely. The empirical distance correlation possesses a remarkably simple expression [17, Theorem 1], and efficient algorithms for computing it are now available [10].
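As an aside for practitioners, the simple empirical expression mentioned above translates into a few lines of NumPy. The following sketch (the function names are ours) computes the plug-in empirical distance correlation via double-centered pairwise distance matrices, in the spirit of [17, Theorem 1]; it uses $O(n^2)$ memory for clarity, whereas [10] gives much faster algorithms.

```python
import numpy as np

def _dcenter(x):
    """Double-centered pairwise Euclidean distance matrix of a sample."""
    x = np.asarray(x, dtype=float)
    if x.ndim == 1:
        x = x[:, None]                      # treat a vector as an (n, 1) sample
    d = np.sqrt(((x[:, None, :] - x[None, :, :]) ** 2).sum(axis=2))
    return d - d.mean(axis=0) - d.mean(axis=1)[:, None] + d.mean()

def distance_correlation(x, y):
    """Plug-in empirical distance correlation of two samples of equal length."""
    a, b = _dcenter(x), _dcenter(y)
    v2_xy = (a * b).mean()                  # empirical V^2(X, Y)
    v2_x, v2_y = (a * a).mean(), (b * b).mean()
    denom = np.sqrt(v2_x * v2_y)
    return np.sqrt(max(v2_xy, 0.0) / denom) if denom > 0 else 0.0

# Example: dependence is detected even though Y is a nonlinear function of X,
# a case in which the Pearson correlation is close to zero.
rng = np.random.default_rng(0)
x = rng.standard_normal(500)
print(distance_correlation(x, x ** 2))
```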
We note that $\mathcal{R}(X,Y)$ is one of several coefficients characterizing independence that are applicable to hypothesis testing. Other concepts of dependence include, e.g., the Hilbert-Schmidt Independence Criterion (HSIC) [19], ball covariance [31], and mutual information [5]. Each of these concepts satisfies numerous desirable properties, and the comparison of their properties and finite-sample performance is an active area of research [34, 37, 38].

An interesting property of the distance covariance is that its square is translation-invariant and scale-equivariant, which implies that the distance standard deviation satisfies
$$\mathcal{V}(a + bX) = |b|\, \mathcal{V}(X), \tag{1.4}$$
for all $a, b \in \mathbb{R}$ [17, Theorem 4]. Moreover, $\mathcal{V}(X)$ is defined for all random variables $X$ with finite first moments, whereas the classical standard deviation requires the existence of finite second moments. These properties suggest that the distance standard deviation is a potentially interesting measure of scale for heavy-tailed distributions. As the term $\mathcal{V}(X)\,\mathcal{V}(Y)$ appears in the denominator of $\mathcal{R}(X,Y)$, a study of properties of the distance standard deviation may lead to a better understanding of the distance correlation.

In this paper, we study the distance standard deviation $\mathcal{V}(X)$ and provide applications to hypothesis testing and multivariate statistical quality control. We apply the distance standard deviation to a data set, originating from materials science, on a physical model for describing a crystal undergoing a structural phase transition when subjected to several cooling-heating cycles. In a different direction, we further show how $\mathcal{V}(X)$ can be applied in the statistical quality control of multivariate production processes.

We will also compare $\mathcal{V}(X)$ to other measures of spread. Indeed, suppose that $E(\|X\|^2) < \infty$, and let $X$, $X'$, and $X''$ be independent and identically distributed (i.i.d.); then, by Székely, et al. [17, Remark 3],
$$\mathcal{V}^2(X) = E(\|X - X'\|^2) + (E\|X - X'\|)^2 - 2E(\|X - X'\| \cdot \|X - X''\|). \tag{1.5}$$
The second term on the right-hand side of (1.5) is reminiscent of Gini's mean difference [4, 22], which is defined for real-valued random variables $Y$ as
$$\Delta(Y) = E|Y - Y'|, \tag{1.6}$$
where $Y$ and $Y'$ are i.i.d. Furthermore, if $X \in \mathbb{R}$ then one-half the first summand in (1.5) equals $\sigma^2(X) = E(X^2) - (E(X))^2$, the variance of $X$.

We provide a detailed comparison of $\mathcal{V}(X)$, $\Delta(X)$, and $\sigma(X)$. We demonstrate that when the distributions of interest are heavy-tailed, $\mathcal{V}(X)$ provides estimators of scale that are asymptotically more efficient than estimators based on $\Delta(X)$ or $\sigma(X)$. Moreover, several inequalities between $\mathcal{V}(X)$, $\Delta(X)$, and $\sigma(X)$ are derived.
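To illustrate (1.5) numerically, the following Monte Carlo sketch approximates the three summands for a univariate standard normal $X$; the choice of distribution, sample size, and seed is ours and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 100_000

# Three i.i.d. copies of X (standard normal, for illustration only).
x, x1, x2 = rng.standard_normal((3, m))

t1 = np.mean((x - x1) ** 2)                    # E|X - X'|^2
t2 = np.mean(np.abs(x - x1)) ** 2              # (E|X - X'|)^2
t3 = np.mean(np.abs(x - x1) * np.abs(x - x2))  # E(|X - X'| |X - X''|)

print(np.sqrt(t1 + t2 - 2.0 * t3))             # Monte Carlo estimate of V(X)
```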
We further show that the distance standard deviation is an axiomatic measure of spread in the sense of Bickel and Lehmann [7]. According to [7], a measure of spread is a functional $\tau(X)$ satisfying the axioms:
(C1) $\tau(X) \geq 0$;
(C2) $\tau(a + bX) = |b|\, \tau(X)$ for all $a, b \in \mathbb{R}$; and
(C3) $\tau(X) \leq \tau(Y)$ if, for all $0 < \alpha \leq \beta < 1$, $F^{-1}(\beta) - F^{-1}(\alpha) \leq G^{-1}(\beta) - G^{-1}(\alpha)$,
where $F$ and $G$ are the cumulative distribution functions of $X$ and $Y$, respectively, and $F^{-1}$ and $G^{-1}$ are the corresponding right-continuous inverses. The distance standard deviation obviously satisfies (C1) and (C2). We will show that $\mathcal{V}(X)$ also satisfies (C3), hence proving that $\mathcal{V}(X)$ is a measure of spread in the above sense. However, we will also establish some clear differences between $\mathcal{V}(X)$, on the one hand, and $\Delta(X)$ and $\sigma(X)$, on the other hand.

The paper is organized as follows. In Section 2, the asymptotic distribution of the empirical distance standard deviation under the existence of the second moment of $X$ is derived. The asymptotic relative efficiency (ARE) of the empirical distance standard deviation with respect to competing estimators of spread is evaluated for various distributions. In Section 3, we apply the empirical distance standard deviation to perform two-sample hypothesis testing for a data set from materials science, and we also show the applicability of the empirical distance standard deviation in multivariate statistical quality control. Further, we demonstrate the superior performance of tests based on the distance standard deviation when the underlying distributions are heavy-tailed. In Section 4, we derive inequalities between the summands in the distance variance representation (1.5). We will prove, in the case of scalar random variables, that $\mathcal{V}(X)$ is bounded above by $\Delta(X)$ and by $\sigma(X)$. In Section 5, we show that the representation (1.5) can be simplified further, revealing relationships between $\mathcal{V}(X)$ and the moments of spacings of order statistics. Using these novel representations, we show that $\mathcal{V}(X)$ is a measure of spread in the sense of [7]; moreover, we identify crucial differences between $\mathcal{V}(X)$, $\Delta(X)$, and $\sigma(X)$. We conclude the paper in Section 6 with a discussion of the given results. All proofs are provided in the supplementary material.
2. The empirical distance standard deviation. In order to develop an empirical version of $\mathcal{V}^2(X)$, Székely, et al. [17, 41] derived an alternative representation of $\mathcal{V}^2(X)$; they showed that if the random vector $X \in \mathbb{R}^p$ satisfies $E\|X\|^2 < \infty$ and if $X$, $X'$, and $X''$ are i.i.d., then
$$\mathcal{V}^2(X) = T_1(X) + T_2(X) - 2\,T_3(X), \tag{2.1}$$
where
$$T_1(X) = E(\|X - X'\|^2), \qquad T_2(X) = (E\|X - X'\|)^2, \tag{2.2}$$
and
$$T_3(X) = E\big( \|X - X'\| \cdot \|X - X''\| \big). \tag{2.3}$$
For an i.i.d. sample $\mathbf{X} = (X_1, \ldots, X_n)$ drawn from $X$, the empirical version of $\mathcal{V}^2(X)$ was given in [17] as
$$\mathcal{V}_n^2(\mathbf{X}) = T_{1,n}(\mathbf{X}) + T_{2,n}(\mathbf{X}) - 2\,T_{3,n}(\mathbf{X}), \tag{2.4}$$
where
$$T_{1,n}(\mathbf{X}) = \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} \|X_i - X_j\|^2, \qquad T_{2,n}(\mathbf{X}) = \bigg( \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} \|X_i - X_j\| \bigg)^2, \tag{2.5}$$
and
$$T_{3,n}(\mathbf{X}) = \frac{1}{n^3} \sum_{i=1}^{n} \sum_{j=1}^{n} \sum_{k=1}^{n} \|X_i - X_j\| \cdot \|X_i - X_k\|. \tag{2.6}$$
The version (2.4) is not unbiased; an unbiased estimator for $\mathcal{V}^2(X)$ was derived in [10], viz.,
$$\widehat{\mathcal{V}}_n^2(\mathbf{X}) = \frac{n}{n-3}\, T_{1,n}(\mathbf{X}) + \frac{n^3}{(n-1)(n-2)(n-3)}\, T_{2,n}(\mathbf{X}) - \frac{2n^2}{(n-2)(n-3)}\, T_{3,n}(\mathbf{X}). \tag{2.7}$$
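The estimators (2.4)–(2.7) are straightforward to implement. Below is a minimal NumPy sketch (the function names are ours) that computes $T_{1,n}$, $T_{2,n}$, $T_{3,n}$ from the pairwise distance matrix and assembles both the plug-in and the unbiased estimator; it makes no attempt at the fast algorithms of [10].

```python
import numpy as np

def _t_stats(x):
    """Return n and the statistics T_{1,n}, T_{2,n}, T_{3,n} of (2.5)-(2.6)."""
    x = np.asarray(x, dtype=float)
    if x.ndim == 1:
        x = x[:, None]
    n = len(x)
    # Pairwise Euclidean distances d[i, j] = ||X_i - X_j||; O(n^2) memory.
    d = np.sqrt(((x[:, None, :] - x[None, :, :]) ** 2).sum(axis=2))
    t1 = (d ** 2).sum() / n ** 2
    t2 = (d.sum() / n ** 2) ** 2
    t3 = (d.sum(axis=1) ** 2).sum() / n ** 3   # sum_i (sum_j d_ij)(sum_k d_ik) / n^3
    return n, t1, t2, t3

def vn_sq(x):
    """Plug-in estimator V_n^2 of (2.4)."""
    _, t1, t2, t3 = _t_stats(x)
    return t1 + t2 - 2.0 * t3

def vn_sq_unbiased(x):
    """Unbiased estimator of V^2(X) as in (2.7); requires n >= 4."""
    n, t1, t2, t3 = _t_stats(x)
    return (n / (n - 3) * t1
            + n ** 3 / ((n - 1) * (n - 2) * (n - 3)) * t2
            - 2.0 * n ** 2 / ((n - 2) * (n - 3)) * t3)
```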
By [10], $\widehat{\mathcal{V}}_n^2(\mathbf{X})$ is a U-statistic of order four with kernel function
$$h(X_1, X_2, X_3, X_4) = \frac{1}{4} \sum_{\substack{1 \leq i,j \leq 4 \\ i \neq j}} \|X_i - X_j\|^2 - \frac{1}{4} \sum_{i=1}^{4} \bigg( \sum_{\substack{j=1 \\ j \neq i}}^{4} \|X_i - X_j\| \bigg)^2 + \frac{1}{24} \bigg( \sum_{\substack{1 \leq i,j \leq 4 \\ i \neq j}} \|X_i - X_j\| \bigg)^2. \tag{2.8}$$
In the sequel, we derive the asymptotic distributions of $\widehat{\mathcal{V}}_n^2(\mathbf{X})$ and $\mathcal{V}_n^2(\mathbf{X})$; further, we do so under conditions weaker than known previously. Hitherto, the asymptotic normality of $\widehat{\mathcal{V}}_n^2(\mathbf{X})$ was proved only under the assumption that the fourth moment of $X$ is finite; see [9, Lemma 4.8 and Theorem 4.11]. Here, we derive the asymptotic normality under the weaker assumption that the second moment of $X$ is finite. The following lemma provides an alternative representation for the kernel function $h(\cdot)$ in (2.8).
Lemma 2.1. The kernel function h in (2.8) can be written as
$$h(X_1, X_2, X_3, X_4) = \frac{1}{12} \sum_{\substack{1 \leq i,j \leq 4 \\ i \neq j}} \|X_i - X_j\|^2 - \frac{1}{12} \sum_{\substack{1 \leq i,j,k \leq 4 \\ i,j,k \text{ all different}}} \|X_i - X_j\| \, \|X_i - X_k\| + \frac{1}{24} \sum_{\substack{1 \leq i,j,k,l \leq 4 \\ i,j,k,l \text{ all different}}} \|X_i - X_j\| \, \|X_k - X_l\|.$$
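The equality of the two kernel representations is easy to confirm numerically. The following short check (entirely ours) evaluates the kernel in both forms at four random points in $\mathbb{R}^3$:

```python
import itertools
import numpy as np

def h_direct(xs):
    """Kernel h of (2.8) at four points; xs has shape (4, p)."""
    d = np.sqrt(((xs[:, None, :] - xs[None, :, :]) ** 2).sum(axis=2))
    return ((d ** 2).sum() / 4.0
            - (d.sum(axis=1) ** 2).sum() / 4.0
            + d.sum() ** 2 / 24.0)

def h_lemma(xs):
    """The same kernel via the Lemma 2.1 decomposition."""
    d = np.sqrt(((xs[:, None, :] - xs[None, :, :]) ** 2).sum(axis=2))
    a = sum(d[i, j] ** 2 for i, j in itertools.permutations(range(4), 2)) / 12.0
    b = sum(d[i, j] * d[i, k]
            for i, j, k in itertools.permutations(range(4), 3)) / 12.0
    c = sum(d[i, j] * d[k, l]
            for i, j, k, l in itertools.permutations(range(4), 4)) / 24.0
    return a - b + c

rng = np.random.default_rng(1)
xs = rng.standard_normal((4, 3))
print(np.isclose(h_direct(xs), h_lemma(xs)))   # expected: True
```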
In applying Lemma 2.1 to establish the asymptotic normality of $\mathcal{V}_n^2(\mathbf{X})$ under the assumption of finite second moments of $X$, let
$$h_1(x) = E[h(x, X_2, X_3, X_4)] - \mathcal{V}^2(X)$$
be the linear part in the Hoeffding decomposition [19, Section 11.4] of the kernel $h$, and let
$$\gamma = 16\, E[h_1^2(X)]. \tag{2.9}$$
We remark that an expansion of $h_1(X)$ in our setting is given in [9, Eq. (B.6)]. Denote by $\mathbf{X}_{-k}$ the sample $\mathbf{X}$ with the $k$-th observation deleted. Assuming that $E[h^2(X_1, X_2, X_3, X_4)] < \infty$ (which is a consequence of $E(\|X\|^2) < \infty$; cf. the proof of Theorem 2.2), it follows from Arvesen [3, Theorem 9] that the jackknife estimator
$$\widehat{\gamma}(\mathbf{X}) = (n-1) \sum_{i=1}^{n} \bigg( \widehat{\mathcal{V}}_{n-1}^2(\mathbf{X}_{-i}) - \frac{1}{n} \sum_{j=1}^{n} \widehat{\mathcal{V}}_{n-1}^2(\mathbf{X}_{-j}) \bigg)^2 \tag{2.10}$$
is a weakly consistent estimator of $\gamma$.
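A direct transcription of (2.10), reusing vn_sq_unbiased from the sketch above; the leave-one-out loop costs $O(n^3)$ in total, which is acceptable for moderate $n$.

```python
def jackknife_gamma(x):
    """Jackknife estimator (2.10) of the asymptotic variance gamma."""
    x = np.asarray(x, dtype=float)
    if x.ndim == 1:
        x = x[:, None]
    n = len(x)
    # Unbiased estimates on the n leave-one-out samples of size n - 1.
    loo = np.array([vn_sq_unbiased(np.delete(x, i, axis=0)) for i in range(n)])
    return (n - 1) * ((loo - loo.mean()) ** 2).sum()
```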
Theorem 2.2. Suppose that $E(\|X\|^2) < \infty$. Then, as $n \to \infty$,
$$\sqrt{n}\, \big( \widehat{\mathcal{V}}_n^2(\mathbf{X}) - \mathcal{V}^2(X) \big) \xrightarrow{d} N(0, \gamma) \tag{2.11}$$
and
$$\frac{\sqrt{n}\, \big( \widehat{\mathcal{V}}_n^2(\mathbf{X}) - \mathcal{V}^2(X) \big)}{\sqrt{\widehat{\gamma}(\mathbf{X})}} \xrightarrow{d} N(0, 1), \tag{2.12}$$
and the same result holds for $\mathcal{V}_n^2(\mathbf{X})$.
The asymptotic distribution of $\mathcal{V}_n(\mathbf{X})$, the empirical distance standard deviation, now follows from Theorem 2.2 by the delta method. A weakly consistent estimator for the variance of the asymptotic distribution of $\sqrt{n}\,(\widehat{\mathcal{V}}_n(\mathbf{X}) - \mathcal{V}(X))$ is obtained analogously from Arvesen [3, Theorem 9] and is given by
$$\widehat{\xi}(\mathbf{X}) = (n-1) \sum_{i=1}^{n} \bigg( \widehat{\mathcal{V}}_{n-1}(\mathbf{X}_{-i}) - \frac{1}{n} \sum_{j=1}^{n} \widehat{\mathcal{V}}_{n-1}(\mathbf{X}_{-j}) \bigg)^2. \tag{2.13}$$
Corollary 2.3. Suppose that $E(\|X\|^2) < \infty$ and $\mathcal{V}(X) > 0$. Then,
$$\sqrt{n}\, \big( \widehat{\mathcal{V}}_n(\mathbf{X}) - \mathcal{V}(X) \big) \xrightarrow{d} N\big( 0, \gamma / 4\mathcal{V}^2(X) \big)$$
and
$$\frac{\sqrt{n}\, \big( \widehat{\mathcal{V}}_n(\mathbf{X}) - \mathcal{V}(X) \big)}{\sqrt{\widehat{\xi}(\mathbf{X})}} \xrightarrow{d} N(0, 1),$$
and the same result holds for $\mathcal{V}_n(\mathbf{X})$.
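For the reader's convenience, the first limit is obtained from (2.11) by the delta method with $g(u) = \sqrt{u}$, which is differentiable at $u = \mathcal{V}^2(X) > 0$; the following display is our own summary of that standard step, not a statement from the original text:
$$\sqrt{n}\, \big( \widehat{\mathcal{V}}_n(\mathbf{X}) - \mathcal{V}(X) \big) = \sqrt{n}\, \Big( g\big( \widehat{\mathcal{V}}_n^2(\mathbf{X}) \big) - g\big( \mathcal{V}^2(X) \big) \Big) \xrightarrow{d} N\Big( 0, \gamma\, \big[ g'\big( \mathcal{V}^2(X) \big) \big]^2 \Big) = N\Big( 0, \frac{\gamma}{4\,\mathcal{V}^2(X)} \Big),$$
since $g'(u) = 1/(2\sqrt{u})$.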
We now consider the problem of estimating scale in a location-scale family of the form $X \overset{d}{=} \mu + \lambda Z$, with $\mu \in \mathbb{R}$ and $E|Z|^2 < \infty$, where $\overset{d}{=}$ denotes equality in distribution. In this location-scale setting, Corollary 2.3 enables the comparison of the efficiency of the distance standard deviation to other estimators of spread. For any $\sqrt{n}$-consistent and asymptotically normal estimator $s_n(\mathbf{X})$, we define the asymptotic variance $\mathrm{ASV}(s_n(\mathbf{X}); F)$ at the distribution $F$ to be the variance of the limiting distribution of $\sqrt{n}\,(s_n(\mathbf{X}) - s(X))$ as $n \to \infty$, where $s_n(\mathbf{X})$ is evaluated at an i.i.d. sequence drawn from $X \sim F$ and $s(X)$ denotes the corresponding population value of $s_n(\mathbf{X})$. While two scale estimators (i.e., estimators satisfying property (C2) in Section 1) $s_n^{(1)}(\mathbf{X})$ and $s_n^{(2)}(\mathbf{X})$ may converge to different population values
$s_1(X)$ and $s_2(X)$, respectively, $s_n^{(2)}(\mathbf{X})$ can be made consistent for $s_1(X)$ within the considered location-scale family by multiplying it with the factor $s_1(Z)/s_2(Z) = s_1(X)/s_2(X)$. Thus we define the asymptotic relative efficiency for scale estimators [6, Eq. (2.1)] of $s_n^{(1)}(\mathbf{X})$ with respect to $s_n^{(2)}(\mathbf{X})$ at the population distribution $F$ as
$$\mathrm{ARE}\big( s_n^{(1)}(\mathbf{X}), s_n^{(2)}(\mathbf{X}); F \big) = \frac{\mathrm{ASV}\big( s_n^{(2)}(\mathbf{X}); F \big) / (s_2(X))^2}{\mathrm{ASV}\big( s_n^{(1)}(\mathbf{X}); F \big) / (s_1(X))^2}. \tag{2.14}$$
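When the asymptotic variances in (2.14) are not available in closed form, the ARE can be approximated by simulation. The following rough Monte Carlo sketch is ours; it approximates each ASV by $n$ times the sampling variance of the estimator and each population value by the mean estimate, with arbitrary illustrative defaults for the sample size and the number of replications.

```python
def mc_are(sampler, est1, est2, n=200, reps=2000, seed=2):
    """Rough Monte Carlo approximation of the ARE (2.14) of est1 w.r.t. est2."""
    rng = np.random.default_rng(seed)
    e1 = np.array([est1(sampler(rng, n)) for _ in range(reps)])
    e2 = np.array([est2(sampler(rng, n)) for _ in range(reps)])
    # ASV(s_n; F) ~ n * Var(s_n); population values ~ mean estimates.
    return (n * e2.var() / e2.mean() ** 2) / (n * e1.var() / e1.mean() ** 2)
```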
We consider as alternatives to $\mathcal{V}_n(\mathbf{X})$ the empirical standard deviation,
$$\widehat{\sigma}_n(\mathbf{X}) = \bigg( \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X}_n)^2 \bigg)^{1/2}, \tag{2.15}$$
where $\bar{X}_n$ denotes the sample mean of $\mathbf{X}$; the empirical mean deviation,
$$\widehat{d}_n(\mathbf{X}) = \frac{1}{n} \sum_{i=1}^{n} |X_i - m_n(\mathbf{X})|, \tag{2.16}$$
where $m_n(\mathbf{X})$ denotes the sample median of $\mathbf{X}$; and Gini's mean difference,
$$\widehat{\Delta}_n(\mathbf{X}) = \frac{2}{n(n-1)} \sum_{1 \leq i < j \leq n} |X_i - X_j|. \tag{2.17}$$
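These three estimators are one-liners in NumPy. The sketch below (ours) implements (2.15)–(2.17) and, combined with mc_are and vn_sq from the earlier sketches, gives a quick empirical ARE comparison at a heavy-tailed $t_5$ distribution, our illustrative choice:

```python
def sd_n(x):
    """Empirical standard deviation (2.15)."""
    x = np.asarray(x, dtype=float)
    return np.sqrt(((x - x.mean()) ** 2).sum() / (len(x) - 1))

def md_n(x):
    """Empirical mean deviation about the sample median, (2.16)."""
    x = np.asarray(x, dtype=float)
    return np.abs(x - np.median(x)).mean()

def gini_n(x):
    """Empirical Gini mean difference (2.17)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    # Sum over all ordered pairs equals twice the sum over i < j.
    return np.abs(x[:, None] - x[None, :]).sum() / (n * (n - 1))

# ARE of V_n relative to Gini's mean difference at the t_5 distribution.
vn = lambda x: np.sqrt(max(vn_sq(x), 0.0))
print(mc_are(lambda rng, n: rng.standard_t(5, size=n), vn, gini_n))
```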