Distance Correlation: A New Tool for Detecting Association and Measuring Correlation Between Data Sets

Donald Richards

Penn State University

This talk is about “cause” and “effect”

Does “smoking” cause “lung cancer”?

Do “strong gun laws” lower “homicide rates”?

Does “aging” raise “male unemployment rates”?

The difficulty of establishing causation has fascinated mankind since time immemorial.

Democritus: “I would rather discover one cause than gain the kingdom of Persia.”

A simpler problem: Establishing association.

The variables X and Y are associated if changes in one of them tend to be accompanied by changes in the other.

Positive association: Higher values of X tend to be accompanied by higher values of Y .

Negative association: Higher values of X tend to be accompanied by lower values of Y .

Causation implies association.

But association does not imply causation.

“Foot size” and “Reading ability” are positively associated, but big feet do not cause higher reading ability.

Nor does better reading ability cause feet to grow larger.

There are numerous ways to detect whether two variables are associated.

Scatterplots
Graphs
Correlation coefficients
Hypothesis testing

An example of hypothesis testing: Is this deck of cards “fair”?

Are “Smoking rate” and “lung cancer rate” associated?

Here is a scary graph:

The National Cancer Institute: Understanding Cancer, 2003. A 20-year time-lag between smoking rates and incidence of lung cancer; we infer a positive association between smoking and lung cancer rates in U.S. males.

How about “Strength of gun laws” and “Homicide rate”?

Here is another thought-provoking graph:

Eugene Volokh, “Zero correlation between state homicide rate and state gun laws,” The Washington Post, Oct. 6, 2015.

Lawrence H. Summers, “A disaster is looming for American men,” The Washington Post, Sept. 26, 2016.

These two examples are invalid applications of linear regression.

They violate the mathematical assumptions underlying linear regression.

No analysis of potential outlier points or heterogeneity of the y-values.

Invalid use of the Pearson correlation coefficient.

It is often unwise to apply linear regression to percentage data.

Random variables: X and Y

X: The average height of a randomly chosen couple

Y : The height of that couple’s adult child

µ1 = E(X): The population mean of X

µ2 = E(Y ): The population mean of Y

Var(X) = E[(X − µ1)²]: The population variance of X

Cov(X, Y) = E[(X − µ1)(Y − µ2)]: The population covariance between X and Y

The Pearson correlation coefficient:

ρ = Cov(X, Y) / √( Var(X) · Var(Y) )

A random sample from (X, Y ): (x1, y1),..., (xn, yn)

Sample means:  x̄ = (1/n) Σ_{i=1}^{n} x_i,  ȳ = (1/n) Σ_{i=1}^{n} y_i

The empirical Pearson correlation coefficient:

r = [ (1/n) Σ_{i=1}^{n} (x_i − x̄)(y_i − ȳ) ] / [ √( (1/n) Σ_{i=1}^{n} (x_i − x̄)² ) · √( (1/n) Σ_{i=1}^{n} (y_i − ȳ)² ) ]
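For concreteness, the empirical coefficient above is a few lines of code; a minimal numpy sketch (function and data names are illustrative):

```python
import numpy as np

def pearson_r(x, y):
    """Empirical Pearson coefficient: the centered cross-moment divided by
    the product of the (1/n-convention) standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    return (xc * yc).mean() / np.sqrt((xc ** 2).mean() * (yc ** 2).mean())

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
print(pearson_r(x, 2.0 * x + 1.0))   # exact linear relation: r = 1.0
```

The 1/n factors cancel between numerator and denominator, so this agrees with `np.corrcoef` up to floating-point rounding.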

The Pearson coefficient is unchanged if we shift or dilate the x- or y-values.

The coefficient is very useful for graphical data analysis.

These graphical methods led Galton to discover regression to the mean.

However, the coefficient applies only to scalar random variables.

The coefficient often is inapplicable if (X, Y) are non-linearly related, e.g., “College faculty salaries” and “Liquor sales.”

Even if r = 0, we cannot infer independence between X and Y .
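That last point is easy to demonstrate numerically. In this sketch (illustrative data), Y is a deterministic function of X, yet the Pearson coefficient is essentially zero because the relationship is non-monotone:

```python
import numpy as np

x = np.linspace(-1.0, 1.0, 101)   # grid symmetric about 0
y = x ** 2                        # perfectly dependent on x, but not linearly

r = np.corrcoef(x, y)[0, 1]
print(abs(r))                     # essentially 0, despite exact dependence
```

By symmetry Cov(X, X²) = E[X³] = 0, so r vanishes even though Y is completely determined by X.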

We desire a new correlation coefficient R such that:

• R is applicable to random vectors X and Y.
• R = 0 if and only if X and Y are mutually independent.
• R is unchanged by affine linear transformations on X or Y.
• R provides better data analyses than the Pearson coefficient.
• Etc., etc.

Distance correlation

Feuerverger (1993), Székely, Rizzo, and Bakirov (2007, 2009)

p and q: positive integers

Column vectors: s = (s_1, . . . , s_p)′ ∈ ℝ^p, t = (t_1, . . . , t_q)′ ∈ ℝ^q

Euclidean norms:

‖s‖ = (s_1² + · · · + s_p²)^{1/2} and ‖t‖ = (t_1² + · · · + t_q²)^{1/2}

Jointly distributed random vectors: X ∈ ℝ^p and Y ∈ ℝ^q

The joint characteristic function (Fourier transform):

ψ_{X,Y}(s, t) = E[exp(√−1 (s′X + t′Y))]

The marginal characteristic functions:

ψ_X(s) = E[exp(√−1 s′X)] and ψ_Y(t) = E[exp(√−1 t′Y)]

A well-known result: X and Y are mutually independent iff

ψ_{X,Y}(s, t) = ψ_X(s) ψ_Y(t) for all s and t.

We will measure the dependence between X and Y by a weighted L²-distance:

‖ ψ_{X,Y}(s, t) − ψ_X(s) ψ_Y(t) ‖_{L²(ℝ^p × ℝ^q)}

The distance covariance between X and Y :

V(X, Y) = [ (1 / (γ_p γ_q)) ∫_{ℝ^{p+q}} |ψ_{X,Y}(s, t) − ψ_X(s) ψ_Y(t)|² ds dt / (‖s‖^{p+1} ‖t‖^{q+1}) ]^{1/2},

where γ_p and γ_q are suitable constants.

This definition makes sense even if X = Y .

The distance variance of X:

V(X, X) = [ (1 / γ_p²) ∫_{ℝ^{2p}} |ψ_X(s + u) − ψ_X(s) ψ_X(u)|² ds du / (‖s‖^{p+1} ‖u‖^{p+1}) ]^{1/2}

The distance correlation between X and Y :

R(X, Y) = V(X, Y) / √( V(X, X) · V(Y, Y) )

A characterization of independence:

If X and Y have finite first moments, then 0 ≤ R(X, Y) ≤ 1, and R(X, Y) = 0 if and only if X and Y are independent.

An Invariance Property:

R(X, Y) is invariant under the transformation

(X, Y) ↦ (a_1 + b_1 C_1 X, a_2 + b_2 C_2 Y),

where a_1 ∈ ℝ^p, a_2 ∈ ℝ^q; non-zero b_1, b_2 ∈ ℝ; and C_1 ∈ ℝ^{p×p} and C_2 ∈ ℝ^{q×q} are orthogonal matrices.

Note: R(X, Y) is not invariant under all affine transformations.

Calculating Pearson’s correlation coefficient often is easy.

Calculating the distance correlation coefficient often is difficult.

The singular integral

∫_{ℝ^{p+q}} |ψ_{X,Y}(s, t) − ψ_X(s) ψ_Y(t)|² ds dt / (‖s‖^{p+1} ‖t‖^{q+1})

cannot be calculated by expanding the factor

|ψ_{X,Y}(s, t) − ψ_X(s) ψ_Y(t)|²

and then integrating term-by-term.

Dueck, Edelmann, and D.R., J. Multivariate Analysis, 2017

The empirical distance correlation

Random sample: (X_1, Y_1), . . . , (X_n, Y_n) from (X, Y)

Data matrices: X = [X_1, . . . , X_n], Y = [Y_1, . . . , Y_n]

The empirical joint characteristic function:

ψ_{X,Y}(s, t) = (1/n) Σ_{j=1}^{n} exp(√−1 (s′X_j + t′Y_j))

The empirical marginal characteristic functions:

ψ_X(s) = (1/n) Σ_{j=1}^{n} exp(√−1 s′X_j),  ψ_Y(t) = (1/n) Σ_{j=1}^{n} exp(√−1 t′Y_j)

The empirical distance covariance:

V_n(X, Y) = [ (1 / (γ_p γ_q)) ∫_{ℝ^{p+q}} |ψ_{X,Y}(s, t) − ψ_X(s) ψ_Y(t)|² ds dt / (‖s‖^{p+1} ‖t‖^{q+1}) ]^{1/2}

ψ_{X,Y}, ψ_X, and ψ_Y are sums of exponential terms.

V_n(X, Y) surely has to be a complicated function of the data. And yet, . . .

Feuerverger (1993) and Székely et al. (2007) derived an explicit formula for V_n(X, Y):

Define

a_kl = ‖X_k − X_l‖,  ā_{k·} = (1/n) Σ_{l=1}^{n} a_kl,  ā_{·l} = (1/n) Σ_{k=1}^{n} a_kl,

ā_{··} = (1/n²) Σ_{k,l=1}^{n} a_kl,  A_kl = a_kl − ā_{k·} − ā_{·l} + ā_{··}

Define b_kl = ‖Y_k − Y_l‖ and b̄_{k·}, b̄_{·l}, b̄_{··}, B_kl similarly. Then

V_n(X, Y) = [ (1/n²) Σ_{k,l=1}^{n} A_kl B_kl ]^{1/2}

[V_n(X, Y)]² is the average entry in the Schur product of the centered distance matrices for X and Y.
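The Schur-product formula translates into a few lines of code. A numpy sketch (an illustration, not the authors' implementation; for production use see the energy package cited in the references):

```python
import numpy as np

def centered_distances(z):
    """Pairwise Euclidean distance matrix, double-centered:
    A_kl = a_kl - row mean - column mean + grand mean."""
    z = np.asarray(z, dtype=float)
    if z.ndim == 1:
        z = z[:, None]                 # treat a 1-D sample as n points in R^1
    d = np.linalg.norm(z[:, None, :] - z[None, :, :], axis=2)
    return (d - d.mean(axis=1, keepdims=True)
              - d.mean(axis=0, keepdims=True) + d.mean())

def distance_correlation(x, y):
    """Empirical distance correlation R_n: V_n^2 is the average entry of the
    Schur (entrywise) product of the centered distance matrices."""
    A, B = centered_distances(x), centered_distances(y)
    v2_xy = max((A * B).mean(), 0.0)   # guard against tiny negative rounding
    denom = np.sqrt((A * A).mean() * (B * B).mean())
    return 0.0 if denom == 0 else float(np.sqrt(v2_xy / denom))

rng = np.random.default_rng(0)
x = rng.normal(size=200)
print(distance_correlation(x, x ** 2))   # clearly positive, unlike Pearson r
```

An exact affine relation Y = a + bX (b ≠ 0) gives R_n = 1, in line with the R_n = 1 property stated later.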

A crucial singular integral: For x ∈ ℝ^p and Re(α) ∈ (0, 2),

∫_{ℝ^p} ‖s‖^{−(p+α)} [1 − exp(√−1 s′x)] ds = const. · ‖x‖^α,

with absolute convergence for all x.

Ongoing work with Dueck, Edelmann, Sahi: Generalizations to matrix spaces and symmetric cones.

The empirical distance correlation:

R_n(X, Y) = V_n(X, Y) / √( V_n(X, X) · V_n(Y, Y) )

Some properties:

• 0 ≤ R_n(X, Y) ≤ 1

• R_n(X, Y) = 1 implies that: (a) p = q; (b) the linear spaces spanned by X and Y have full rank; and (c) there exists an affine relationship between X and Y.

The distance correlation coefficient has been found to:

Exhibit higher statistical power (i.e., fewer false negatives) than the Pearson coefficient

Detect nonlinear associations that were not found by the Pearson coefficient, and

Locate smaller sets of variables that provide equivalent statistical information.

Astrophysical data

Mercedes Richards, Elizabeth Martínez-Gómez, and D.R., Astrophysical Journal, 2014

The COMBO-17 database

“Classifying Objects by Medium-Band Observations in 17 filters”

High-dimensional data on many astrophysical objects in the Chandra Deep Field South.

“A high-resolution processed image of the Chandra Deep Field South, taken from the European Southern Observatory’s catalog of images of the region. The image shows tens of thousands of galaxies, thousands of stars, and hundreds of quasars.”

The COMBO-17 database

Astrophysical observations on 63,501 galaxies, stars, quasars, and other unclassified objects, with various brightness measurements in 17 “passbands” over the wavelength range 3500–9300 Å (Wolf et al., 2003)

Why do astronomers collect brightness measurements at many different wavelengths?

The Crab Nebula, from the supernova of 1054 AD

General Information

Rmag: Total R-band magnitude
mu_max: Central surface brightness
MajAxis: Major axis
MinAxis: Minor axis
PA: Position angle

Classification Results

MC_z: Mean redshift in distribution p(z)
MC_z2: Alternative redshift if distribution p(z) is bimodal
MC_z_ml: Peak redshift in distribution p(z)
dl: Luminosity distance of MC_z

Total Object Restframe Luminosities

BjMag: M_abs,gal in Johnson B (z ≈ [0.0, 1.1])
rsMag: M_abs,gal in SDSS r (z ≈ [0.0, 0.5])
S280Mag: M_abs,gal in 280/40 (z ≈ [0.25, 1.3])

Observed Seeing-Adaptive Aperture Fluxes

W420F_E: Photon flux in filter 420 in run E
W462F_E: Photon flux in filter 462 in run E
W485F_D: Photon flux in filter 485 in run D
W518F_E: Photon flux in filter 518 in run E
W571F_D: Photon flux in filter 571 in run D
W571F_E: Photon flux in filter 571 in run E
W604F_E: Photon flux in filter 604 in run E
W646F_D: Photon flux in filter 646 in run D
W696F_E: Photon flux in filter 696 in run E
W753F_E: Photon flux in filter 753 in run E
W815F_E: Photon flux in filter 815 in run E
W856F_D: Photon flux in filter 856 in run D
W914F_D: Photon flux in filter 914 in run D
W914F_E: Photon flux in filter 914 in run E
UF_F: Photon flux in filter U in run F
BF_D: Photon flux in filter B in run D
BF_F: Photon flux in filter B in run F
VF_D: Photon flux in filter V in run D
RF_D: Photon flux in filter R in run D
RF_E: Photon flux in filter R in run E
RF_F: Photon flux in filter R in run F

Type 1: Elliptical, bulge S0
Type 2: Spiral, Sa–Sbc
Type 3: Spiral, Sbc
Type 4: Starburst

Number of galaxies by type and redshift

Type of Galaxies

Redshift       Type 1   Type 2   Type 3   Type 4   Total
0 ≤ z < 0.5    38       45       328      3254     3665
0.5 ≤ z < 1    50       19       277      9284     9630
1 ≤ z < 2      16       4        109      1928     2057
Total          104      68       714      14466    15352

33 variables

Number of pairs of variables: (33 choose 2) = 528

For each of these 528 pairs, and for each of the 12 combinations of (Redshift, Galaxy Type), we calculated the empirical:

Pearson correlation coefficient
MIC (maximal information coefficient)
Distance correlation coefficient
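As a sanity check on the combinatorics, and to show the shape of such a screening loop, here is a schematic numpy sketch; the data matrix is a random stand-in, not COMBO-17:

```python
import numpy as np
from itertools import combinations
from math import comb

assert comb(33, 2) == 528          # unordered pairs of 33 variables

rng = np.random.default_rng(1)
data = rng.normal(size=(100, 33))  # stand-in: 100 objects x 33 variables

# One coefficient per unordered pair; the real analysis repeated this for
# Pearson, MIC, and distance correlation in each (redshift, type) cell.
pearson = {(i, j): np.corrcoef(data[:, i], data[:, j])[0, 1]
           for i, j in combinations(range(33), 2)}
print(len(pearson))                # 528 pairs
```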

Exploratory distance correlation analysis for some COMBO-17 data

Plots of Pearson correlation coefficients vs. MIC scores (top row) and vs. distance correlation (bottom row) for redshifts z ∈ [0, 0.5) and three types of galaxies.

The distance correlation coefficient was superior to other measures of correlation in resolving data into concentrated horseshoe- or V-shapes.

This led to more accurate classification of galaxies and identification of outlying pairs of variables.

These results held for a range of redshifts beyond the Local Universe and for galaxies ranging from elliptical to spiral.

The Types 2 and 3 spiral galaxies seem to be contaminated by Type 4 starburst galaxies, consistent with earlier findings of other astrophysicists.

We found further evidence of this contamination from the V-shaped scatter plots for galaxies of Types 2 or 3.

The subplots for each galaxy type are shown in the four middle frames, and their superposition in the large left frame.

Potential outlying pairs in the scatter plot for Type 3 galaxies are circled in the large right frame.

Return to “Strength of state gun laws” and “State homicide rate”

Wouldn’t it be better to analyze the relationship between “Strength of state gun laws” and “All gun-related deaths”?

Horizontal: All gun-related deaths (per 100,000 people), by state and D.C. Vertical: Brady Campaign gun law score for each state and D.C.

We applied several measures of correlation to the data on “Strength of state gun laws” vs. “State homicide rate”

None of these measures were able to detect a statistically significant relationship between these two variables.

We applied these measures of correlation to the data on “Strength of state gun laws” vs. “All gun-related deaths”.

All of these measures reported a statistically significant relationship.

All the corresponding p-values were nearly zero.
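The p-values referred to here can be obtained without distributional assumptions via a permutation test: repeatedly shuffle one variable, recompute the statistic, and count how often the shuffled statistic is at least as extreme as the observed one. A generic sketch (Pearson r serves as the statistic for brevity; the same scheme applies verbatim to distance correlation):

```python
import numpy as np

def permutation_pvalue(x, y, stat, n_perm=999, seed=0):
    """Permutation p-value for the null of independence: the proportion of
    permuted statistics at least as extreme as the observed one."""
    rng = np.random.default_rng(seed)
    observed = abs(stat(x, y))
    hits = sum(abs(stat(x, rng.permutation(y))) >= observed
               for _ in range(n_perm))
    return (hits + 1) / (n_perm + 1)   # add-one correction keeps p > 0

stat = lambda a, b: np.corrcoef(a, b)[0, 1]
rng = np.random.default_rng(42)
x = rng.normal(size=60)
y = 3.0 * x + rng.normal(size=60)      # strongly dependent pair
print(permutation_pvalue(x, y, stat))  # small p-value: dependence detected
```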

Distance correlation has its limitations.

Logistic regression: Estimating odds ratios.

If the patient exhibits certain symptoms then the odds are high that immediate surgery is called for.

If interest rates are below 4% and . . . then the odds are high that you are better off buying rather than renting.

If . . . then . . .

Dominic Edelmann: simulations suggesting that logistic regression often is better than distance correlation at finding the variables best suited for estimating odds ratios.
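For readers unfamiliar with the term: the odds ratio for a covariate in logistic regression is the exponential of its fitted coefficient. A self-contained numpy sketch on simulated data (plain gradient ascent; an illustration, not Edelmann's simulation study):

```python
import numpy as np

def fit_logistic(X, y, lr=0.5, steps=5000):
    """Logistic regression by gradient ascent on the mean log-likelihood.
    Returns the coefficient vector, intercept first."""
    Xb = np.column_stack([np.ones(len(X)), X])
    w = np.zeros(Xb.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-Xb @ w))
        w += lr * Xb.T @ (y - p) / len(y)
    return w

rng = np.random.default_rng(0)
x = rng.normal(size=2000)
logit = 0.5 + 1.0 * x                  # true log-odds ratio per unit of x is 1.0
y = (rng.random(2000) < 1.0 / (1.0 + np.exp(-logit))).astype(float)

w = fit_logistic(x[:, None], y)
print(np.exp(w[1]))                    # fitted odds ratio, close to e = 2.718...
```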

To complicate matters, statistical methods can flip-flop depending on the time horizon.

If interest rates are above 12% and . . . then the odds are high that you are better off renting, but only over the short-term.

Distance correlation, or other statistical methods, cannot easily resolve short-term vs. long-term decisions.

My advice: Also listen to those who have extensive practical experience in these matters.

References

Dueck, J., Edelmann, D., and Richards, D. (2017). Distance correlation coefficients for Lancaster distributions. Journal of Multivariate Analysis, vol. 154, 19–39.

Feuerverger, A. (1993). A consistent test for bivariate dependence. International Statistical Review, vol. 61, 419–433.

Martínez-Gómez, E., Richards, M. T., and Richards, D. (2014). Distance correlation methods for discovering associations in large astrophysical databases. Astrophysical Journal, vol. 781, 39.

Richards, M. T., Richards, D., and Martínez-Gómez, E. (2014). Interpreting the distance correlation results for the COMBO-17 survey. Astrophysical Journal Letters, vol. 784.

Rizzo, M. L. and Székely, G. J. (2011). energy: E-statistics (energy statistics). R package, version 1.4-0; http://cran.us.r-project.org/web/packages/energy/index.html.

Székely, G. J., Rizzo, M. L. and Bakirov, N. K. (2007). Measuring and testing independence by correlation of distances. Annals of Statistics, vol. 35, 2769–2794.

Székely, G. J. and Rizzo, M. L. (2009). Brownian distance covariance. Annals of Applied Statistics, vol. 3, 1236–1265.

Wolf, C., et al. (2003). The COMBO-17 database. Astronomy & Astrophysics, vol. 401, 73–98.
