Distance Correlation for Vectors: A SAS® Macro

Thomas E. Billings, MUFG Union Bank, N.A., San Francisco, California

This work by Thomas E. Billings is licensed (2016) under a Creative Commons Attribution 4.0 International License.

ABSTRACT

The Pearson correlation coefficient is well known and widely used. However, it suffers from certain constraints: it is a measure of linear dependence (only) and does not provide a test of statistical independence, and it is restricted to univariate random variables. Since its inception, related and alternative measures have been proposed to mitigate these constraints, and several new measures to replace or supplement Pearson correlation have appeared in the statistical literature in recent years. Székely et al. (2007) describe a new measure, distance correlation, that overcomes the shortcomings of Pearson correlation. Distance correlation is defined for 2 random variables X, Y (which can be vectors) as a weight or distance function applied to the difference between the joint characteristic function for (X,Y) and the product of the individual characteristic functions for X, Y. In practice it is estimated by computing the individual distance matrices for X, Y, and distance correlation is a similarity measure for the 2 matrices. For the bivariate normal case, distance correlation is a function of Pearson correlation. Distance correlation also supports a related test of statistical independence, and has performed well in simulation studies comparing it with other alternatives to Pearson correlation. Here we present a Base SAS® macro to compute distance correlation for arbitrary real vectors.

BACKGROUND: PEARSON CORRELATION COEFFICIENT AND ALTERNATIVES

The Pearson correlation coefficient is the most common and widely used correlation measure, supported in virtually all major statistical software systems. It has a (very) close association with the method of linear regression.
The coefficient was derived by Karl Pearson, based on work done by Francis Galton in the 1880s. The coefficient is defined as:

$$\rho = \mathrm{CORR}(X,Y) = \frac{\mathrm{Covariance}(X,Y)}{\mathrm{SD}(X)\,\mathrm{SD}(Y)}$$

where $\mathrm{SD}(\cdot)$ = standard deviation = $\sqrt{\mathrm{variance}(\cdot)}$. The $R^2$ statistic commonly computed in linear regression is the square of the Pearson correlation between the dependent variable (Y) and the fitted regression estimates.

The Pearson correlation coefficient has a number of major constraints:
• it is defined for univariate random variables only, and
• it does not serve as a measure of general statistical dependence or independence; instead it is primarily a measure of linear dependence.

Regarding the latter point, the standard example found in some statistical texts is: let $Y = X^2$ with X distributed symmetrically about zero; then cov(X,Y) = 0 even though Y is a function of X; e.g., see DeGroot (1975).

Numerous measures have been proposed as alternatives to overcome some of the limitations of Pearson correlation. Alternatives date back many years and are not limited to recent efforts. Some examples that are supported in SAS® PROC CORR include the following:
• Spearman rank,
• Kendall’s tau-b,
• Hoeffding’s D statistic.

PROC CORR also supports partial and polychoric correlation measures. Some interesting alternative statistics proposed in recent years include:
• Maximal information coefficient (Reshef et al. 2011): prematurely and incorrectly promoted as “a correlation for the 21st century” by Terry Speed (2011); however, subsequent research reveals it is badly flawed, to say the least; see Kinney & Atwal (2014), Simon & Tibshirani (2014);
• Mutual information: the Kullback-Leibler divergence between the joint probability distribution for (X,Y) and the product of the individual probability distributions for X, Y;
• Copula correlation (Ding & Li 2013);
• Distance correlation (Székely et al. 2007)
and variations of the above statistics.
There is a rich statistical literature on correlation and related association measures; the above is just a sample, and additional measures have been proposed in the literature. Simulation studies comparing alternative measures show mixed results based on differences in methodology and sample sizes. However, a number of studies that use data with noise suggest that distance correlation may be the best of the recent alternatives; see Simon & Tibshirani (2014) and Clark (2013).

DISTANCE CORRELATION

Distance correlation is a new measure (first described in 2007 and expanded in 2009) that overcomes the major limitations of Pearson correlation, i.e.:
• it is defined for random variables of arbitrary dimension; in fact X & Y can have different dimensions so long as you use a conformable weight or distance function, and
• it provides/supports a test of general statistical independence.

The basic properties of the measure are:
• it takes values in the closed interval [0,1]; in contrast, Pearson correlation takes values in the closed interval [-1,1], and
• for the bivariate normal case, distance correlation is a function of the Pearson correlation ρ.

Formal definition. Consider 2 sequences of real-valued vectors $X_i$, $Y_i$, i = 1,…,n; X has dimension p, Y has dimension q, where p, q are positive integers. It is possible that p = q but this is not required. At this stage we digress to remind readers of some basics: the characteristic function is defined as $f(t) = E(e^{itX})$, also known as the Fourier-Stieltjes transform of the CDF (cumulative distribution function) F. For additional background on characteristic functions, see Chung (1974).

Now let $f_X$, $f_Y$ denote the characteristic functions for random variables X, Y, and $f_{X,Y}$ the joint characteristic function, subject to one requirement: X and Y both have finite first moments (in the Euclidean norm). Distance covariance is defined as

$$\mathrm{dcov}(X,Y) = \| f_{X,Y}(t,s) - f_X(t)\, f_Y(s) \|_w^2$$

where $\|\cdot\|_w$ denotes the norm taken with respect to an arbitrary positive weight (distance) function w defined on $R^{p+q}$ that is integrable in $L^2$.
Distance variance, denoted dvar(·), has the similar obvious definition, substituting X for Y in the above. Similar to Pearson correlation, we then define the distance correlation between X and Y as:

$$\mathrm{dcor}(X,Y) = \frac{\mathrm{dcov}(X,Y)}{\sqrt{\mathrm{dvar}(X)\,\mathrm{dvar}(Y)}} \qquad (1)$$

Note that distance covariance and correlation can be defined without use of characteristic functions. Lyons (2013) characterizes the metric spaces where distance correlation supports a test of independence as being of “strong negative type” where “negative type is equivalent to a certain property of embeddability into Hilbert space”. Interested readers are encouraged to review Lyons (2013) for details.

Estimators. Simplifying the notation somewhat, let X, Y be real-valued vector sequences with n rows that have no missing values. X & Y can be the same or different dimensions, and rows with missing values can be filtered out or filled with proxy values. Let:

$$a_{kl} = |X_k - X_l|_p$$

= distance between rows k & l in the X variable vector sequence. Construct an n × n distance matrix with all values of $a_{kl}$. This will be a symmetric matrix with a zero diagonal. Define marginal and grand means for the distance matrix in the usual way, i.e., $\bar{a}_{k\cdot}$, $\bar{a}_{\cdot l}$, $\bar{a}_{\cdot\cdot}$; then define the adjusted distance matrix for X:

$$A_{kl} = a_{kl} - \bar{a}_{k\cdot} - \bar{a}_{\cdot l} + \bar{a}_{\cdot\cdot}$$

Define a similar distance matrix for the Y variable vector sequence, denoted as $b_{kl}$ and $B_{kl}$. The estimator for distance covariance is then:

$$\mathrm{dcov}(X,Y) = n^{-2} \sum_{k,l} A_{kl}\, B_{kl}$$

and dvar(z) has the similar definition setting X = Y = z. Distance correlation is then estimated using the estimators for dcov and dvar in equation (1) above. A statistic that provides a test of independence based on distance correlation is also available and described in Theorem 6 of Székely et al. (2007).
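The estimators above can be sketched in a few lines of Python (standard library only). This is a minimal illustration under stated assumptions, not the SAS macro code from the Appendix: the function names (`distance_matrix`, `double_center`, `dcov`, `dist_corr`, `pearson`) are hypothetical, distances are Euclidean, and each observation is passed as a tuple even in the univariate case. One point of notation to keep in mind: in Székely et al. (2007) the sum $n^{-2}\sum A_{kl}B_{kl}$ estimates the squared distance covariance, so the ratio in equation (1) corresponds to the squared distance correlation there. The sketch returns both the ratio and its square root so that either convention can be compared with other software.

```python
import math

def distance_matrix(rows):
    """n x n Euclidean distance matrix a_kl between rows (tuples of floats)."""
    return [[math.dist(r, s) for s in rows] for r in rows]

def double_center(d):
    """Adjusted matrix A_kl = a_kl - rowmean_k - colmean_l + grandmean."""
    n = len(d)
    row_mean = [sum(row) / n for row in d]
    col_mean = [sum(d[k][l] for k in range(n)) / n for l in range(n)]
    grand = sum(row_mean) / n
    return [[d[k][l] - row_mean[k] - col_mean[l] + grand for l in range(n)]
            for k in range(n)]

def dcov(A, B):
    """n^-2 * sum over k,l of A_kl * B_kl for two adjusted matrices."""
    n = len(A)
    return sum(A[k][l] * B[k][l] for k in range(n) for l in range(n)) / n ** 2

def dist_corr(x_rows, y_rows):
    """Return (ratio from equation (1), square root of that ratio)."""
    A = double_center(distance_matrix(x_rows))
    B = double_center(distance_matrix(y_rows))
    denom = math.sqrt(dcov(A, A) * dcov(B, B))
    # Convention assumed here: report 0 when either distance variance is 0.
    ratio = dcov(A, B) / denom if denom > 0 else 0.0
    return ratio, math.sqrt(max(ratio, 0.0))

def pearson(x, y):
    """Plain Pearson correlation for two equal-length lists of floats."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# The textbook example: Y = X^2 with X symmetric about zero.
x = [-2.0, -1.0, 0.0, 1.0, 2.0]
y = [v * v for v in x]
r = pearson(x, y)
ratio, root = dist_corr([(v,) for v in x], [(v,) for v in y])
print("Pearson:", r)
print("equation (1) ratio:", ratio, "square root:", root)
```

On this small sample the Pearson correlation is zero while the distance-based ratio is strictly positive, which is the behavior the Y = X² example is meant to illustrate. The nested-list approach costs O(n²) memory and time, the same order as the distance matrices themselves.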
A less technical way to describe distance correlation is as follows:
• given 2 random variables X, Y which can be vectors or other non-univariate variables,
• construct the square distance matrix for each variable;
• distance correlation is then a measure of the similarity between the 2 distance matrices, i.e.,
• is the set of changes in distance, say between rows k & l, similar across the 2 distance matrices?

IMPLEMENTATION IN SAS

The estimators for distance correlation require the computation of multiple matrices and marginal vectors, and could easily be done in SAS/IML. However:
• SAS/IML is not widely used or available (most commercial sites do not have the product licensed), and
• SAS/IML is an old and dated product compared to open-source alternatives.

Instead, the Base SAS product is used here, since every SAS site has it. The general processing outline is:
1. compute distance matrices using PROC DISTANCE;
2. compute marginals, sums, and grand means using PROC SUMMARY;
3. use DATA steps for the other calculations, and
4. encapsulate the code for parts 1-3 above in macros to support custom applications.

A set of SAS macros was developed for this project:
• given a data set, compute the square distance matrix for selected variables (Euclidean distance): macro %raw_dist_mtx,
• given a data set, compute marginal means and sums, and also grand means and sums: macro %mgn_plus,
• given distance matrices and marginals, compute adjusted distance matrices per the algorithm: macro %int_adj_matrices,
• given distance matrices in data sets, compute the distance correlation and associated test statistic for a user-specified critical value: macro %dist_corr.

Code for the macros above can be found in the Appendix. The code is released under the BSD 2-clause open-source license, which allows reuse for both commercial and non-commercial applications.
VALIDATION: COMPARE TO R DCOR FUNCTION

While the computations required for the estimator are straightforward, savvy readers will understand that errors are (always) possible in implementation, so can we validate the SAS macros against another calculation of distance correlation? The answer is yes: distance correlation is available in the R package energy (Rizzo and Székely, 2014), with R functions developed and checked by the same team that wrote the 2007 paper.

Next, we need a data set on which to compute distance correlation, using both SAS and R. The well-known “iris” data set is available in R and also in the SASHELP library as sashelp.iris. We choose here to use the iris data from R rather than from SAS, for copyright reasons.
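Alongside a comparison with R, any implementation of the estimator can be checked against a few properties that follow directly from the definitions. The Python sketch below is an illustration only and is not part of the validation described in this paper. The helper `dcor_ratio` and the data rows are hypothetical, the rows are arbitrary made-up values (not the iris data), and the function returns the ratio from equation (1). The properties checked are: a variable paired with itself gives 1, the result is symmetric in X and Y, shifting X and rescaling it by a positive constant leaves the result unchanged, values lie in [0,1], and a constant Y gives 0 under the convention assumed here for zero distance variance.

```python
import math

def dcor_ratio(x_rows, y_rows):
    """Ratio dcov / sqrt(dvar_x * dvar_y) from adjusted Euclidean distance matrices."""
    def centered(rows):
        n = len(rows)
        d = [[math.dist(r, s) for s in rows] for r in rows]
        m = [sum(row) / n for row in d]   # row means; equal to column means by symmetry
        g = sum(m) / n                    # grand mean
        return [[d[k][l] - m[k] - m[l] + g for l in range(n)] for k in range(n)]

    def inner(P, Q):
        n = len(P)
        return sum(p * q for rp, rq in zip(P, Q) for p, q in zip(rp, rq)) / n ** 2

    A, B = centered(x_rows), centered(y_rows)
    denom = math.sqrt(inner(A, A) * inner(B, B))
    # Assumed convention: return 0 when either distance variance is 0.
    return inner(A, B) / denom if denom > 0 else 0.0

# Arbitrary illustrative rows: X has dimension 2, Y has dimension 1.
x = [(0.0, 1.0), (2.0, -1.0), (3.5, 0.5), (-1.0, 4.0), (1.5, 2.5)]
y = [(1.0,), (0.0,), (4.0,), (2.0,), (-3.0,)]
x_shifted_scaled = [(3.0 * a + 10.0, 3.0 * b - 7.0) for a, b in x]
y_constant = [(2.0,)] * len(y)

checks = {
    "self": dcor_ratio(x, x),
    "xy": dcor_ratio(x, y),
    "yx": dcor_ratio(y, x),
    "rescaled": dcor_ratio(x_shifted_scaled, y),
    "constant": dcor_ratio(x, y_constant),
}
for name, value in checks.items():
    print(name, value)
```

If an implementation in another system fails one of these checks on the same inputs, the discrepancy is usually easier to trace than a mismatch on a full data set.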
