LOCAL DISTANCE CORRELATION: AN EXTENSION OF LOCAL GAUSSIAN CORRELATION
Walaa Hamdi
A Dissertation
Submitted to the Graduate College of Bowling Green State University in partial fulfillment of the requirements for the degree of
DOCTOR OF PHILOSOPHY
August 2020
Committee:
Maria Rizzo, Advisor
Jari Willing, Graduate Faculty Representative
Wei Ning
Junfeng Shang
Copyright © August 2020, Walaa Hamdi. All rights reserved.
ABSTRACT
Maria Rizzo, Advisor
Distance correlation is a measure of the relationship between random vectors in arbitrary dimension. The sample distance covariance can be formulated as either an unbiased or a biased estimator of distance covariance, and distance correlation is defined as the normalized coefficient of distance covariance. Because the empirical likelihood method fails for nonlinear statistics, the jackknife empirical likelihood for a U-statistic by Jing, Yuan, and Zhou (2009) can be applied to distance correlation. A Wilks' theorem for jackknife empirical likelihood is shown to hold for distance correlation. This research shows how to construct a confidence interval for distance correlation based on the jackknife empirical likelihood for a U-statistic, where the sample distance covariance can be represented as a U-statistic. When coverage probabilities of confidence intervals for distance correlation based on the jackknife empirical likelihood and on the bootstrap method are compared, the jackknife empirical likelihood intervals are more accurate. We propose the estimation and visualization of local distance correlation using a local version of the jackknife empirical likelihood. Kernel density functional estimation is used to construct the jackknife empirical likelihood locally. The bandwidth selection for the kernel function should minimize the distance between the true density and the estimated density. Local distance correlation has the property that it equals zero in the neighborhood of each point if and only if the two variables are independent in that neighborhood. Simulation studies and real examples show that the estimation and visualization of local distance correlation capture local dependence accurately when compared with the local Gaussian correlation.
My thanks to my Mom Nawal and Dad Ahmed; my husband Motaz, for his support; my daughters Rafif, Joanna, and Taleen.
ACKNOWLEDGMENTS
I would like to thank my advisor Dr. Rizzo for her precious advice and guidance to complete this dissertation. I appreciate her way of giving me different viewpoints and suggestions. I also appreciate her time to meet with me to discuss the results in more detail. I am honored that Dr. Rizzo is my dissertation advisor. I express my sincere thanks to committee members Dr. Junfeng Shang, Dr. Wei Ning, and Dr. Jari Willing, who supported me until I completed my degree. I also express my appreciation to all my professors in the Department of Mathematics and Statistics for their help and guidance. I am especially thankful to the Graduate Coordinator Dr. Craig Zirbel for his support of graduate students. I cannot express enough thanks to my husband Motaz and my parents for encouraging me to complete my degree with their best wishes. I will always remember my husband Motaz supporting me and always being by my side. I love my daughters and they always make me happy. I extend my gratitude to all of my family and my husband's family, who directly or indirectly supported me in completing my degree. Finally, I am glad to be a graduate student at Bowling Green State University.
TABLE OF CONTENTS
CHAPTER 1 INTRODUCTION
CHAPTER 2 LITERATURE REVIEW
  2.1 Background on Dependence Coefficients
  2.2 Bivariate Dependence Measure
  2.3 Multivariate Dependence Measure
  2.4 Properties of Dependence Measure
  2.5 Distance Correlation
  2.6 Local Correlation
  2.7 Multiscale Graph Correlation
CHAPTER 3 OVERVIEW OF DISTANCE CORRELATION
  3.1 Distance Correlation
  3.2 Modified Distance Correlation
  3.3 Unbiased Distance Correlation
CHAPTER 4 CONFIDENCE INTERVAL FOR DISTANCE CORRELATION
  4.1 Confidence Intervals for Distance Correlation
    4.1.1 U-statistic Results
    4.1.2 Jackknife Empirical Likelihood for Distance Correlation
  4.2 Simulation Study
  4.3 Real Examples
CHAPTER 5 LOCAL DISTANCE CORRELATION
  5.1 Local Gaussian Correlation
    5.1.1 Estimation of Local Gaussian Correlation
    5.1.2 Choice of Bandwidth for Kernel Function
    5.1.3 Properties of Local Gaussian Correlation
    5.1.4 Global Gaussian Correlation
  5.2 Local Distance Correlation
    5.2.1 Estimation of Local Distance Correlation
    5.2.2 Choice of Bandwidth for Kernel Function
    5.2.3 Properties of Local Distance Correlation
  5.3 Simulation Study
  5.4 Real Examples
    5.4.1 Example 1: Aircraft
    5.4.2 Example 2: Wage
    5.4.3 Example 3: PRIM7
    5.4.4 Example 4: Olive Oils
CHAPTER 6 SUMMARY AND FUTURE WORK
BIBLIOGRAPHY
APPENDIX A SELECTED R PROGRAMS
LIST OF FIGURES
  4.1 Scatterplot matrix of pairwise association of six fatty acids
  5.1 Contour plots of the true density and kernel estimate functions
  5.2 Scatter plot of X and Y
  5.3 Illustration of exchange symmetry
  5.4 Illustration of reflection symmetry
  5.5 Illustration of radial symmetry
  5.6 Scatter plots when rotated by 90° and 180°
  5.7 Illustration of rotation symmetry
  5.8 Scatter plots of different bivariate dependence structures
  5.9 Contour plots of different bivariate dependence structures
  5.10 The visualization of local Gaussian correlation and local distance correlation
  5.11 The visualization of local Gaussian correlation and local distance correlation
  5.12 The visualization of local Gaussian correlation and local distance correlation
  5.13 The visualization of local Gaussian correlation and local distance correlation
  5.14 The visualization of local Gaussian correlation and local distance correlation
  5.15 The visualization of local Gaussian correlation and local distance correlation
  5.16 Scatter and contour plots for aircraft dataset
  5.17 The visualization of local Gaussian correlation and local distance correlation for aircraft dataset
  5.18 Smooth scatter plot for Wage dataset
  5.19 The visualization of local Gaussian correlation and local distance correlation for Wage dataset
  5.20 Scatter and smooth scatter plots for PRIM7 dataset
  5.21 The visualization of local Gaussian correlation and local distance correlation for PRIM7 dataset
  5.22 Smooth scatter plot for oleic and palmitoleic fatty acids
  5.23 The visualization of local Gaussian correlation and local distance correlation for oleic and palmitoleic fatty acids
  5.24 Smooth scatter plot for palmitic and stearic fatty acids
  5.25 The visualization of local Gaussian correlation and local distance correlation for palmitic and stearic fatty acids
  5.26 Smooth scatter plot for linoleic and linolenic fatty acids
  5.27 The visualization of local Gaussian correlation and local distance correlation for linoleic and linolenic fatty acids
LIST OF TABLES
  4.1 Coverage probabilities and average interval lengths of 90% confidence interval for $\mathcal{R}^2$
  4.2 Coverage probabilities and average interval lengths of 95% confidence interval for $\mathcal{R}^2$
  4.3 Coverage probabilities and average interval lengths of 99% confidence interval for $\mathcal{R}^2$
  4.4 Summary statistics of fatty acids
  4.5 The confidence intervals for bias-corrected distance correlation of bivariate variables of monounsaturated fats, saturated fats, and polyunsaturated fats
CHAPTER 1 INTRODUCTION
Correlation is a bivariate coefficient which measures the association or relationship between two random variables. The correlation coefficient is one of the most interesting topics in statistics, because statisticians have been developing different ways to quantify the relationship between variables and the properties of dependence measures. We can find a point estimate and calculate confidence intervals to estimate the population correlation. Point estimation is used to calculate a single value for estimating the population correlation coefficient, and a confidence interval is defined as a range of values that contains the population correlation coefficient. Moreover, a hypothesis test for the population correlation coefficient is used to evaluate two mutually exclusive statements about a population from sample data.

Pearson correlation is the most commonly used method to study the relationship between two random variables, but it fails to capture nonlinear dependence. For non-Gaussian random variables, the correlation coefficient can be close to zero even if the variables are dependent. Székely, Rizzo, and Bakirov (2007) introduced distance correlation, a nonparametric approach, which is a new measure for testing multivariate dependence between random vectors. Distance correlation is analogous to the product-moment Pearson correlation coefficient, but distance correlation is able to detect linear and nonlinear dependence structures. Distance correlation is defined as the normalized coefficient of distance covariance, where the sample distance covariance has both a U-statistic and a V-statistic representation.

One goal of this research is to construct the confidence interval for distance correlation of multivariate dependence between random vectors. Confidence intervals carry more information about the population correlation than the results of a hypothesis test, since a confidence interval provides a range of likely values of the population distance correlation coefficient. The bootstrap method for the construction of confidence intervals is the most widely used nonparametric approach, but the bootstrap may fail to give information about the population. The empirical likelihood method is also used to construct confidence intervals, but it fails to obtain a chi-square limit for nonlinear functions.

Jing, Yuan, and Zhou (2009) proposed the jackknife empirical likelihood method for a U-statistic, which benefits from simple optimization utilizing jackknife pseudo-samples. The jackknife estimator for the parameter of interest becomes the sample mean of the jackknife pseudo-samples, and the empirical likelihood method can be applied to the jackknife pseudo-sample mean. Székely and Rizzo (2014) considered the unbiased estimator of squared distance covariance, which is based on U-centering. A bias-corrected distance correlation is defined by normalizing the inner product statistic with the bias-corrected distance variance statistics. Since the unbiased distance covariance is a U-statistic, in this research we employ the jackknife empirical likelihood method to construct confidence intervals for the distance correlation. Moreover, we show that a Wilks' theorem holds for the jackknife empirical likelihood for distance correlation. The coverage probability and interval length are associated with confidence interval estimation.
We construct the jackknife empirical likelihood confidence intervals for the distance correlation with a Monte Carlo simulation that provides information about the accuracy of the coverage probability and the average interval length. The criteria for a good confidence interval estimator are a coverage probability close to the nominal level and a short interval length. In this research, we compare the jackknife empirical likelihood confidence intervals and standard normal bootstrap confidence intervals for the distance correlation by computing coverage probabilities and interval lengths.

Sometimes global measures of dependence cannot give enough information about the association between random variables, because a global measure describes the relation between variables over the whole study area. A method of local dependence measurement called local Gaussian correlation was considered by Tjøstheim and Hufthammer (2013). It was derived from a local correlation function using local likelihood, based on approximating a bivariate density locally by a family of bivariate Gaussian densities. At each point, the correlation coefficient of the approximating Gaussian distribution is taken as the local correlation. The local likelihood estimation is based on approximating a density function by a known parametric family. The bandwidth algorithm of Tjøstheim and Hufthammer is not really satisfactory in a general situation, but Berentsen and Tjøstheim (2014) introduced another method to choose the bandwidth based on the principle of likelihood cross-validation. An important application of local Gaussian correlation is the visualization of local dependence structures. Berentsen and Tjøstheim (2014) considered the global Gaussian correlation by aggregating local Gaussian correlation on subsets of $\mathbb{R}^2$ to get a global measure of dependence.

The focus of this research is to introduce a new method of estimation and visualization of a local distance correlation measure between two univariate random variables. A local distance correlation is able to capture the local dependence structure in a small region, which better describes the dependence structures. The approach of constructing local distance correlation by a local version of the jackknife empirical likelihood was extended from local Gaussian correlation. To estimate local distance correlation, we use kernel density functional estimation to construct the jackknife empirical likelihood locally. It is important to choose appropriate bandwidths for the kernel function, where the bandwidth is a window that determines how much of the data within the window is used to estimate each local estimator of distance correlation. We consider three common bandwidth selections for the kernel function to determine the appropriate one for the data. The properties of local distance correlation should remain the same as the properties of distance correlation. We have implemented a fast O(n log n) algorithm for the computation of bivariate distance correlation to build a fast visualization tool with local distance correlation that can be applied to very large data sets. We compare the local distance correlation with local Gaussian correlation in order to determine the performance of both methods on nonlinear data sets.
This dissertation is organized as follows. In Chapter 2, we discuss the most important developments in the history of the correlation coefficient for measuring bivariate and multivariate association, as well as provide details about local correlation. In Chapter 3, we provide an overview of distance correlation. In Chapter 4, we construct the confidence interval for distance correlation based on the jackknife empirical likelihood for a U-statistic and show that Wilks' theorem holds for distance correlation. In addition, we compare the performance of the confidence intervals for distance correlation from the jackknife empirical likelihood and the bootstrap method. In Chapter 5, we introduce the local distance correlation for bivariate cases and discuss the local Gaussian correlation. The properties of local distance correlation and the choice of bandwidth for the kernel function are discussed as well. We compare the local distance correlation with local Gaussian correlation in simulation studies and real-life examples. Chapter 6 presents the summary and future work of this research.
CHAPTER 2 LITERATURE REVIEW
2.1 Background on Dependence Coefficients
Correlation is one of the interesting topics in statistics related to scientific discovery and innovation, used to indicate the relation between two continuous variables. Galton (1888) first defined the concept of correlation in the following way: "Two variable organs are said to be co-related when the variation of the one is accompanied on the average by more or less variation of the other, and in the same direction. It is easy to see that co-relation must be the consequence of the variations of the two organs being partly due to common causes. If they were in no respect due to common causes, the co-relation would be nil." Galton (1890) developed his ideas of correlation to better understand the relationship between variables by collecting large data sets. Pearson (1896) used the term "Coefficient of Correlation" in the paper entitled "Mathematical Contributions to the Theory of Evolution. III. Regression, Heredity and Panmixia," where he considered Pearson's product-moment correlation, which is used to find the linear association between two variables and can only be used on quantitative variables. Pearson's product-moment correlation was developed after the initial mathematical formulae for correlation discovered by Bravais (1844). Spearman (1904) adapted Pearson's idea by substituting ranks for measurements in Pearson's product-moment formula. Spearman's rank correlation coefficient, known as Spearman's rho, is one of the oldest formulas based on ranks that measures the association between two variables. In the first part of Spearman's paper, he determined that a good method of correlation must meet some requirements, such as quantitative expression, the significance of the quantity, accuracy, and ease of application. He also considered the advantages and disadvantages of comparison by rank.
2.2 Bivariate Dependence Measure
Hotelling and Pabst (1936) published a paper on rank correlation which avoids the assumption of normality. They defined the most convenient formula for computing the rank correlation when Pearson's correlation is applied to ranked data. However, they concluded that the rank correlation coefficient is easier to compute for samples smaller than 40, while Pearson's correlation works conveniently for larger sample sizes. A new measure of rank correlation, used to measure association between two random variables, was defined by Kendall (1938). Kendall's statistic, known as Kendall's tau, is based on concordant and discordant pairs. It is mostly used in psychological experiments when the ranks are known. Hoeffding (1948b) introduced a test for independence between two data sets, called Hoeffding's D measure, by measuring the distance between the product of the marginal probability distributions and the joint distribution. Blomqvist (1950) considered a simpler dependence measure, based on the number of sample
data $n_1$ that belong to the first or third quadrant and the number $n_2$ that belong to the second or fourth quadrant. It is defined as
$$q' = \frac{n_1 - n_2}{n_1 + n_2} = \frac{2n_1}{n_1 + n_2} - 1, \qquad -1 \le q' \le 1.$$
This dependence measure is often called the medial correlation coefficient. The population version of the medial correlation coefficient for a pair of continuous random variables $X$ and $Y$ with medians $\tilde{x}$ and $\tilde{y}$, respectively, is given as
$$q = P\{(X - \tilde{x})(Y - \tilde{y}) > 0\} - P\{(X - \tilde{x})(Y - \tilde{y}) < 0\}.$$
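As a small illustration (our own, not part of Blomqvist's paper), the sample version of the quadrant statistic can be computed directly from the definition above; the function name medial_correlation is hypothetical.

```r
# Sketch: Blomqvist's medial correlation (quadrant) coefficient.
# Points lying exactly on a median line contribute to neither count.
medial_correlation <- function(x, y) {
  s <- (x - median(x)) * (y - median(y))
  n1 <- sum(s > 0)   # first or third quadrant
  n2 <- sum(s < 0)   # second or fourth quadrant
  (n1 - n2) / (n1 + n2)
}

set.seed(1)
x <- rnorm(100)
y <- x + rnorm(100)
medial_correlation(x, y)   # positive for positively associated data
```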
The asymptotic efficiency of a test based on the statistic $q$ is about 41 percent, and this test is similar to Fisher (1941) in the special case of the exact test of independence in the $2 \times 2$ table. Fisher's exact test was proposed in 1941 and is used when members of two independent groups can fall into one of two mutually exclusive categories, to determine whether the proportions of those falling into each category differ by group. An ordinal invariant measure of association for bivariate populations was discussed by Kruskal (1958). He focused on the probabilistic and operational interpretations of their population values and discussed the relationships between three measures: the quadrant measure of Blomqvist, Kendall's tau, and Spearman's rho. Blum, Kiefer, and Rosenblatt (1961) suggested a test of independence based on empirical distribution functions for the case when the dimension $d \ge 2$, via appropriate Cramér-von Mises statistics, significant for large values of
$$B_n = \int \Big( S_n(r) - \prod_{j=1}^{d} S_{nj}(r_j) \Big)^{2} \, dS_n(r), \tag{2.2.1}$$
where $S_n(r)$ is the sample distribution function of independent random $d$-variate vectors $X_1, \ldots, X_n$ with unknown distribution function $F$, and $S_{nj}$ is the marginal distribution function associated with the $j$th component of the $X_i$. They obtained the characteristic functions of the limiting distributions of a class of such test criteria and provided the table in the bivariate case, $d = 2$, for the corresponding distribution functions. Furthermore, the tests have asymptotic normal distributions, and they are equivalent to the test proposed by Hoeffding (1948b) when $d = 2$. Bhuchongkul (1964) considered the class of rank tests for independence and proposed a test statistic of the form
$$T_N = N^{-1} \sum_{i=1}^{N} E_{N,r_i}\, E'_{N,s_i}\, Z_{N,r_i}\, Z'_{N,s_i},$$
where
$$Z_{N,r_i} = \begin{cases} 1, & X_i \text{ is the } r_i\text{th smallest of the } X\text{'s}; \\ 0, & \text{otherwise}, \end{cases} \qquad Z'_{N,s_i} = \begin{cases} 1, & Y_i \text{ is the } s_i\text{th smallest of the } Y\text{'s}; \\ 0, & \text{otherwise}, \end{cases}$$
and $\{E_{N,r_i}\}$ and $\{E'_{N,s_i}\}$, $i = 1, \ldots, N$, are two sets of constants satisfying certain restrictions, taken as the expected values of the $r_i$th and $s_i$th standard normal order statistics from a sample of size $N$. He concluded that the normal scores test is the locally most powerful rank test and asymptotically as efficient as the parametric correlation coefficient test for specified alternatives when the underlying distributions are normal. If $E_{N,r_i} = r_i$ and $E'_{N,s_i} = s_i$, then the test statistic is equivalent to the Spearman rank correlation statistic.
2.3 Multivariate Dependence Measure
Wilks (1935) introduced the classic parametric test based on
$$W = \frac{|A|}{|A_{11}|\,|A_{22}|},$$
where $A_{11}$ is the $r \times r$ sample covariance matrix of the first sample, $A_{22}$ is the $s \times s$ sample covariance matrix of the second sample, and $A$ is the partitioned sample covariance matrix. The test is a likelihood ratio test, which is optimal under the multivariate normal model for testing $H_0$: $x_i^{(1)}$ and $x_i^{(2)}$ are independent, where $x_i^{(1)}$ and $x_i^{(2)}$ for $i = 1, \ldots, n$ are $r$-dimensional and $s$-dimensional continuous random vectors, respectively. Under $H_0$ with finite fourth moments, $-n \log W \xrightarrow{d} \chi^2_{rs}$, where $n$ is the sample size. A disadvantage of the likelihood ratio test is that it is not usable if the dimension is greater than the sample size or if the distributional assumption does not hold. Sinha and Wieand (1977) extended Bhuchongkul's rank statistics (1964) for testing multivariate independence. They showed that the test statistic can be expressed as a rank statistic which is easy to compute. The test statistic has an asymptotic normal distribution and can detect mutual dependence in alternatives which are pairwise independent.
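For concreteness, a minimal sketch of Wilks' $W$ statistic follows (our own illustration; the helper name wilks_W is hypothetical). Because the scaling constants cancel in the determinant ratio, the statistic can be computed directly from the partitioned sample covariance matrix.

```r
# Sketch: Wilks' likelihood ratio statistic W = |A| / (|A11| |A22|),
# assuming multivariate normal data; x is n x r and y is n x s.
wilks_W <- function(x, y) {
  A <- cov(cbind(x, y))             # partitioned sample covariance matrix
  r <- ncol(x)
  A11 <- A[1:r, 1:r, drop = FALSE]
  A22 <- A[-(1:r), -(1:r), drop = FALSE]
  det(A) / (det(A11) * det(A22))
}

set.seed(1)
x <- matrix(rnorm(100 * 2), ncol = 2)
y <- matrix(rnorm(100 * 3), ncol = 3)
-nrow(x) * log(wilks_W(x, y))       # ~ chi-squared with rs = 6 df under H0
```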
The nonparametric statistic $Q_n$ was introduced by Gieser and Randles (1997), based on interdirections, for testing whether two vector-valued quantities are independent. The interdirection count measures the angular distance between two observation vectors relative to the origin and the positions of the other observations. This statistic has an intuitive invariance property, but it reduces to the quadrant statistic when the two quantities are univariate. When each vector is elliptically symmetric, $Q_n$ has a limiting chi-squared distribution under the null hypothesis of independence. Gieser and Randles compared their test to Wilks' likelihood ratio criterion when the vectors have heavy-tailed elliptically symmetric distributions, and they gave an example where $Q_n$ is resistant to outliers. The $Q_n$ test is better than the componentwise quadrant statistic when the vectors are spherically symmetric. In addition, they showed that $Q_n$ performs better than the other tests for heavy-tailed distributions and is competitive for distributions with moderate tail weights. Extensions of the quadrant test of Blomqvist (1950) based on spatial signs, which are easier to compute for data in common dimensions, were introduced by Taskinen, Kankainen, and Oja (2003). Their test statistic is asymptotically equivalent to the interdirection test of Gieser and Randles (1997) when the vectors are elliptically symmetric, but it is easier to compute in practice. Taskinen, Oja, and Randles (2005) provided practical, robust alternatives to normal-theory methods and discussed a sequel to the multivariate extension of the quadrant test by Gieser and Randles (1997) as well as Taskinen, Kankainen, and Oja (2003). Taskinen, Oja, and Randles presented new multivariate extensions of Kendall's tau and Spearman's rho statistics using two different approaches. First, interdirection proportions are used to estimate the cosines of angles between centered observation vectors and differences of observation vectors. Second, covariances between affine-equivariant multivariate signs and ranks are used. If each vector is elliptically symmetric, then the test statistics arising from these two approaches appear to be asymptotically equivalent. Székely, Rizzo, and Bakirov (2007) introduced distance correlation, which is a unique nonparametric approach to measuring dependence between random variables; see Section 2.5 for more details.
2.4 Properties of Dependence Measure
Rényi (1959) introduced properties of measures of dependence, proposing seven axioms for a nonparametric measure of dependence for two random variables on a probability space.
1. $\delta(X,Y)$ is defined for any pair of random variables $X$ and $Y$, neither of them being constant with probability 1.
2. $\delta(X,Y) = \delta(Y,X)$.
3. $0 \le \delta(X,Y) \le 1$.
4. δ(X,Y ) = 0 if and only if X and Y are independent.
5. δ(X,Y ) = 1 if there exists a strict dependence between X and Y, that is either Y = f(X) or X = g(Y ), where f(X) and g(Y ) are Borel-measurable functions.
6. If $f(X)$ and $g(Y)$ are one-to-one Borel-measurable functions on $\mathbb{R}$, then
$$\delta(f(X), g(Y)) = \delta(X,Y).$$
7. If the joint probability distribution of $X$ and $Y$ is normal, then $\delta(X,Y) = |\rho(X,Y)|$, where $\rho(X,Y)$ is the correlation coefficient between $X$ and $Y$.
Rényi's fifth condition is not a strong condition, because a functional relationship is sufficient but not necessary, according to Li (2015). Rényi showed that the mean square contingency, correlation coefficient, and correlation ratio measures satisfy some of these properties, but the maximal correlation coefficient proposed by Gebelein (1941) satisfies all seven axioms of a dependence measure. The maximal correlation is defined as
$$\rho'(X,Y) = \sup_{f,g} \rho(f(X), g(Y)),$$
where $\rho$ is the Pearson product-moment correlation coefficient and $f, g$ are Borel-measurable functions. Another measure of dependence that satisfies all of Rényi's axioms is the information coefficient of correlation, introduced by Linfoot (1957), which is defined for two continuous random variables as
$$r = \sqrt{1 - \exp(-2I(X,Y))},$$
where $I(X,Y)$ is the mutual information for any pair of continuous random variables $X$ and $Y$, defined by Shannon (1948) as
$$I(X,Y) = \iint p(x,y) \log \frac{p(x,y)}{p(x)\,p(y)} \, dx \, dy,$$
where $p(x,y)$ is the joint density and $p(x)$ and $p(y)$ are the marginal densities of $X$ and $Y$, respectively. Móri and Székely (2019) proved that if a dependence measure is defined for bounded nonconstant real-valued random variables and is invariant with respect to one-to-one measurable transformations of the real line, then the dependence measure cannot be weakly continuous; that means that when the sample size increases, the empirical values of a dependence measure do not necessarily converge to the population value. They developed four axioms for dependence measures when $(X,Y) \in S$, where $S$ is a nonempty set of pairs of nondegenerate random variables $X$ and $Y$ taking values in Euclidean spaces or separable Hilbert spaces $H$, as follows:
1. δ(X,Y ) = 0 if and only if X and Y are independent.
2. δ(X,Y ) is invariant with respect to all similarity transformations of H; that is,
δ(f(X), g(Y )) = δ(X,Y ),
where f(X), g(Y ) are similarity transformations of H.
3. δ(X,Y ) = 1 if and only if Y = f(X) with probability 1, where f(X) is a similarity transformation of H.
4. $\delta(X,Y)$ is continuous; that is, if $(X_i, Y_i) \in S$, $i = 1, 2, \ldots$, are such that for some positive constant $C$ we have $E(|X_i|^2 + |Y_i|^2) \le C$, $i = 1, 2, \ldots$, and $(X_i, Y_i)$ converges in distribution to $(X,Y)$, then $\delta(X_i, Y_i) \to \delta(X,Y)$.
The first condition is equivalent to Rényi's fourth condition, but the main difference from Rényi's axioms is that one-to-one invariance is replaced by similarity invariance. Móri and Székely showed that maximal correlation satisfies the first three axioms; however, it cannot satisfy the fourth condition, because the maximal correlation coefficient is invariant with respect to all one-to-one Borel functions on the real line and therefore cannot be continuous.
2.5 Distance Correlation
The concept of distance correlation $\mathcal{R}$, introduced by Székely, Rizzo, and Bakirov (2007) as a powerful measure of dependence, measures the association of random vectors in arbitrary dimensions. Distance correlation is the standardized distance covariance, which is defined as a weighted $L_2$ distance between the joint characteristic function and the product of the marginal characteristic functions. In describing bivariate and multivariate associations, distance correlation has sufficient power to detect linear and nonlinear dependence structures. In the bivariate normal case, distance correlation is less than the absolute value of the Pearson product-moment correlation coefficient, and it coincides with the absolute value of the correlation in the Bernoulli case. Advantages of using distance correlation include that no distributional assumptions and no computation of the inverse of the covariance matrix are required. Furthermore, the corresponding statistic applies to random vectors of arbitrary, not necessarily equal, dimensions. Important properties of distance correlation are that the distance correlation coefficient takes values between 0 and 1, the coefficient equals zero if and only if $X$ and $Y$ are independent, and the coefficient is invariant under general invertible affine transformations. Aspiras-Paler (2015) verified how the properties of distance correlation relate to Rényi's axioms as follows:
1*. Distance correlation, $\mathcal{R}$, is defined for any pair of random variables $X$ and $Y$ with finite first moments, i.e., $E|X|_p < \infty$ and $E|Y|_q < \infty$.
5*. Székely, Rizzo, and Bakirov (2007) stated that if $\mathcal{R}(X,Y) = 1$, then there exists a vector $a$, a nonzero real number $b$, and an orthogonal matrix $R$, such that $Y = a + bXR$, for the data matrices $X$ and $Y$. Aspiras-Paler gave a counterexample in which $X$ has a standard normal distribution and $Y = X^3$. Then $\mathcal{R}(X,Y) < 1$ because $Y$ is not a linear transformation of $X$. Therefore, $\mathcal{R}(X,Y)$ can be less than 1 even when $Y$ is a function of $X$.
6*. If $X$ and $Y$ are standard normal and $Y = X$, then $\mathcal{R}(X,Y) = 1$. Let $f(X) := X$ and $g(Y) := Y^3$; then both $f$ and $g$ are one-to-one functions that map $\mathbb{R}$ onto itself. In the bivariate case, $\mathcal{R}(X,Y) = 1$ only if there is a linear relation $aX + bY = c$ between $X$ and $Y$ for constants $a, b, c \in \mathbb{R}$. Hence $\mathcal{R}(f(X), g(Y)) = \mathcal{R}(X,Y)$ does not hold in general.
7*. Székely, Rizzo, and Bakirov (2007) showed that if $p = q = 1$ with a Gaussian distribution, then $\mathcal{R} \le |\rho|$ and
$$\mathcal{R}^2(X,Y) = \frac{\rho \arcsin \rho + \sqrt{1 - \rho^2} - \rho \arcsin(\rho/2) - \sqrt{4 - \rho^2} + 1}{1 + \pi/3 - \sqrt{3}}. \tag{2.5.1}$$
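As a quick numerical check (our own illustration; the helper name dcor_gaussian is hypothetical), formula (2.5.1) can be evaluated directly, confirming that $\mathcal{R} \le |\rho|$ with equality at $\rho = \pm 1$:

```r
# Sketch: distance correlation of a bivariate normal pair as a function
# of Pearson's rho, from formula (2.5.1).
dcor_gaussian <- function(rho) {
  r2 <- (rho * asin(rho) + sqrt(1 - rho^2) -
         rho * asin(rho / 2) - sqrt(4 - rho^2) + 1) /
        (1 + pi / 3 - sqrt(3))
  sqrt(r2)                           # R, the nonnegative square root of R^2
}

rho <- c(0, 0.25, 0.5, 0.75, 1)
cbind(rho, R = dcor_gaussian(rho))   # R <= |rho|, with R = 1 at rho = 1
```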
Distance correlation formally satisfies all four axioms of Móri and Székely (2019) discussed in Section 2.4, because distance correlation is invariant under the transformations of its associated Euclidean group and has the continuity property. There are several statistical tests based on empirical characteristic functions, and the test of distance correlation is one of them. Feuerverger and Mureika (1977) considered the asymptotic behavior of the empirical characteristic function. Based on empirical characteristic functions, Csörgő (1985) studied mutual tests of independence, and Feuerverger (1993) proposed a rank test for bivariate dependence. There has been much recent work extending the distance correlation. Lyons (2013) extended the distance correlation to two random variables on separable Hilbert spaces of negative type. Zhu et al. (2017) proposed a test of independence based on random projection and distance correlation.
2.6 Local Correlation
Bjerve and Doksum (1993) proposed a local nonparametric dependence function to measure the association between two random variables $X$ and $Y$. The local linear correlation satisfies some desirable properties of correlation: it is between $-1$ and $1$, it is equal to $0$ when $X$ and $Y$ are independent, and it is invariant with respect to location and scale changes in $X$ and $Y$. It does not, however, characterize independence in the non-Gaussian case. Doksum et al. (1994) introduced estimates of the local correlation based on nearest-neighbor estimates of the residual variance function and the regression slope function. Jones (1996) defined a local dependence measure based on a nonparametric bivariate distribution and described the properties of the local dependence function, which satisfies some of Rényi's axioms (1959). Another new dependence measure for two real, not necessarily linearly related, random variables was proposed by Delicado and Smrekar (2009). They expressed their measure using principal curves and the covariance and linear correlation along the curve. Furthermore, they showed that the desirable properties for their measure also satisfy modified Rényi axioms. They modified three of Rényi's axioms as follows:
1*. δ(X,Y ) is defined for any pair of random variables (X,Y ) distributed along a curve.
5*. If $(X,Y)$ are two random variables distributed along a curve $c$ with generating variables $(S,T)$, then $\delta(X,Y) = 1$ if and only if there is a strict dependence between $(X,Y)$ and $S$, that is, $X = c_1(S)$ and $Y = c_2(S)$, or equivalently $T$ is identically $0$.
6*. If f(X) and g(Y ) are strictly monotone almost surely on ranges of X and Y, respectively, then δ{f(X), g(Y )} = δ(X,Y ).
In general, the above correlation coefficients are suitable for Gaussian data, but they fail to characterize independence for non-Gaussian cases. Tjøstheim and Hufthammer (2013) introduced estimation and visualization for a new local measure of dependence called local Gaussian correlation. It is derived from a local correlation function, based on approximating a bivariate density locally by a family of bivariate Gaussian densities using local likelihood. The local parameters and estimation using local likelihood were first described in Hjort and Jones (1996). Tjøstheim and Hufthammer (2013) included in their paper the limiting behavior of a bandwidth algorithm that aims to balance the variance of the estimated local parameters against the bias of the resulting density estimate, along with the estimation of standard errors. The bandwidth algorithm of Tjøstheim and Hufthammer is not really satisfactory in a general situation. Berentsen and Tjøstheim (2014) introduced another method to choose the bandwidth based on the principle of likelihood cross-validation. They constructed a global measure of dependence by aggregating the local correlations on $\mathbb{R}^2$ for linear and nonlinear dependence structures in bivariate data. An advantage of the local Gaussian correlation is that it can distinguish between negative and positive local dependence for bivariate variables. Berentsen and Tjøstheim (2014) also proposed an alternative method for copula goodness-of-fit testing for bivariate variables.
2.7 Multiscale Graph Correlation
Shen, Priebe, and Vogelstein (2018) considered a local distance correlation computed by utilizing $K$-nearest-neighbor graphs, which they named multiscale graph correlation. They gave the definition of the population multiscale graph correlation via the characteristic functions of the underlying random variables and $K$-nearest-neighbor graphs, and described properties of the multiscale graph correlation which are related to those of distance correlation.
Given $n$ pairs of sample data $(X_i, Y_i)$ that are i.i.d., that is, $(X_i, Y_i) \in \mathbb{R}^p \times \mathbb{R}^q$ for $i = 1, 2, \ldots, n$, Shen et al. computed $A_{ij}$ and $B_{ij}$ as column-centered matrices with diagonals excluded:
$$A_{ij} = \begin{cases} \tilde{A}_{ij} - \dfrac{1}{n-1}\sum\limits_{m=1}^{n} \tilde{A}_{mj}, & i \neq j; \\ 0, & i = j, \end{cases} \qquad B_{ij} = \begin{cases} \tilde{B}_{ij} - \dfrac{1}{n-1}\sum\limits_{m=1}^{n} \tilde{B}_{mj}, & i \neq j; \\ 0, & i = j, \end{cases}$$
where $\tilde{A}_{ij} = \|X_i - X_j\|$ and $\tilde{B}_{ij} = \|Y_i - Y_j\|$ for $i, j = 1, 2, \ldots, n$. The sample local distance covariance is defined by
$$\mathcal{V}^{kl}_n(X,Y) = \frac{1}{n(n-1)} \sum_{i \neq j} C^{kl}_{ij} - \frac{1}{n(n-1)} \sum_{i \neq j} A^{k}_{ij} \cdot \frac{1}{n(n-1)} \sum_{i \neq j} B^{l}_{ij},$$
where $A^k_{ij} = A_{ij}\, I(R^A_{ij} \le k)$ and $B^l_{ij}$ is defined similarly; $I(\cdot)$ is the indicator function and $R^A_{ij}$ is a rank function of $x_i$ relative to $x_j$, that is, $R^A_{ij} = k$ if $x_i$ is the $k$th nearest neighbor of $x_j$. $C^{kl}_{ij}$ is the joint distance matrix, $C^{kl}_{ij} = A^k_{ij} \times B^l_{ji}$. The sample local distance correlation is defined as the normalization of the local distance covariance, i.e.,
$$\mathcal{R}^{kl}_n(X,Y) = \frac{\mathcal{V}^{kl}_n(X,Y)}{\sqrt{\mathcal{V}^{k}_n(X,X)\, \mathcal{V}^{l}_n(Y,Y)}},$$
where $\mathcal{V}^k_n(X,X)$ and $\mathcal{V}^l_n(Y,Y)$ are the sample local distance variances for $X$ and $Y$, respectively. The multiscale graph correlation is defined as the maximum local correlation over the largest connected component $R$ of significant local correlations, $\mathcal{R}^{kl^*}_n(X,Y)$, where $kl^* = \arg\max_{kl \in R} S(\mathcal{R}^{kl}_n(X,Y))$ and $S(\cdot)$ is an operation that filters out all insignificant local correlations. This method will not be used in our research, because we introduce a new method to compute distance correlation locally based on the jackknife empirical likelihood.
CHAPTER 3 OVERVIEW OF DISTANCE CORRELATION
3.1 Distance Correlation
Distance correlation $\mathcal{R}$, introduced by Székely, Rizzo, and Bakirov (2007), is a new measure of dependence for measuring and testing the joint independence of random vectors $X$ and $Y$ in arbitrary dimensions. For all distributions with finite first moments, distance correlation $\mathcal{R}$ generalizes the idea of correlation in at least two fundamental ways:
1. R(X,Y ) is defined for X and Y in arbitrary dimension.
2. R(X,Y ) = 0 characterizes the independence of X and Y .
Distance correlation R has properties of a true dependence measure, analogous to product-moment correlation ρ, but it generalizes and extends classical measures of dependence. Empirical distance dependence measures are based on functions of Euclidean distances between sample elements instead of sample moments.
Definition 3.1.1. The distance covariance between random vectors $X$ and $Y$ with finite first moments is a nonnegative number, for $X$ and $Y$ taking values in $\mathbb{R}^p$ and $\mathbb{R}^q$, respectively. This coefficient is defined by a weighted $L_2$ norm measuring the distance between the joint characteristic function (c.f.) $\phi_{X,Y}$ of $X$ and $Y$ and the product $\phi_X \phi_Y$ of the marginal c.f.s of $X$ and $Y$. Distance covariance $\mathcal{V}(X,Y)$ is the nonnegative square root of
$$\mathcal{V}^2(X,Y) = \|\phi_{X,Y}(t,s) - \phi_X(t)\phi_Y(s)\|^2_w = \int_{\mathbb{R}^{p+q}} |\phi_{X,Y}(t,s) - \phi_X(t)\phi_Y(s)|^2 \, w(t,s) \, dt \, ds,$$
where $w(t,s) = \big(c_p c_q\, |t|_p^{1+p}\, |s|_q^{1+q}\big)^{-1}$, $c_d = \dfrac{\pi^{(1+d)/2}}{\Gamma\!\big(\frac{1+d}{2}\big)}$, and $\Gamma(\cdot)$ is the complete gamma function. Similarly, the distance variances are defined as the square roots of
$$\mathcal{V}^2(X) = \mathcal{V}^2(X,X) = \|\phi_{X,X}(t,s) - \phi_X(t)\phi_X(s)\|^2_w = \int_{\mathbb{R}^{2p}} |\phi_{X,X}(t,s) - \phi_X(t)\phi_X(s)|^2 \, w(t,s) \, dt \, ds,$$
$$\mathcal{V}^2(Y) = \mathcal{V}^2(Y,Y) = \|\phi_{Y,Y}(t,s) - \phi_Y(t)\phi_Y(s)\|^2_w = \int_{\mathbb{R}^{2q}} |\phi_{Y,Y}(t,s) - \phi_Y(t)\phi_Y(s)|^2 \, w(t,s) \, dt \, ds.$$
Definition 3.1.2. The distance correlation between random vectors $X$ and $Y$ with finite first moments is the nonnegative number $\mathcal{R}(X,Y)$ defined by
$$\mathcal{R}^2(X,Y) = \begin{cases} \dfrac{\mathcal{V}^2(X,Y)}{\sqrt{\mathcal{V}^2(X)\,\mathcal{V}^2(Y)}}, & \mathcal{V}^2(X)\mathcal{V}^2(Y) > 0; \\[1ex] 0, & \mathcal{V}^2(X)\mathcal{V}^2(Y) = 0. \end{cases}$$
Distance correlation satisfies 0 ≤ R ≤ 1, and R = 0 only if X and Y are independent. In the bivariate normal case, R is a function of ρ, and R(X,Y ) ≤ |ρ(X,Y )| with equality when ρ = ±1.
For an observed random sample $\{(x_i, y_i) : i = 1, \ldots, n\}$ from the joint distribution of random vectors $X$ and $Y$, $a_{ij} = \|X_i - X_j\|$ denotes the pairwise distances of the $X$ observations and $b_{ij} = \|Y_i - Y_j\|$ denotes the pairwise distances of the $Y$ observations, for $i, j = 1, \ldots, n$. The corresponding double-centered distance matrices $(A_{ij})_{i,j=1}^{n}$ and $(B_{ij})_{i,j=1}^{n}$ are defined by
$$A_{ij} = a_{ij} - \frac{1}{n}\sum_{l=1}^{n} a_{il} - \frac{1}{n}\sum_{k=1}^{n} a_{kj} + \frac{1}{n^2}\sum_{k,l=1}^{n} a_{kl}, \tag{3.1.3}$$
$$B_{ij} = b_{ij} - \frac{1}{n}\sum_{l=1}^{n} b_{il} - \frac{1}{n}\sum_{k=1}^{n} b_{kj} + \frac{1}{n^2}\sum_{k,l=1}^{n} b_{kl}. \tag{3.1.4}$$
Definition 3.1.5. The sample distance covariance $\mathcal{V}_n(X,Y)$ and sample distance correlation $\mathcal{R}_n(X,Y)$ are defined by
$$\mathcal{V}^2_n(X,Y) = \frac{1}{n^2}\sum_{i,j=1}^{n} A_{ij} B_{ij}, \tag{3.1.6}$$
and
$$\mathcal{R}^2_n(X,Y) = \begin{cases} \dfrac{\mathcal{V}^2_n(X,Y)}{\sqrt{\mathcal{V}^2_n(X)\,\mathcal{V}^2_n(Y)}}, & \mathcal{V}^2_n(X)\mathcal{V}^2_n(Y) > 0; \\[1ex] 0, & \mathcal{V}^2_n(X)\mathcal{V}^2_n(Y) = 0, \end{cases}$$
respectively, where the squared sample distance variances are
$$\mathcal{V}^2_n(X) = \mathcal{V}^2_n(X,X) = \frac{1}{n^2}\sum_{i,j=1}^{n} A^2_{ij}, \qquad \mathcal{V}^2_n(Y) = \mathcal{V}^2_n(Y,Y) = \frac{1}{n^2}\sum_{i,j=1}^{n} B^2_{ij}.$$
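A direct $O(n^2)$ implementation of Definition 3.1.5 is sketched below (our own illustration; the helper names double_center and dcor2_sample are hypothetical):

```r
# Sketch: sample distance correlation from first principles, using the
# double centering (3.1.3)-(3.1.4) and the V-statistic (3.1.6).
double_center <- function(d) {
  sweep(sweep(d, 1, rowMeans(d)), 2, colMeans(d)) + mean(d)
}

dcor2_sample <- function(x, y) {
  A <- double_center(as.matrix(dist(x)))
  B <- double_center(as.matrix(dist(y)))
  v_xy <- mean(A * B)            # V_n^2(X, Y)
  v_xx <- mean(A * A)            # V_n^2(X)
  v_yy <- mean(B * B)            # V_n^2(Y)
  if (v_xx * v_yy > 0) v_xy / sqrt(v_xx * v_yy) else 0
}

set.seed(1)
x <- rnorm(50)
y <- x^2 + rnorm(50, sd = 0.1)
dcor2_sample(x, y)               # equals energy::dcor(x, y)^2
```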
The squared sample distance covariance $\mathcal{V}^2_n(X,Y)$ equals zero if and only if every sample observation is identical. The distance covariance statistic is nonnegative; therefore, its expected value is positive except in degenerate cases. Hence, the distance covariance is biased for the population coefficient, and this bias increases with dimension. The sample distance correlation coefficient $\mathcal{R}^2_n$ is able to capture nonlinear associations because it is based on a characterization of independence. Moreover, it is sensitive to all types of departures from independence, including nonlinear or nonmonotone dependence structures, which can be detected in ever more complex associations. Since the coefficient $\mathcal{V}^2$ is defined in terms of the difference between the joint characteristic function and the marginal characteristic functions, $\mathcal{V}^2_n$ is defined by replacing the characteristic functions with empirical characteristic functions. The joint empirical characteristic function of the sample $(X_1,Y_1), \ldots, (X_n,Y_n)$, for $i = \sqrt{-1}$, $t \in \mathbb{R}^p$, $s \in \mathbb{R}^q$, is
$$\hat{\phi}_{X,Y}(t,s) = \frac{1}{n}\sum_{k=1}^{n} \exp\{i\langle t, X_k\rangle + i\langle s, Y_k\rangle\},$$
and the marginal empirical characteristic functions of the sample $X$ and sample $Y$ are
$$\hat{\phi}_X(t) = \frac{1}{n}\sum_{k=1}^{n} \exp\{i\langle t, X_k\rangle\}, \qquad \hat{\phi}_Y(s) = \frac{1}{n}\sum_{k=1}^{n} \exp\{i\langle s, Y_k\rangle\},$$
respectively. An empirical version of distance covariance can be defined as
$$\mathcal{V}^2_n(X,Y) = \|\hat{\phi}_{X,Y}(t,s) - \hat{\phi}_X(t)\hat{\phi}_Y(s)\|^2.$$
Székely and Rizzo (2009) proved that an equivalent definition for distance covariance is
$$\mathcal{V}^2(X,Y) = E[|X - X'||Y - Y'|] + E[|X - X'|]\,E[|Y - Y'|] - 2E[|X - X'||Y - Y''|], \tag{3.1.7}$$
where $X'$ is an independent copy of $X$; $Y', Y''$ are independent copies of $Y$; and $|X - X'|$ and $|Y - Y'|$ are Euclidean distances.
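Formula (3.1.7) lends itself to a simple Monte Carlo check (our own illustration, not part of the original text): drawing three i.i.d. copies of the pair and averaging the three terms approximates $\mathcal{V}^2(X,Y)$.

```r
# Sketch: Monte Carlo approximation of (3.1.7) for a dependent pair
# (X, Y) with Y = X + noise; (X', Y') and (X'', Y'') are i.i.d. copies.
set.seed(1)
m  <- 1e5
x  <- rnorm(m); y  <- x  + rnorm(m)
x1 <- rnorm(m); y1 <- x1 + rnorm(m)
x2 <- rnorm(m); y2 <- x2 + rnorm(m)

term1 <- mean(abs(x - x1) * abs(y - y1))    # E|X - X'||Y - Y'|
term2 <- mean(abs(x - x1)) * mean(abs(y - y1))
term3 <- mean(abs(x - x1) * abs(y - y2))    # E|X - X'||Y - Y''|
term1 + term2 - 2 * term3                   # approximates V^2(X, Y)
```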
Some of the properties of distance correlation and distance covariance include:
1. $\mathcal{V}_n(X,Y)$ and $\mathcal{R}_n(X,Y)$ converge almost surely to $\mathcal{V}(X,Y)$ and $\mathcal{R}(X,Y)$ as $n \to \infty$; that is,
$$\lim_{n\to\infty} \mathcal{V}_n(X,Y) = \mathcal{V}(X,Y), \qquad \lim_{n\to\infty} \mathcal{R}_n(X,Y) = \mathcal{R}(X,Y),$$
with probability 1.
2. $\mathcal{V}_n(X,Y) \ge 0$, and $\mathcal{V}_n(X) = 0$ if and only if every sample observation is identical.
3. $0 \le \mathcal{R}_n(X,Y) \le 1$.
4. If $\mathcal{R}_n(X,Y) = 1$, then there exists a vector $a$, a nonzero real number $b$, and an orthogonal matrix $R$, such that $Y = a + bXR$, for the data matrices $X$ and $Y$.
5. For the bivariate Gaussian distribution, $\mathcal{R} \le |\rho|$ and
$$\mathcal{R}^2(X,Y) = \frac{\rho \arcsin \rho + \sqrt{1 - \rho^2} - \rho \arcsin(\rho/2) - \sqrt{4 - \rho^2} + 1}{1 + \pi/3 - \sqrt{3}}. \tag{3.1.8}$$
A test of independence based on $n\mathcal{V}^2_n$ or $\dfrac{n\mathcal{V}^2_n}{S_2}$, where $S_2 = \dfrac{1}{n^2}\sum_{k,l=1}^{n} |X_k - X_l| \cdot \dfrac{1}{n^2}\sum_{k,l=1}^{n} |Y_k - Y_l|$, is able to test independence between the random vectors $X$ and $Y$. Under the independence hypothesis, the normalized test statistic $\dfrac{n\mathcal{V}^2_n}{S_2}$ converges in distribution to a quadratic form
$$Q = \sum_{j=1}^{\infty} \lambda_j Z_j^2, \tag{3.1.9}$$
where the $Z_j$ are independent standard normal random variables and the $\lambda_j$ are nonnegative constants that depend on the distribution of $(X,Y)$. The expected value of $Q$ is equal to 1. A test that rejects the null hypothesis of independence of $X$ and $Y$ when
$$\sqrt{\frac{n\mathcal{V}^2_n}{S_2}} \ge \Phi^{-1}\Big(1 - \frac{\alpha}{2}\Big)$$
has an asymptotic significance level of at most $\alpha$. If $E(|X|_p + |Y|_q) < \infty$, then:
1. If $X$ and $Y$ are independent, then $\dfrac{n\mathcal{V}^2_n}{S_2} \xrightarrow{d} Q$ as $n \to \infty$, where $Q$ is a nonnegative quadratic form of centered Gaussian random variables (3.1.9) and $E[Q] = 1$.
2. If $X$ and $Y$ are independent, then $n\mathcal{V}^2_n \xrightarrow{d} Q_1$ as $n \to \infty$, where $Q_1$ is a nonnegative quadratic form of centered Gaussian random variables and $E[Q_1] = E|X - X'|\,E|Y - Y'|$.
3. If $X$ and $Y$ are dependent, then $\dfrac{n\mathcal{V}^2_n}{S_2} \xrightarrow{P} \infty$ and $n\mathcal{V}^2_n \xrightarrow{P} \infty$ as $n \to \infty$.
Rizzo and Székely (2013) developed the R package energy, with the functions dcor and dcor.test to compute the distance correlation coefficient and to test the significance of distance correlation.
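A minimal usage sketch of these functions follows (assuming the energy package is installed; the simulated data are our own illustration):

```r
# Sketch: distance correlation detects a nonlinear dependence that
# Pearson correlation misses.
library(energy)
set.seed(1)
x <- rnorm(100)
y <- x^2 + rnorm(100, sd = 0.1)   # nonlinear, nonmonotone dependence
cor(x, y)                          # Pearson correlation is near zero
dcor(x, y)                         # distance correlation is clearly positive
dcor.test(x, y, R = 199)           # permutation test of independence
```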
3.2 Modified Distance Correlation
Székely and Rizzo (2013) considered a modified version of the squared distance covariance that results in a t-test of multivariate independence applicable in high dimension. The resulting t-test is unbiased for all significance levels and for every sample size greater than three. Under independence, a transformation of the distance correlation statistic converges to a Student t distribution as $p, q \to \infty$. The Student t-distributed statistic is easily interpretable for high-dimensional data.
A modified version of the statistic $\mathcal{V}^2_n(X,Y)$ avoids the problem of the bias in the sample distance correlation as $p, q \to \infty$. The modified versions $A^*_{i,j}$ of $A_{i,j}$ and $B^*_{i,j}$ of $B_{i,j}$ are defined as follows:
$$A^*_{ij} = \begin{cases} \dfrac{n}{n-1}\Big(A_{i,j} - \dfrac{a_{ij}}{n}\Big), & i \neq j; \\[1ex] \dfrac{n}{n-1}\,(\bar{a}_i - \bar{a}), & i = j, \end{cases} \qquad B^*_{ij} = \begin{cases} \dfrac{n}{n-1}\Big(B_{i,j} - \dfrac{b_{ij}}{n}\Big), & i \neq j; \\[1ex] \dfrac{n}{n-1}\,(\bar{b}_i - \bar{b}), & i = j. \end{cases}$$
Definition 3.2.1. The modified distance covariance statistic is
$$\mathcal{V}^*_n(X,Y) = \frac{U^*_n(X,Y)}{n(n-3)} = \frac{1}{n(n-3)}\left\{\sum_{i,j=1}^{n} A^*_{i,j}B^*_{i,j} - \frac{n}{n-2}\sum_{i=1}^{n} A^*_{i,i}B^*_{i,i}\right\}, \tag{3.2.2}$$
where
$$U^*_n(X,Y) = \sum_{i \neq j} A^*_{i,j}B^*_{i,j} - \frac{2}{n-2}\sum_{i=1}^{n} A^*_{i,i}B^*_{i,i}.$$
$\mathcal{V}^*_n(X,Y)$ is an unbiased estimator of the squared population distance covariance $\mathcal{V}^2(X,Y)$.
Definition 3.2.3. The modified distance correlation statistic is
$$\mathcal{R}^*_n(X,Y) = \begin{cases} \dfrac{\mathcal{V}^*_n(X,Y)}{\sqrt{\mathcal{V}^*_n(X)\,\mathcal{V}^*_n(Y)}}, & \mathcal{V}^*_n(X)\mathcal{V}^*_n(Y) > 0; \\[1ex] 0, & \mathcal{V}^*_n(X)\mathcal{V}^*_n(Y) = 0. \end{cases}$$
The original $\mathcal{R}_n$ statistic is between 0 and 1, but $\mathcal{R}^*_n$ can take on negative values. The $\mathcal{R}^*_n$ statistic converges stochastically to the square of the population distance correlation $\mathcal{R}^2$. The test statistic for independence in high dimension is
$$T_n = \sqrt{v - 1}\, \frac{\mathcal{R}^*_n}{\sqrt{1 - (\mathcal{R}^*_n)^2}},$$
where $v = \frac{n(n-3)}{2}$. As $p$ and $q$ tend to infinity, under the independence hypothesis, $T_n$ converges in distribution to a Student t distribution with $v - 1$ degrees of freedom. Székely and Rizzo (2013) also obtained an asymptotic Z-test of independence in high dimension. Under independence of $X$ and $Y$, if $X$ and $Y$ are i.i.d. with positive finite variance, the limit distribution of $(1 + \mathcal{R}^*_n(X,Y))/2$ is a symmetric beta distribution with shape parameter $(v - 1)/2$. In high dimension, the large-sample distribution of $\sqrt{v-1}\,\mathcal{R}^*_n(X,Y)$ is approximately standard normal. Rizzo and Székely developed the R package energy with the function dcort.test to implement the unbiased test for independence.
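A usage sketch of this test follows (the text cites the function as dcort.test; in recent releases of energy it is exported as dcorT.test, which we use here; the simulated data are our own illustration):

```r
# Sketch: the high-dimensional distance correlation t-test.
library(energy)
set.seed(1)
n <- 50; p <- 30; q <- 30
x <- matrix(rnorm(n * p), n, p)
y <- matrix(rnorm(n * q), n, q)   # independent of x
dcorT.test(x, y)                   # t-statistic with v - 1 df under H0
```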
3.3 Unbiased Distance Correlation
Székely and Rizzo (2014) considered the unbiased estimator of squared distance covariance $\mathcal{V}^2(X,Y)$, which is based on U-centering instead of the classical method of double centering. The U-centered distance covariance is the inner product in the Hilbert space $\mathcal{H}_n$ of U-centered distance matrices for samples of size $n$, and it is unbiased for the squared population distance covariance. The definition of distance covariance $\mathcal{V}^2_n(X,Y)$ uses double centering, with matrices $A_{ij}$ and $B_{ij}$ having the property that all rows and columns sum to zero. Another type of centering is U-centering, with matrices denoted by $\tilde{A}_{ij}$ and $\tilde{B}_{ij}$, which have the additional property that all expectations are zero; that is, $E[\tilde{A}_{ij}] = 0$ and $E[\tilde{B}_{ij}] = 0$ for all $i, j$. Let $\tilde{A}$ be a U-centered distance matrix. Then:
1. Rows and columns of $\tilde{A}$ sum to zero.
2. $\widetilde{(\tilde{A})} = \tilde{A}$; that is, if $B$ is the matrix obtained by U-centering an element $\tilde{A} \in \mathcal{H}_n$, then $B = \tilde{A}$.
3. $\tilde{A}$ is invariant to double centering. If $B$ is the matrix obtained by double centering the matrix $\tilde{A}$, then $B = \tilde{A}$.
4. If $c$ is a constant and $B$ denotes the matrix obtained by adding $c$ to the off-diagonal elements of $\tilde{A}$, then $\tilde{B} = \tilde{A}$.
Definition 3.3.1. Let $A = (a_{ij})$ and $B = (b_{ij})$ be symmetric, real-valued $n \times n$ matrices with zero diagonal, $n > 2$. Define the U-centered matrices $\tilde{A}$ and $\tilde{B}$ as follows. Let the $(i,j)$-th entries of $\tilde{A}$ and $\tilde{B}$ be
$$\tilde{A}_{ij} = \begin{cases} a_{ij} - \dfrac{1}{n-2}\sum\limits_{l=1}^{n} a_{il} - \dfrac{1}{n-2}\sum\limits_{k=1}^{n} a_{kj} + \dfrac{1}{(n-1)(n-2)}\sum\limits_{k,l=1}^{n} a_{kl}, & i \neq j; \\ 0, & i = j, \end{cases} \tag{3.3.2}$$
and
$$\tilde{B}_{ij} = \begin{cases} b_{ij} - \dfrac{1}{n-2}\sum\limits_{l=1}^{n} b_{il} - \dfrac{1}{n-2}\sum\limits_{k=1}^{n} b_{kj} + \dfrac{1}{(n-1)(n-2)}\sum\limits_{k,l=1}^{n} b_{kl}, & i \neq j; \\ 0, & i = j, \end{cases} \tag{3.3.3}$$
respectively.
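U-centering per Definition 3.3.1 is straightforward to implement; the sketch below is our own illustration (the helper name u_center is hypothetical):

```r
# Sketch: U-centering a distance matrix, per (3.3.2). Rows and columns
# of the result sum to zero, and the diagonal is zero by definition.
u_center <- function(d) {
  n <- nrow(d)
  out <- d - rowSums(d)[row(d)] / (n - 2) -
             colSums(d)[col(d)] / (n - 2) +
             sum(d) / ((n - 1) * (n - 2))
  diag(out) <- 0
  out
}

set.seed(1)
a <- as.matrix(dist(rnorm(10)))
At <- u_center(a)
round(rowSums(At), 12)   # all (numerically) zero
```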
Proposition 3.3.4. Let $(x_i, y_i)$, $i = 1, \ldots, n$, denote a sample of observations from the joint distribution $(X,Y)$ of random vectors $X$ and $Y$. Let $A = (a_{ij})$ be the Euclidean distance matrix of the sample $x_1, \ldots, x_n$ from the distribution of $X$, and $B = (b_{ij})$ be the Euclidean distance matrix of the sample $y_1, \ldots, y_n$ from the distribution of $Y$. Then, if $E(|X| + |Y|) < \infty$, for $n > 3$,
$$U^2_n(X,Y) = (\tilde{A}, \tilde{B}) = \frac{1}{n(n-3)}\sum_{i \neq j} \tilde{A}_{ij}\tilde{B}_{ij} \tag{3.3.5}$$
is an unbiased estimator of the squared population distance covariance $\mathcal{V}^2(X,Y)$.

Proof. We have to show that $U^2_n(X,Y)$ is an unbiased estimator of $\mathcal{V}^2(X,Y)$, based on the proof in Székely and Rizzo (2014). Since the population coefficient $\mathcal{V}^2(X,Y)$ in (3.1.7) is a linear combination of expected distances, we define the expected values
$$\alpha := E[a_{ij}] = E[|X - X'|], \qquad \beta := E[b_{ij}] = E[|Y - Y'|], \quad i \neq j,$$
$$\delta := E[a_{ij}b_{il}] = E[|X - X'||Y - Y''|], \quad i, j, l \text{ distinct},$$
$$\gamma := E[a_{ij}b_{ij}] = E[|X - X'||Y - Y'|], \quad i \neq j,$$
where $(X,Y), (X',Y'), (X'',Y'')$ are i.i.d. We can write the population coefficient $\mathcal{V}^2(X,Y)$ as a linear combination of $\alpha$, $\beta$, $\delta$, and $\gamma$:
$$\mathcal{V}^2(X,Y) = E[|X - X'||Y - Y'|] + E[|X - X'|]E[|Y - Y'|] - 2E[|X - X'||Y - Y''|] = \gamma + \alpha\beta - 2\delta.$$
Denote $\tilde{a}_{i.} := \frac{a_{i.}}{n-2}$, $\tilde{a}_{.j} := \frac{a_{.j}}{n-2}$, and $\tilde{a}_{..} := \frac{a_{..}}{(n-1)(n-2)}$, where $a_{i.} = \sum_{j=1}^{n} a_{ij}$, $a_{.j} = \sum_{i=1}^{n} a_{ij}$, and $a_{..} = \sum_{i,j=1}^{n} a_{ij}$. Define $\tilde{b}_{i.}$, $\tilde{b}_{.j}$, and $\tilde{b}_{..}$ similarly. We further have
$$\begin{aligned} n(n-3)\,U^2_n(X,Y) = \sum_{i \neq j} \big( & a_{ij}b_{ij} - a_{ij}\tilde{b}_{i.} - a_{ij}\tilde{b}_{.j} + a_{ij}\tilde{b}_{..} \\ & - \tilde{a}_{i.}b_{ij} + \tilde{a}_{i.}\tilde{b}_{i.} + \tilde{a}_{i.}\tilde{b}_{.j} - \tilde{a}_{i.}\tilde{b}_{..} \\ & - \tilde{a}_{.j}b_{ij} + \tilde{a}_{.j}\tilde{b}_{i.} + \tilde{a}_{.j}\tilde{b}_{.j} - \tilde{a}_{.j}\tilde{b}_{..} \\ & + \tilde{a}_{..}b_{ij} - \tilde{a}_{..}\tilde{b}_{i.} - \tilde{a}_{..}\tilde{b}_{.j} + \tilde{a}_{..}\tilde{b}_{..} \big) \end{aligned}$$
$$\begin{aligned} &= \sum_{i \neq j} a_{ij}b_{ij} - \sum_{i} a_{i.}\tilde{b}_{i.} - \sum_{j} a_{.j}\tilde{b}_{.j} + a_{..}\tilde{b}_{..} \\ &\quad - \sum_{i} \tilde{a}_{i.}b_{i.} + (n-1)\sum_{i} \tilde{a}_{i.}\tilde{b}_{i.} + \sum_{i \neq j} \tilde{a}_{i.}\tilde{b}_{.j} - (n-1)\sum_{i} \tilde{a}_{i.}\tilde{b}_{..} \\ &\quad - \sum_{j} \tilde{a}_{.j}b_{.j} + \sum_{i \neq j} \tilde{a}_{.j}\tilde{b}_{i.} + (n-1)\sum_{j} \tilde{a}_{.j}\tilde{b}_{.j} - (n-1)\sum_{j} \tilde{a}_{.j}\tilde{b}_{..} \\ &\quad + \tilde{a}_{..}b_{..} - (n-1)\sum_{i} \tilde{a}_{..}\tilde{b}_{i.} - (n-1)\sum_{j} \tilde{a}_{..}\tilde{b}_{.j} + n(n-1)\,\tilde{a}_{..}\tilde{b}_{..}. \end{aligned}$$
Let
$$T_1 = \sum_{i \neq j} a_{ij}b_{ij}, \qquad T_2 = a_{..}b_{..}, \qquad T_3 = \sum_{i} a_{i.}b_{i.}.$$
Then
$$\begin{aligned} n(n-3)\,U^2_n(X,Y) &= T_1 - \frac{T_3}{n-2} - \frac{T_3}{n-2} + \frac{T_2}{(n-1)(n-2)} \\ &\quad - \frac{T_3}{n-2} + \frac{(n-1)T_3}{(n-2)^2} + \frac{T_2 - T_3}{(n-2)^2} - \frac{T_2}{(n-2)^2} \\ &\quad - \frac{T_3}{n-2} + \frac{T_2 - T_3}{(n-2)^2} + \frac{(n-1)T_3}{(n-2)^2} - \frac{T_2}{(n-2)^2} \\ &\quad + \frac{T_2}{(n-1)(n-2)} - \frac{T_2}{(n-2)^2} - \frac{T_2}{(n-2)^2} + \frac{nT_2}{(n-1)(n-2)^2}. \end{aligned}$$
After simplification, we further have
$$n(n-3)\,U^2_n(X,Y) = T_1 + \frac{T_2}{(n-1)(n-2)} - \frac{2T_3}{n-2}. \tag{3.3.6}$$
It is obvious that $E[T_1] = n(n-1)\gamma$. When we expand the terms of $T_2$ and $T_3$ and combine terms that have equal expected values, we obtain
$$E[T_2] = n(n-1)\{(n-2)(n-3)\alpha\beta + 2\gamma + 4(n-2)\delta\},$$
and
$$E[T_3] = n(n-1)\{(n-2)\delta + \gamma\}.$$
Therefore,
$$\begin{aligned} E[U^2_n(X,Y)] &= \frac{1}{n(n-3)}\, E\!\left[T_1 + \frac{T_2}{(n-1)(n-2)} - \frac{2T_3}{n-2}\right] \\ &= \frac{1}{n(n-3)}\left\{\frac{n^3 - 5n^2 + 6n}{n-2}\,\gamma + n(n-3)\,\alpha\beta + (6n - 2n^2)\,\delta\right\} \\ &= \gamma + \alpha\beta - 2\delta = \mathcal{V}^2(X,Y). \end{aligned}$$

The statistic $U^2_n(X,Y)$ is an inner product in the Hilbert space $\mathcal{H}_n$ of U-centered distance matrices, and the corresponding inner product (3.3.5) defines an unbiased estimator of the squared distance covariance. Hence, $\tilde{A} = 0$ if and only if the $n$ sample observations are equally distant or at least $n - 1$ of the $n$ sample observations are identical. A bias-corrected $\mathcal{R}^2_n(X,Y)$ is defined by normalizing the inner product statistic with the bias-corrected distance variance statistics:
$$\mathcal{R}^{**}_n(X,Y) = \begin{cases} \dfrac{U^2_n(X,Y)}{\sqrt{U^2_n(X)\,U^2_n(Y)}}, & U^2_n(X)\,U^2_n(Y) > 0; \\[1ex] 0, & U^2_n(X)\,U^2_n(Y) = 0, \end{cases} \tag{3.3.7}$$
where $U^2_n(X,Y)$, defined in (3.3.5), is the unbiased estimator of the distance covariance of $X$ and $Y$, and the unbiased squared sample distance variances of $X$ and $Y$, respectively, are
$$U^2_n(X) = (\tilde{A}, \tilde{A}) = \frac{1}{n(n-3)}\sum_{i \neq j} \tilde{A}_{ij}\tilde{A}_{ij}, \qquad U^2_n(Y) = (\tilde{B}, \tilde{B}) = \frac{1}{n(n-3)}\sum_{i \neq j} \tilde{B}_{ij}\tilde{B}_{ij}. \tag{3.3.8}$$
$\mathcal{R}^{**}_n$ can take negative values; hence, we cannot define the bias-corrected $\mathcal{R}^{**}_n(X,Y)$ statistic to be the square root of it. Notice that the original distance covariance, which is defined in (3.1.6), is a V-statistic, and its unbiased versions are U-statistics. The bias-corrected distance correlation statistic $\mathcal{R}^{**}_n$ and the unbiased estimator of distance covariance $U^2_n$ are implemented in the R energy package by the bcdcor and dcovU functions.
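A short usage sketch of these two functions (assuming the energy package is installed; the data are our own illustration):

```r
# Sketch: bias-corrected distance correlation and unbiased squared
# distance covariance from the energy package.
library(energy)
set.seed(1)
x <- rnorm(100)
y <- x + rnorm(100)
dcovU(x, y)    # unbiased estimator U_n^2(X, Y) of V^2(X, Y)
bcdcor(x, y)   # bias-corrected squared distance correlation R_n**
```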
The computation of $U^2_n$ and $\mathcal{R}^{**}_n$ can be implemented directly from the definitions, but the time complexity is then $O(n^2)$, growing as a constant times $n^2$ for sample size $n$. A fast formula for a biased estimator of $\mathcal{V}^2(X,Y)$ can be derived by combining the double-centered distance matrices $A_{ij}$, defined in (3.1.3), and $B_{ij}$, defined in (3.1.4). After simplification, the corresponding V-statistic is
$$\mathcal{V}^2_n(X,Y) = \frac{1}{n^2}\sum_{i,j=1}^{n} a_{ij}b_{ij} - \frac{2}{n^3}\sum_{i=1}^{n} a_{i.}b_{i.} + \frac{a_{..}b_{..}}{n^4}, \tag{3.3.9}$$
where the row $i$ sum, column $j$ sum, and grand sum of the distance matrices $(a_{ij})$ and $(b_{ij})$ are defined as
$$a_{i.} = \sum_{l=1}^{n} a_{il}, \qquad a_{.j} = \sum_{k=1}^{n} a_{kj}, \qquad b_{i.} = \sum_{l=1}^{n} b_{il}, \qquad b_{.j} = \sum_{k=1}^{n} b_{kj},$$
$$a_{..} = \sum_{k,l=1}^{n} a_{kl}, \qquad b_{..} = \sum_{k,l=1}^{n} b_{kl}.$$
In addition, a faster computing formula for an unbiased estimator of $\mathcal{V}^2(X,Y)$ can be derived by combining the U-centered matrices $\tilde{A}_{ij}$ and $\tilde{B}_{ij}$, defined in Definition 3.3.1, and can be simplified to
$$U^2_n(X,Y) = \frac{1}{n(n-3)}\sum_{i \neq j} a_{ij}b_{ij} - \frac{2}{n(n-2)(n-3)}\sum_{i=1}^{n} a_{i.}b_{i.} + \frac{a_{..}b_{..}}{n(n-1)(n-2)(n-3)}. \tag{3.3.10}$$
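The three terms of (3.3.10) can be checked directly with a naive $O(n^2)$ evaluation (our own sketch; the helper name dcovU_terms is hypothetical), before the $O(n \log n)$ approach described below avoids forming distance matrices altogether:

```r
# Sketch: the unbiased estimator (3.3.10) computed term by term.
dcovU_terms <- function(x, y) {
  n <- length(x)
  a <- as.matrix(dist(x)); b <- as.matrix(dist(y))
  t1 <- sum(a * b)                    # sum_{i != j} a_ij b_ij (diagonal is 0)
  t2 <- sum(rowSums(a) * rowSums(b))  # sum_i a_i. b_i.
  t3 <- sum(a) * sum(b)               # a.. b..
  t1 / (n * (n - 3)) -
    2 * t2 / (n * (n - 2) * (n - 3)) +
    t3 / (n * (n - 1) * (n - 2) * (n - 3))
}

set.seed(1)
x <- rnorm(100); y <- x + rnorm(100)
dcovU_terms(x, y)    # agrees with energy::dcovU(x, y)
```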
Huo and Székely (2016) showed that $U^2_n$ is a U-statistic. Since the fast formula for $U^2_n(X,Y)$ is a linear combination, we have to compute the three terms $\sum_{i \neq j} a_{ij}b_{ij}$, $\sum_{i=1}^{n} a_{i.}b_{i.}$, and $a_{..}b_{..}$.

In order to compute $a_{i.}$, we have the following:
$$a_{i.} = \sum_{l=1}^{n} a_{il} = \sum_{l=1}^{n} |X_i - X_l| = \sum_{X_l \le X_i} (X_i - X_l) + \sum_{X_i \le X_l} (X_l - X_i),$$
where $\sum_{X_l \le X_i} X_l$ is a partial sum and $r_i$ is the rank of $X_i$. We compute $b_{i.}$ similarly. Therefore, the second term is obtained by taking the summation of $a_{i.}b_{i.}$. It is clear that $a_{..} = \sum_{i=1}^{n} a_{i.}$ and $b_{..} = \sum_{i=1}^{n} b_{i.}$; hence, the third term is $a_{..}b_{..}$. As we see above, the computation of the second and third terms is easy and fast. Computing the first term in (3.3.10) is not as straightforward, but the dyadic approach can be applied using a binary search algorithm. The first term follows as
$$\sum_{i \neq j} a_{ij}b_{ij} = \sum_{i \neq j} |X_i - X_j||Y_i - Y_j| = \sum_{i \neq j} \big(X_iY_i - X_iY_j - X_jY_i + X_jY_j\big)\,\gamma_{ij}, \tag{3.3.12}$$
where $\gamma_{ij}$ is a sign function: for all $1 \le i, j \le n$,
$$\gamma_{ij} = \begin{cases} +1, & \text{if } (X_i - X_j)(Y_i - Y_j) > 0; \\ -1, & \text{otherwise.} \end{cases} \tag{3.3.13}$$
We have implemented a fast $O(n \log n)$ algorithm for the computation of sample distance covariances in the bivariate case for both versions, the U-statistic and the V-statistic, and the function dcov2d in the energy package in R can be applied to very large datasets. This implementation is fast because it does not store the distance matrices. Since the sample distance correlation is a normalized version of the sample distance covariance, the functions dcor2d and dcov2d in the energy package in R compute either the unbiased sample distance covariance, which is a U-statistic, or the original sample distance covariance, which is a V-statistic.

CHAPTER 4 CONFIDENCE INTERVAL FOR DISTANCE CORRELATION

Distance correlation measures both linear and nonlinear dependence between random vectors. Székely, Rizzo, and Bakirov (2007) proposed the sample distance correlation, which is the point estimate of the population distance correlation coefficient $\mathcal{R}^2$, and the distance correlation test of the hypothesis of independence. However, an alternative way to estimate distance correlation is to calculate a confidence interval, which is a range of values that is likely to contain the population distance correlation $\mathcal{R}^2$. Let $(X_1,Y_1), \ldots, (X_n,Y_n)$ be drawn i.i.d. from the joint distribution $F_{XY}$. Then we compute an interval that likely contains the true value of the distance correlation $\mathcal{R}^2$. If $I(X,Y)$ is a confidence interval for $\mathcal{R}^2$ with confidence level $\alpha$, then
$$P\big(\mathcal{R}^2 \in I((X_1,Y_1), \ldots, (X_n,Y_n))\big) = \alpha.$$
The bootstrap confidence interval is the most common nonparametric method used for interval estimation. Bootstrap methods, first introduced by Efron (1979), apply a resampling method that can approximate the distribution of a population. However, the bootstrap confidence interval may fail to include the true parameter value in some cases, according to Carpenter and Bithell (2000). The empirical likelihood method introduced by Owen (1990) is also used to construct confidence intervals. This method is widely used in statistics with linear functionals; however, the method of Lagrange multipliers for optimization cannot be solved easily for a nonlinear functional. Therefore, a direct application of empirical likelihood to calculate the confidence interval fails to obtain a chi-square limit. We are not able to use this method, since distance correlation is a nonlinear statistical function. The jackknife empirical likelihood method proposed by Jing, Yuan, and Zhou (2009) can be applied to a nonlinear U-statistic. This method is a modified version of the empirical likelihood method and is simple to use. In this chapter, we introduce a confidence interval for distance correlation based on the jackknife empirical likelihood method.
To date, no research has been found on the construction of confidence intervals for distance correlation.

4.1 Confidence Intervals for Distance Correlation

The distance covariance $U^2_n$ defined in Proposition 3.3.4 is an unbiased estimator of the squared population distance covariance $\mathcal{V}^2$. Proposition 3.3.4 was proven by Székely and Rizzo (2014), where the expected value of $U^2_n$ is equal to the squared population distance covariance $\mathcal{V}^2$. We also presented the proof of Proposition 3.3.4 in Chapter 3, because we focus on an unbiased estimator of the sample distance covariance. First, we show that the squared distance covariance $U^2_n$ can be represented as a U-statistic, because the distance correlation $\mathcal{R}^{**}_n$, defined in (3.3.7), is the standardized version of $U^2_n$. Second, we construct the jackknife pseudo-samples defined by Quenouille (1956) for the distance correlation $\mathcal{R}^{**}_n$ as a sample of asymptotically independent observations; the jackknife estimator for the distance correlation becomes the sample mean of the jackknife pseudo-samples. Then the empirical likelihood method can be applied to construct a confidence interval for the mean of the jackknife pseudo-samples of the distance correlation.

4.1.1 U-statistic Results

A U-statistic is an alternative way to construct an unbiased estimator. The basic theory of U-statistics was introduced by Hoeffding (1948a). We adapt a definition of a U-statistic as stated in Serfling (2009, Ch. 5) for distance covariance. Let a sample $(X_1,Y_1), \ldots, (X_n,Y_n)$ be i.i.d. from $F_{XY}$. The U-statistic for estimation of $\mathcal{V}^2$ on the basis of the sample $(X_1,Y_1), \ldots, (X_n,Y_n)$ is obtained by averaging a symmetric kernel $h$ over the observations of subsamples of size 4, which is the smallest sample size for estimating $U^2_n$. That is,
$$U^2_n(X,Y) = \binom{n}{4}^{-1} \sum_{1 \le k_1 < \cdots < k_4 \le n} h\big((X_{k_1},Y_{k_1}), \ldots, (X_{k_4},Y_{k_4})\big), \tag{4.1.1}$$
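A generic sketch of the averaging in (4.1.1) is shown below (our own illustration; h stands for any symmetric kernel of four sample points, and the $O(n^4)$ enumeration is for small n only):

```r
# Sketch: a U-statistic of order 4, averaging a symmetric kernel h over
# all 4-subsets of the sample rows of xy (an n x 2 matrix of pairs).
u_stat4 <- function(xy, h) {
  idx <- combn(nrow(xy), 4)                     # all subsets of size 4
  mean(apply(idx, 2, function(k) h(xy[k, , drop = FALSE])))
}

# Example with a toy symmetric kernel (hypothetical, for illustration):
# the average squared deviation of the four x-values from their mean.
h_demo <- function(s) mean((s[, 1] - mean(s[, 1]))^2)
set.seed(1)
xy <- cbind(rnorm(12), rnorm(12))
u_stat4(xy, h_demo)
```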