LOCAL DISTANCE CORRELATION: AN EXTENSION OF LOCAL GAUSSIAN CORRELATION

Walaa Hamdi

A Dissertation

Submitted to the Graduate College of Bowling Green State University in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

August 2020

Committee:

Maria Rizzo, Advisor

Jari Willing, Graduate Faculty Representative

Wei Ning

Junfeng Shang

Copyright © August 2020 Walaa Hamdi. All rights reserved.

ABSTRACT

Maria Rizzo, Advisor

Distance correlation is a measure of the relationship between random vectors in arbitrary dimension. The sample distance covariance can be formulated as either an unbiased estimator or a biased estimator, and distance correlation is defined as the normalized coefficient of distance covariance. The jackknife empirical likelihood for a U-statistic by Jing, Yuan, and Zhou (2009) can be applied to distance correlation, since the standard empirical likelihood method fails for nonlinear functions. A Wilks' theorem for the jackknife empirical likelihood is shown to hold for distance correlation. This research shows how to construct a confidence interval for distance correlation based on the jackknife empirical likelihood for a U-statistic, where the sample distance covariance can be represented as a U-statistic. In comparing coverage probabilities of confidence intervals for distance correlation based on the jackknife empirical likelihood and the bootstrap method, the jackknife empirical likelihood intervals show more accurate coverage.

We propose the estimation and visualization of local distance correlation by using a local version of the jackknife empirical likelihood. Kernel density functional estimation is used to construct the jackknife empirical likelihood locally. The bandwidth for the kernel function is selected to minimize the distance between the true density and the estimated density. Local distance correlation has the property that it equals zero in the neighborhood of each point if and only if the two variables are independent in that neighborhood. The estimation and visualization of local distance correlation are shown to capture local dependence accurately when compared with the local Gaussian correlation in simulation studies and real data examples.

My thanks to my Mom Nawal and Dad Ahmed; my husband Motaz, for his support; and my daughters Rafif, Joanna, and Taleen.

ACKNOWLEDGMENTS

I would like to thank my advisor Dr. Rizzo for her precious advice and guidance to complete this dissertation. I appreciate her way of giving me different viewpoints and suggestions. I also appreciate her time to meet with me to discuss the results in more detail. I am honored that Dr. Rizzo is my dissertation advisor. I express my sincere thanks to committee members Dr. Junfeng Shang, Dr. Wei Ning, and Dr. Jari Willing, who supported me until I completed my degree. I also express my appreciation to all my professors in the Department of Mathematics and Statistics for their help and guidance. I am especially thankful to the Graduate Coordinator Dr. Craig Zirbel for his support of graduate students. I cannot express enough thanks to my husband Motaz and my parents for encouraging me to complete my degree with their best wishes. I will always remember my husband Motaz supporting me and always being by my side. I love my daughters and they always make me happy. I extend my gratitude to all of my family and my husband's family who are directly or indirectly supporting me to complete my degree. Finally, I am glad to be a graduate student at Bowling Green State University.

TABLE OF CONTENTS

CHAPTER 1 INTRODUCTION ...... 1

CHAPTER 2 LITERATURE REVIEW ...... 5
2.1 Background on Dependence Coefficients ...... 5
2.2 Bivariate Dependence Measure ...... 5
2.3 Multivariate Dependence Measure ...... 8
2.4 Properties of Dependence Measure ...... 9
2.5 Distance Correlation ...... 12
2.6 Local Correlation ...... 13
2.7 Multiscale Graph Correlation ...... 15

CHAPTER 3 OVERVIEW OF DISTANCE CORRELATION ...... 17
3.1 Distance Correlation ...... 17
3.2 Modified Distance Correlation ...... 22
3.3 Unbiased Distance Correlation ...... 23

CHAPTER 4 CONFIDENCE INTERVAL FOR DISTANCE CORRELATION ...... 31
4.1 Confidence Intervals for Distance Correlation ...... 32
4.1.1 U-statistic Results ...... 32
4.1.2 Jackknife Empirical Likelihood for Distance Correlation ...... 39
4.2 Simulation Study ...... 47
4.3 Real Examples ...... 51

CHAPTER 5 LOCAL DISTANCE CORRELATION ...... 55
5.1 Local Gaussian Correlation ...... 55
5.1.1 Estimation of Local Gaussian Correlation ...... 56
5.1.2 Choice of Bandwidth for Kernel Function ...... 57
5.1.3 Properties of Local Gaussian Correlation ...... 58
5.1.4 Global Gaussian Correlation ...... 59
5.2 Local Distance Correlation ...... 61
5.2.1 Estimation of Local Distance Correlation ...... 61
5.2.2 Choice of Bandwidth for Kernel Function ...... 67
5.2.3 Properties of Local Distance Correlation ...... 76
5.3 Simulation Study ...... 82
5.4 Real Examples ...... 91
5.4.1 Example 1: Aircraft ...... 91
5.4.2 Example 2: Wage ...... 92
5.4.3 Example 3: PRIM7 ...... 95
5.4.4 Example 4: Olive Oils ...... 97

CHAPTER 6 SUMMARY AND FUTURE WORK ...... 104

BIBLIOGRAPHY ...... 107

APPENDIX A SELECTED R PROGRAMS ...... 113

LIST OF FIGURES

4.1 Scatterplot matrix of pairwise association of six fatty acids ...... 53

5.1 Contour plots of the true density and kernel estimate functions ...... 75
5.2 Scatter plot of X and Y ...... 77
5.3 Illustration of exchange symmetry ...... 78
5.4 Illustration of reflection symmetry ...... 79
5.5 Illustration of radial symmetry ...... 80
5.6 Scatter plots when rotated for 90° and 180° ...... 81
5.7 Illustration of rotation symmetry ...... 82
5.8 Scatter plots of different bivariate dependence structures ...... 83
5.9 Contour plots of different bivariate dependence structures ...... 84
5.10 The visualization of local Gaussian correlation and local distance correlation ...... 85
5.11 The visualization of local Gaussian correlation and local distance correlation ...... 86
5.12 The visualization of local Gaussian correlation and local distance correlation ...... 87
5.13 The visualization of local Gaussian correlation and local distance correlation ...... 88
5.14 The visualization of local Gaussian correlation and local distance correlation ...... 89
5.15 The visualization of local Gaussian correlation and local distance correlation ...... 90
5.16 Scatter and contour plots for aircraft dataset ...... 92
5.17 The visualization of local Gaussian correlation and local distance correlation for aircraft dataset ...... 93
5.18 Smooth scatter plot for Wage dataset ...... 94
5.19 The visualization of local Gaussian correlation and local distance correlation for Wage dataset ...... 95
5.20 Scatter and smooth scatter plots for PRIM7 dataset ...... 96
5.21 The visualization of local Gaussian correlation and local distance correlation for PRIM7 dataset ...... 97
5.22 Smooth scatter plot for oleic and palmitoleic fatty acids ...... 98
5.23 The visualization of local Gaussian correlation and local distance correlation for oleic and palmitoleic fatty acids ...... 99
5.24 Smooth scatter plot for palmitic and stearic fatty acids ...... 100
5.25 The visualization of local Gaussian correlation and local distance correlation for palmitic and stearic fatty acids ...... 101
5.26 Smooth scatter plot for linoleic and linolenic fatty acids ...... 102
5.27 The visualization of local Gaussian correlation and local distance correlation for linoleic and linolenic fatty acids ...... 103

LIST OF TABLES

4.1 Coverage probabilities and average interval lengths of 90% confidence interval for R2 ...... 50
4.2 Coverage probabilities and average interval lengths of 95% confidence interval for R2 ...... 50
4.3 Coverage probabilities and average interval lengths of 99% confidence interval for R2 ...... 51
4.4 of fatty acids ...... 52
4.5 The confidence intervals for bias-corrected distance correlation of bivariate variables of monounsaturated fats, saturated fats, and polyunsaturated fats ...... 54

CHAPTER 1 INTRODUCTION

Correlation is a bivariate coefficient which measures the association or relationship between two random variables. The correlation coefficient is one of the interesting topics in statistics, because statisticians have been developing different ways to quantify the relationship between variables and the properties of dependence measures. We can find a point estimate and calculate confidence intervals to estimate the population correlation. Point estimation is used to calculate a single value for estimating the population correlation coefficient, and a confidence interval is defined as a range of values that contains the population correlation coefficient. Moreover, a hypothesis test for the population correlation coefficient is used to evaluate two mutually exclusive statements about a population from sample data. Pearson correlation is the most commonly used method to study the relationship between two random variables, but it fails to capture nonlinear dependence. For non-Gaussian random variables, the correlation coefficient can be close to zero even if the variables are dependent. Székely, Rizzo, and Bakirov (2007) introduced distance correlation, a nonparametric approach, which is a new measure for testing multivariate dependence between random vectors. Distance correlation is analogous to the Pearson product-moment correlation coefficient, but distance correlation is able to detect both linear and nonlinear dependence structures. The distance correlation is defined as the normalized coefficient of distance covariance, where the sample distance covariance has both a U-statistic and a V-statistic representation.

One goal of this research is to construct the confidence interval for distance correlation of multivariate dependence between random vectors. Confidence intervals carry more information about the population correlation than the results of a hypothesis test, since the confidence interval provides a range of likely values of the population distance correlation coefficient. The bootstrap method is the most widely used nonparametric approach for the construction of confidence intervals, but the bootstrap may fail to give information about the population. The empirical likelihood method is also used to construct confidence intervals, but it fails to obtain a chi-square limit for nonlinear functions.

Jing, Yuan, and Zhou (2009) proposed the jackknife empirical likelihood method for a U-statistic that benefits from simple optimization utilizing jackknife pseudo-samples. The jackknife estimator for the parameter of interest becomes the sample mean of the jackknife pseudo-samples, and the empirical likelihood method can then be applied to the jackknife pseudo-sample mean. Székely and Rizzo (2014) considered the unbiased estimator of squared distance covariance, which is based on U-centering. A bias-corrected distance correlation is defined by normalizing the inner product statistic with the bias-corrected distance statistics. Since the unbiased distance covariance is a U-statistic, in this research we employ the jackknife empirical likelihood method to construct confidence intervals for the distance correlation. Moreover, we show that a Wilks' theorem holds for the jackknife empirical likelihood for distance correlation. The coverage probability and interval length are associated with confidence interval estimation.
We construct the jackknife empirical likelihood confidence intervals for the distance correlation with a Monte Carlo simulation that provides information about the accuracy of the coverage probability and the average interval length. The criteria for a good confidence interval estimator are a coverage probability close to the nominal level and a short interval length. In this dissertation, we compare the jackknife empirical likelihood confidence intervals and the standard normal bootstrap confidence intervals for the distance correlation by computing coverage probabilities and interval lengths.

Sometimes global measures of dependence cannot give enough information about the association between random variables, because a global measure describes the relation between the variables over the whole study area. A method of local dependence measure called local Gaussian correlation was considered by Tjøstheim and Hufthammer (2013). It was derived from a local correlation function using local likelihood, based on approximating a bivariate density locally by a family of bivariate Gaussian densities. At each point, the correlation coefficient of the approximating Gaussian distribution is taken as the local correlation. The local likelihood estimation is based on approximating a density function by a known parametric family. The bandwidth algorithm of Tjøstheim and Hufthammer is not really satisfactory in a general situation, but Berentsen and Tjøstheim (2014) introduced another method to choose the bandwidth based on the principle of likelihood cross-validation. An important application of local Gaussian correlation is the visualization of local dependence structures. Berentsen and Tjøstheim (2014) considered the global Gaussian correlation by aggregating local Gaussian correlation on subsets of R2 to obtain a global measure of dependence.

The focus of this research is to introduce a new method of estimation and visualization of a local distance correlation measure between two univariate random variables. A local distance correlation is able to capture the local dependence structure in a small region, which better describes the dependence structures. The approach of constructing local distance correlation by a local version of the jackknife empirical likelihood is extended from local Gaussian correlation. To estimate local distance correlation, we use kernel density functional estimation to construct the jackknife empirical likelihood locally. It is important to choose appropriate bandwidths for the kernel function, where the bandwidth is a window that determines how much of the data within this window is used to estimate each local estimator of distance correlation. We consider three common bandwidth selections for the kernel function to determine the appropriate one for the data. The properties of local distance correlation should remain the same as the properties of distance correlation. We have implemented a fast O(n log n) algorithm for the computation of bivariate distance correlation to build a fast visualization tool with local distance correlation that can be applied to very large data sets. We compare the local distance correlation with local Gaussian correlation in order to determine the performance of both methods on nonlinear data sets.
This dissertation is organized as follows. In Chapter 2, we discuss the most important developments in the history of the correlation coefficient for measuring bivariate and multivariate association, as well as provide details about local correlation. In Chapter 3, we provide an overview of distance correlation. In Chapter 4, we construct the confidence interval for distance correlation based on the jackknife empirical likelihood for a U-statistic and show that Wilks' theorem holds for distance correlation. In addition, we compare the performance of the confidence intervals for distance correlation from the jackknife empirical likelihood and the bootstrap method. In Chapter 5, we introduce the local distance correlation for bivariate cases and discuss the local Gaussian correlation. The properties of local distance correlation and the choice of bandwidth for the kernel function are discussed as well. We compare the local distance correlation with local Gaussian correlation in simulation studies and real-life examples. Chapter 6 presents the summary and future work of this research.

CHAPTER 2 LITERATURE REVIEW

2.1 Background on Dependence Coefficients

Correlation is one of the interesting topics in statistics related to scientific discovery and inno- vation, used to indicate the relation between two continuous variables. Galton (1888) first defined the concept of correlation in the following way: “Two variable organs are said to be co-related when the variation of the one is accompanied on the average by more or less variation of the other, and in the same direction. It is easy to see that co-relation must be the consequence of the vari- ations of the two organs being partly due to common causes. If they were in no respect due to common causes, the co-relation would be nil.” Galton (1890) developed his ideas of correlation to better understand the relationship between variables by collecting large data sets. Pearson (1896) used the word “Coefficient of Correlation” in the paper entitled “Mathematical Contributions to the Theory of Evolution. III. Regression, Heredity and Panmixia,” where he considered Pearson’s product-moment correlation which is used to find the linear association between two variables and it can be only used on quantitative variables. Pearson’s product-moment correlation was developed after the initial mathematical formulae for correlation discovered by Bravais (1844). Spearman (1904) adapted Pearson’s idea to be applied by substituting ranks for measurements in Pearson’s product-moment formula. Spearman’s rank correlation coefficient, which is known as Spearman’s rho, is one of the oldest formulas based on ranks that measures the association between two variables. In the first part of Spearman’s paper, he determined that a good method of correlation must meet some requirements such as quantitative expression, the significance of the quantity, accuracy, and ease of application. Also, he considered the advantages and disadvantages of comparison by rank.

2.2 Bivariate Dependence Measure

Hotelling and Pabst (1936) published a paper on rank correlation which avoids the assumption of normality. They defined the most convenient formula for computing the rank correlation when Pearson's correlation is applied to ranked data. However, they concluded that the rank correlation coefficient is easier to compute for samples smaller than 40, whereas Pearson's correlation works conveniently for larger sample sizes. A new measure of rank correlation, used to measure association between two random variables, was defined by Kendall (1938). Kendall's statistic, known as Kendall's tau, is based on concordant and discordant pairs. It is mostly used in psychological experiments when the ranks are known. Hoeffding (1948b) introduced a test for independence between two data sets, called Hoeffding's D measure, by measuring the distance between the product of the marginal probability distributions and the joint distribution.

Blomqvist (1950) considered a simpler dependence measure, based on the number of sample points $n_1$ that belong to the first or third quadrant and the number $n_2$ that belong to the second or fourth quadrant. It is defined as
$$q' = \frac{n_1 - n_2}{n_1 + n_2} = \frac{2 n_1}{n_1 + n_2} - 1, \qquad (-1 \le q' \le 1).$$

This dependence measure is often called the medial correlation coefficient. The population version of the medial correlation coefficient for a pair of continuous random variables $X$ and $Y$ with medians $\tilde{x}$ and $\tilde{y}$, respectively, is given as
$$q = P\{(X - \tilde{x})(Y - \tilde{y}) > 0\} - P\{(X - \tilde{x})(Y - \tilde{y}) < 0\}.$$
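To make the sample version concrete, here is a minimal R sketch of the quadrant statistic; the function name and the simulated data are illustrative choices of ours, not part of the dissertation.

```r
# Sample medial (quadrant) correlation: count points in the quadrants formed by the medians
blomqvist_q <- function(x, y) {
  dx <- x - median(x)
  dy <- y - median(y)
  n1 <- sum(dx * dy > 0)   # first or third quadrant
  n2 <- sum(dx * dy < 0)   # second or fourth quadrant
  (n1 - n2) / (n1 + n2)
}

set.seed(1)
x <- rnorm(200)
y <- 0.5 * x + rnorm(200)
blomqvist_q(x, y)
```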

The asymptotic efficiency of a test based on the statistic $q$ is about 41 percent, and this test has Fisher's (1941) exact test of independence in the 2 × 2 table as a special case. Fisher's exact test was proposed in 1941 and is used when members of two independent groups can fall into one of two mutually exclusive categories, to determine whether the proportions of those falling into each category differ by group. An ordinal invariant measure of association for bivariate populations was discussed by Kruskal (1958). He focused on the probabilistic and operational interpretations of their population values and discussed the relationships between three measures: the quadrant measure of Blomqvist, Kendall's tau, and Spearman's rho. Blum, Kiefer, and Rosenblatt (1961) suggested a test of independence based on empirical distribution functions for the case when the dimension $d \ge 2$, via appropriate Cramér–von Mises statistics, significant for large values of

$$B_n = \int \Big(S_n(r) - \prod_{j=1}^{d} S_{nj}(r_j)\Big)^2 \, dS_n(r), \qquad (2.2.1)$$
where $S_n(r)$ is the sample distribution function of independent random $d$-variate vectors $X_1, \ldots, X_n$ with unknown distribution function $F$, and $S_{nj}$ is the marginal distribution function associated with the $j$-th component of the $X_i$. They obtained the characteristic functions of the limiting distributions of a class of such test criteria and provided a table in the bivariate case, $d = 2$, for the corresponding distribution functions. Furthermore, the tests have asymptotic normal distributions and they are equivalent to the test proposed by Hoeffding (1948b) when $d = 2$. Bhuchongkul (1964) considered the class of rank tests for independence and proposed a test statistic of the form

$$T_N = N^{-1} \sum_{i=1}^{N} E_{N,r_i}\, E'_{N,s_i}\, Z_{N,r_i}\, Z'_{N,s_i},$$
where
$$Z_{N,r_i} = \begin{cases} 1, & X_i \text{ is the } r_i\text{-th smallest of the } X\text{'s}; \\ 0, & \text{otherwise,} \end{cases} \qquad Z'_{N,s_i} = \begin{cases} 1, & Y_i \text{ is the } s_i\text{-th smallest of the } Y\text{'s}; \\ 0, & \text{otherwise,} \end{cases}$$
and $\{E_{N,r_i}\}$ and $\{E'_{N,s_i}\}$, $i = 1, \ldots, N$, are two sets of constants satisfying certain restrictions, taken as the expected values of the $r_i$-th and $s_i$-th standard normal order statistics from a sample of size $N$. He concluded that the normal scores test is the locally most powerful rank test and is asymptotically as efficient as the parametric correlation coefficient test for specified alternatives when the underlying distributions are normal. If $E_{N,r_i} = r_i$ and $E'_{N,s_i} = s_i$, then the test statistic is equivalent to the Spearman rank correlation statistic.
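As a rough illustration of a rank statistic of this form, the following R sketch computes a normal-scores correlation; the Blom approximation to the expected normal order statistics and the helper names are our assumptions for illustration, not the dissertation's.

```r
# Normal-scores (rank) correlation: replace observations by approximate expected
# values of standard normal order statistics evaluated at their ranks
normal_scores <- function(r, N) qnorm((r - 0.375) / (N + 0.25))  # Blom approximation

normal_score_corr <- function(x, y) {
  N <- length(x)
  a <- normal_scores(rank(x), N)
  b <- normal_scores(rank(y), N)
  sum(a * b) / sum(a^2)   # normalized so that perfect rank agreement gives 1
}

set.seed(2)
x <- rnorm(100)
y <- x^2 + rnorm(100)
normal_score_corr(x, y)
```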

2.3 Multivariate Dependence Measure

Wilks (1935) introduced the classic parametric test based on
$$W = \frac{|A|}{|A_{11}|\,|A_{22}|},$$
where $A_{11}$ is the $r \times r$ sample covariance matrix of the first sample, $A_{22}$ is the $s \times s$ sample covariance matrix of the second sample, and $A$ is the partitioned sample covariance matrix. The test is a likelihood ratio test, which is optimal under the multivariate normal model for testing $H_0$: $x_i^{(1)}$ and $x_i^{(2)}$ are independent, where $x_i^{(1)}$ and $x_i^{(2)}$ for $i = 1, \ldots, n$ are $r$-dimensional and $s$-dimensional continuous random vectors, respectively. Under $H_0$ with finite fourth moments, $-n \log W \xrightarrow{d} \chi^2_{rs}$, where $n$ is the sample size. A disadvantage of the likelihood ratio test is that it is not usable if the dimension is greater than the sample size or the distributional assumption does not hold.

Sinha and Wieand (1977) extended Bhuchongkul's (1964) rank statistics for testing multivariate independence. They showed that the test statistic can be expressed as a rank statistic which is easy to compute. The test statistic has an asymptotic normal distribution and can detect mutual dependence in alternatives which are pairwise independent.
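A minimal R sketch of Wilks' statistic under the definitions above; the block dimensions and simulated data are illustrative assumptions.

```r
# Wilks' W for testing independence of two sub-vectors of a multivariate sample
set.seed(3)
n <- 200; r <- 2; s <- 3
X1 <- matrix(rnorm(n * r), n, r)          # first sub-vector, r-dimensional
X2 <- matrix(rnorm(n * s), n, s)          # second sub-vector, s-dimensional
A   <- cov(cbind(X1, X2))                 # partitioned sample covariance matrix
A11 <- A[1:r, 1:r]
A22 <- A[(r + 1):(r + s), (r + 1):(r + s)]

W    <- det(A) / (det(A11) * det(A22))
stat <- -n * log(W)                       # approximately chi-squared with r*s df under H0
pval <- pchisq(stat, df = r * s, lower.tail = FALSE)
c(W = W, statistic = stat, p.value = pval)
```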

The nonparametric statistic $Q_n$ was introduced by Gieser and Randles (1997), based on interdirections, for testing whether two vector-valued quantities are independent. The interdirection count measures the angular distance between two observation vectors relative to the origin and the positions of the other observations. This statistic has an intuitive invariance property, and it reduces to the quadrant statistic when the two quantities are univariate. When each vector is elliptically symmetric, $Q_n$ has a limiting chi-squared distribution under the null hypothesis of independence. Gieser and Randles compared their test to Wilks' likelihood ratio criterion when the vectors have heavy-tailed elliptically symmetric distributions, and they gave an example in which $Q_n$ is resistant to outliers. The $Q_n$ test is better than the componentwise quadrant statistic when the vectors are spherically symmetric. In addition, they showed that $Q_n$ performs better than the other tests for heavy-tailed distributions and is competitive for distributions with moderate tail weights. Extensions of the quadrant test of Blomqvist (1950) based on spatial signs, which are easier to compute for data in common dimensions, were introduced by Taskinen, Kankainen, and Oja (2003). Their test statistic is asymptotically equivalent to the interdirection test of Gieser and Randles (1997) when the vectors are elliptically symmetric, but it is easier to compute in practice. Taskinen, Oja, and Randles (2005) provided practical, robust alternatives to normal-theory methods and discussed a sequel to the multivariate extension of the quadrant test by Gieser and Randles (1997) as well as Taskinen, Kankainen, and Oja (2003). Taskinen, Oja, and Randles presented new multivariate extensions of Kendall's tau and Spearman's rho statistics using two different approaches. First, interdirection proportions are used to estimate the cosines of angles between centered observation vectors and differences of observation vectors. Second, covariances between affine-equivariant multivariate signs and ranks are used. If each vector is elliptically symmetric, then the test statistics arising from these two approaches are asymptotically equivalent. Székely, Rizzo, and Bakirov (2007) introduced distance correlation, which is a nonparametric measure of dependence between random vectors; see Section 2.5 for more details.

2.4 Properties of Dependence Measure

Rényi (1959) introduced properties of measures of dependence, proposing seven axioms for a nonparametric measure of dependence $\delta(X,Y)$ between two random variables on a common probability space.

1. $\delta(X,Y)$ is defined for any pair of random variables $X$ and $Y$, neither of them being constant with probability 1.

2. $\delta(X,Y) = \delta(Y,X)$.

3. $0 \le \delta(X,Y) \le 1$.

4. δ(X,Y ) = 0 if and only if X and Y are independent.

5. δ(X,Y ) = 1 if there exists a strict dependence between X and Y, that is either Y = f(X) or X = g(Y ), where f(X) and g(Y ) are Borel-measurable functions.

6. If $f$ and $g$ are one-to-one Borel-measurable functions on $\mathbb{R}$, then
$$\delta(f(X), g(Y)) = \delta(X,Y).$$

7. If the joint probability distribution of X and Y is normal, then δ(X,Y ) = |ρ(X,Y )|, where ρ(X,Y ) is correlation coefficient between X and Y.

Rényi's fifth condition is not a strong condition, because a functional relationship is sufficient but not necessary, according to Li (2015). Rényi showed that the mean square contingency, the correlation coefficient, and the correlation ratios satisfy some of these properties, but the maximal correlation coefficient proposed by Gebelein (1941) satisfies all seven axioms of a dependence measure. The maximal correlation is defined as
$$\rho'(X,Y) = \sup_{f,g}\, \rho(f(X), g(Y)),$$
where $\rho$ is the Pearson product-moment correlation coefficient and $f, g$ are Borel-measurable functions. Another measure of dependence that satisfies all of Rényi's axioms is the information coefficient of correlation, introduced by Linfoot (1957), which is defined for two continuous random variables as
$$r = \sqrt{1 - \exp(-2\, I(X,Y))},$$
where $I(X,Y)$ is the mutual information of the pair of continuous random variables $X$ and $Y$, defined by Shannon (1948) as

$$I(X,Y) = \iint p(X,Y) \log \frac{p(X,Y)}{p(X)\,p(Y)}\, dX\, dY,$$
where $p(X,Y)$ is the joint density and $p(X)$ and $p(Y)$ are the marginal densities of $X$ and $Y$, respectively.

Móri and Székely (2019) proved that if a dependence measure is defined for bounded nonconstant real-valued random variables and is invariant with respect to one-to-one measurable transformations of the real line, then the dependence measure cannot be weakly continuous; that means that when the sample size increases, the empirical values of the dependence measure do not necessarily converge to the population value. They developed four axioms for dependence measures when $(X,Y) \in S$, where $S$ is a nonempty set of pairs of nondegenerate random variables $X$ and $Y$ taking values in Euclidean spaces or separable Hilbert spaces $H$, as follows:

1. δ(X,Y ) = 0 if and only if X and Y are independent.

2. δ(X,Y ) is invariant with respect to all similarity transformations of H; that is,

δ(f(X), g(Y )) = δ(X,Y ),

where f(X), g(Y ) are similarity transformations of H.

3. δ(X,Y ) = 1 if and only if Y = f(X) with probability 1, where f(X) is a similarity transformation of H.

4. $\delta(X,Y)$ is continuous; that is, if $(X_i,Y_i) \in S$, $i = 1, 2, \ldots$, are such that for some positive constant $C$ we have $E(|X_i|^2 + |Y_i|^2) \le C$, $i = 1, 2, \ldots$, and $(X_i,Y_i)$ converges in distribution to $(X,Y)$, then $\delta(X_i,Y_i) \to \delta(X,Y)$.

The first condition is equivalent to Rényi's fourth condition, but the main difference from Rényi's axioms is that one-to-one invariance is replaced by similarity invariance. Móri and Székely showed that maximal correlation satisfies the first three axioms; however, it cannot satisfy the fourth axiom because the maximal correlation coefficient is invariant with respect to all one-to-one Borel functions on the real line and therefore cannot be continuous.

2.5 Distance Correlation

Distance correlation $R$, introduced by Székely, Rizzo, and Bakirov (2007), is a powerful measure of dependence that quantifies the association of random vectors in arbitrary dimensions. Distance correlation is the standardized distance covariance, which is defined as a weighted $L_2$ distance between the joint characteristic function and the product of the marginal characteristic functions. In describing bivariate and multivariate associations, distance correlation has sufficient power to detect linear and nonlinear dependence structures. In the bivariate normal case, distance correlation is less than the absolute value of the Pearson product-moment correlation coefficient and coincides with the absolute value of the correlation in the Bernoulli case. Advantages of using distance correlation include that no distributional assumptions and no computation of the inverse of the covariance matrix are required. Furthermore, the corresponding statistic applies to random vectors of arbitrary, not necessarily equal, dimensions. Important properties of distance correlation are that the coefficient takes values between 0 and 1, the coefficient equals zero if and only if $X$ and $Y$ are independent, and the coefficient is invariant under general invertible affine transformations. Aspiras-Paler (2015) examined distance correlation with respect to Rényi's axioms:

1*. Distance correlation $R$ is defined for any pair of random variables $X$ and $Y$ with finite first moments, i.e., $E|X|_p < \infty$ and $E|Y|_q < \infty$.

5*. Székely, Rizzo, and Bakirov (2007) stated that if $R(X,Y) = 1$, then there exist a vector $a$, a nonzero real number $b$, and an orthogonal matrix $R$ such that $Y = a + bXR$ for the data matrices $X$ and $Y$. Aspiras-Paler gave a counterexample in which $X$ has a standard normal distribution and $Y = X^3$. Then $R(X,Y) < 1$ because $Y$ is not a linear transformation of $X$. Therefore, $R(X,Y) < 1$ when $Y$ is a function of $X$.

6*. If $X$ and $Y$ are standard normal and $Y = X$, then $R(X,Y) = 1$. Let $f(X) := X$ and $g(Y) := Y^3$; then both $f$ and $g$ are one-to-one functions that map $\mathbb{R}$ onto itself. In the bivariate case, $R(X,Y) = 1$ only if there is a linear relation $aX + bY = c$ between $X$ and $Y$ for some constants $a, b, c \in \mathbb{R}$. Hence $R(f(X), g(Y)) = R(X,Y)$ does not hold in general.

7*. Székely, Rizzo, and Bakirov (2007) showed that if $p = q = 1$ with a Gaussian distribution, then $R \le |\rho|$ and
$$R^2(X,Y) = \frac{\rho \arcsin \rho + \sqrt{1-\rho^2} - \rho \arcsin(\rho/2) - \sqrt{4-\rho^2} + 1}{1 + \pi/3 - \sqrt{3}}. \qquad (2.5.1)$$
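For reference, (2.5.1) can be evaluated directly; a small R helper (the function name is ours):

```r
# Squared distance correlation of a bivariate normal pair as a function of rho, eq. (2.5.1)
dcor2_bvnorm <- function(rho) {
  (rho * asin(rho) + sqrt(1 - rho^2) - rho * asin(rho / 2) - sqrt(4 - rho^2) + 1) /
    (1 + pi / 3 - sqrt(3))
}
dcor2_bvnorm(c(0, 0.3, 0.6, 0.9, 1))  # increases from 0 to 1; its square root is <= |rho|
```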

Distance correlation formally satisfies all four axioms of Móri and Székely (2019) discussed in Section 2.4, because distance correlation is invariant under the transformations of its associated Euclidean group and has the continuity property. There are several statistical tests based on empirical characteristic functions, and the test of distance correlation is one of them. Feuerverger and Mureika (1977) considered the asymptotic behavior of the empirical characteristic function. Based on empirical characteristic functions, Csörgő (1985) studied mutual tests of independence, and Feuerverger (1993) proposed a rank test for bivariate dependence. There has been much work recently to extend the distance correlation. Lyons (2013) extended the distance correlation to two random variables on separable Hilbert spaces of negative type. Zhu et al. (2017) proposed a test of independence based on random projection and distance correlation.

2.6 Local Correlation

Bjerve and Doksum (1993) proposed a local nonparametric dependence function to measure the association between two random variables $X$ and $Y$. The local linear correlation satisfies some desirable properties of correlation: it is between −1 and 1, it is equal to 0 when $X$ and $Y$ are independent, and it is invariant with respect to location and scale changes in $X$ and $Y$. It does not, however, characterize independence in the non-Gaussian case. Doksum et al. (1994) introduced estimates of the local correlation based on nearest-neighbor estimates of the residual and the regression slope function. Jones (1996) defined a local dependence measure based on a nonparametric bivariate distribution and described the properties of the local dependence function, which satisfies some of Rényi's (1959) axioms. A new dependence measure for two real, not necessarily linearly related, random variables was proposed by Delicado and Smrekar (2009). They expressed their measure by using principal curves and the covariance and the linear correlation along the curve. Furthermore, they showed that their measure satisfies desirable properties, including modified versions of Rényi's axioms. They modified three of Rényi's axioms as follows:

1*. δ(X,Y ) is defined for any pair of random variables (X,Y ) distributed along a curve.

5*. For $(X,Y)$ distributed along a curve $c$ with generating variables $(S,T)$, $\delta(X,Y) = 1$ if and only if there is a strict dependence between $(X,Y)$ and $S$, that is, $X = c_1(S)$ and $Y = c_2(S)$, or equivalently $T$ is identically 0.

6*. If f(X) and g(Y ) are strictly monotone almost surely on ranges of X and Y, respectively, then δ{f(X), g(Y )} = δ(X,Y ).

In general, the local correlation measures above are suitable for Gaussian data, but they fail to characterize independence in non-Gaussian cases. Tjøstheim and Hufthammer (2013) introduced estimation and visualization for a new local measure of dependence called local Gaussian correlation. It is derived from a local correlation function, based on approximating a bivariate density locally by a family of bivariate Gaussian densities using local likelihood. The local parameters and estimation using local likelihood were first described in Hjort and Jones (1996). Tjøstheim and Hufthammer (2013) included in their paper the limiting behavior of a bandwidth algorithm that aims to balance the variance of the estimated local parameters against the bias of the resulting density estimate, and the estimation of standard errors. The bandwidth algorithm of Tjøstheim and Hufthammer is not really satisfactory in a general situation. Berentsen and Tjøstheim (2014) introduced another method to choose the bandwidth based on the principle of likelihood cross-validation. They constructed a global measure of dependence by aggregating the local correlations on R2 for linear and nonlinear dependence structures in bivariate data. An advantage of the local Gaussian correlation is that it can distinguish between negative and positive local dependences for bivariate variables. Berentsen and Tjøstheim (2014) also proposed an alternative method for copula goodness-of-fit testing for bivariate variables.

2.7 Multiscale Graph Correlation

Shen, Priebe, and Vogelstein (2018) considered a local distance correlation computed by utilizing K-nearest-neighbor graphs, which they named multiscale graph correlation. They gave the definition of the population multiscale graph correlation in terms of the characteristic functions of the underlying random variables and K-nearest-neighbor graphs, and they described properties of the multiscale graph correlation which are related to those of distance correlation.

Given $n$ pairs of sample data $(X_i, Y_i)$ that are i.i.d., that is, $(X_i, Y_i) \in \mathbb{R}^p \times \mathbb{R}^q$ for $i = 1, 2, \ldots, n$, Shen et al. computed $A_{ij}$ and $B_{ij}$ as column-centered distance matrices with diagonals excluded:
$$A_{ij} = \begin{cases} \tilde{A}_{ij} - \frac{1}{n-1}\sum_{m=1}^{n} \tilde{A}_{mj}, & i \ne j; \\ 0, & i = j, \end{cases} \qquad B_{ij} = \begin{cases} \tilde{B}_{ij} - \frac{1}{n-1}\sum_{m=1}^{n} \tilde{B}_{mj}, & i \ne j; \\ 0, & i = j, \end{cases}$$

where $\tilde{A}_{ij} = \|X_i - X_j\|$ and $\tilde{B}_{ij} = \|Y_i - Y_j\|$ for $i, j = 1, 2, \ldots, n$, respectively. The sample local distance covariance is defined by
$$\mathcal{V}_n^{kl}(X,Y) = \frac{1}{n(n-1)} \sum_{i \ne j} C_{ij}^{kl} - \left(\frac{1}{n(n-1)} \sum_{i \ne j} A_{ij}^{k}\right) \left(\frac{1}{n(n-1)} \sum_{i \ne j} B_{ij}^{l}\right),$$
where $A_{ij}^{k} = A_{ij}\, I(R_{ij}^{A} \le k)$ and $B_{ij}^{l}$ is defined similarly; $I(\cdot)$ is the indicator function and $R_{ij}^{A}$ is a rank function of $x_i$ relative to $x_j$, that is, $R_{ij}^{A} = k$ if $x_i$ is the $k$-th nearest neighbor of $x_j$. $C_{ij}^{kl}$ is the joint distance matrix, $C_{ij}^{kl} = A_{ij}^{k} \times B_{ji}^{l}$. The sample local distance correlation is defined as the normalization of the local distance covariance, i.e.,
$$R_n^{kl}(X,Y) = \frac{\mathcal{V}_n^{kl}(X,Y)}{\sqrt{\mathcal{V}_n^{k}(X,X)\,\mathcal{V}_n^{l}(Y,Y)}},$$
where $\mathcal{V}_n^{k}(X,X)$ and $\mathcal{V}_n^{l}(Y,Y)$ are the sample local distance variances of $X$ and $Y$, respectively. The multiscale graph correlation is defined as the maximum local correlation over the largest connected region $\mathcal{R}$ of significant local correlations, $R_n^{kl^*}(X,Y)$, where $kl^* = \arg\max_{kl \in \mathcal{R}} S(R_n^{kl}(X,Y))$ and $S(\cdot)$ is an operation that filters out all insignificant local correlations. This method will not be used in our research because we introduce a new method to compute distance correlation locally based on the jackknife empirical likelihood.

CHAPTER 3 OVERVIEW OF DISTANCE CORRELATION

3.1 Distance Correlation

Distance correlation $R$, introduced by Székely, Rizzo, and Bakirov (2007), is a measure of dependence and a test of joint independence between random vectors $X$ and $Y$ in arbitrary dimensions. For all distributions with finite first moments, distance correlation $R$ generalizes the idea of correlation in at least two fundamental ways:

1. R(X,Y ) is defined for X and Y in arbitrary dimension.

2. R(X,Y ) = 0 characterizes the independence of X and Y .

Distance correlation R has properties of a true dependence measure, analogous to product-moment correlation ρ, but it generalizes and extends classical measures of dependence. Empirical distance dependence measures are based on functions of Euclidean distances between sample elements instead of sample moments.

Definition 3.1.1. The distance covariance between random vectors $X$ and $Y$ with finite first moments is a nonnegative number, for $X$ and $Y$ taking values in $\mathbb{R}^p$ and $\mathbb{R}^q$, respectively. This coefficient is defined by a weighted $L_2$ norm measuring the distance between the joint characteristic function (c.f.) $\phi_{X,Y}$ of $X$ and $Y$ and the product $\phi_X \phi_Y$ of the marginal c.f.s of $X$ and $Y$. Distance covariance $\mathcal{V}(X,Y)$ is the nonnegative square root of
$$\mathcal{V}^2(X,Y) = \|\phi_{X,Y}(t,s) - \phi_X(t)\phi_Y(s)\|_w^2 = \int_{\mathbb{R}^{p+q}} |\phi_{X,Y}(t,s) - \phi_X(t)\phi_Y(s)|^2\, w(t,s)\, dt\, ds,$$
where $w(t,s) = (c_p c_q |t|_p^{1+p} |s|_q^{1+q})^{-1}$, $c_d = \frac{\pi^{(1+d)/2}}{\Gamma\left(\frac{1+d}{2}\right)}$, and $\Gamma(\cdot)$ is the complete gamma function. Similarly, the distance variances are defined as the square roots of
$$\mathcal{V}^2(X) = \mathcal{V}^2(X,X) = \|\phi_{X,X}(t,s) - \phi_X(t)\phi_X(s)\|_w^2 = \int_{\mathbb{R}^{2p}} |\phi_{X,X}(t,s) - \phi_X(t)\phi_X(s)|^2\, w(t,s)\, dt\, ds,$$
$$\mathcal{V}^2(Y) = \mathcal{V}^2(Y,Y) = \|\phi_{Y,Y}(t,s) - \phi_Y(t)\phi_Y(s)\|_w^2 = \int_{\mathbb{R}^{2q}} |\phi_{Y,Y}(t,s) - \phi_Y(t)\phi_Y(s)|^2\, w(t,s)\, dt\, ds.$$

Definition 3.1.2. The distance correlation between random vectors $X$ and $Y$ with finite first moments is the nonnegative number $R(X,Y)$ defined by
$$R^2(X,Y) = \begin{cases} \dfrac{\mathcal{V}^2(X,Y)}{\sqrt{\mathcal{V}^2(X)\,\mathcal{V}^2(Y)}}, & \mathcal{V}^2(X)\mathcal{V}^2(Y) > 0; \\[2mm] 0, & \mathcal{V}^2(X)\mathcal{V}^2(Y) = 0. \end{cases}$$

Distance correlation satisfies $0 \le R \le 1$, and $R = 0$ if and only if $X$ and $Y$ are independent. In the bivariate normal case, $R$ is a function of $\rho$, and $R(X,Y) \le |\rho(X,Y)|$ with equality when $\rho = \pm 1$.

For an observed random sample $\{(x_i, y_i) : i = 1, \ldots, n\}$ from the joint distribution of random vectors $X$ and $Y$, $a_{ij} = \|X_i - X_j\|$ denotes the pairwise distances of the $X$ observations and $b_{ij} = \|Y_i - Y_j\|$ denotes the pairwise distances of the $Y$ observations, for $i, j = 1, \ldots, n$. The corresponding double-centered distance matrices $(A_{ij})_{i,j=1}^{n}$ and $(B_{ij})_{i,j=1}^{n}$ are defined by
$$A_{ij} = a_{ij} - \frac{1}{n}\sum_{l=1}^{n} a_{il} - \frac{1}{n}\sum_{k=1}^{n} a_{kj} + \frac{1}{n^2}\sum_{k,l=1}^{n} a_{kl}, \qquad (3.1.3)$$
$$B_{ij} = b_{ij} - \frac{1}{n}\sum_{l=1}^{n} b_{il} - \frac{1}{n}\sum_{k=1}^{n} b_{kj} + \frac{1}{n^2}\sum_{k,l=1}^{n} b_{kl}. \qquad (3.1.4)$$

Definition 3.1.5. The sample distance covariance $\mathcal{V}_n(X,Y)$ and sample distance correlation $R_n(X,Y)$ are defined by
$$\mathcal{V}_n^2(X,Y) = \frac{1}{n^2}\sum_{i,j=1}^{n} A_{ij} B_{ij}, \qquad (3.1.6)$$
and
$$R_n^2(X,Y) = \begin{cases} \dfrac{\mathcal{V}_n^2(X,Y)}{\sqrt{\mathcal{V}_n^2(X)\,\mathcal{V}_n^2(Y)}}, & \mathcal{V}_n^2(X)\mathcal{V}_n^2(Y) > 0; \\[2mm] 0, & \mathcal{V}_n^2(X)\mathcal{V}_n^2(Y) = 0, \end{cases}$$
respectively, where the squared sample distance variances are
$$\mathcal{V}_n^2(X) = \mathcal{V}_n^2(X,X) = \frac{1}{n^2}\sum_{i,j=1}^{n} A_{ij}^2, \qquad \mathcal{V}_n^2(Y) = \mathcal{V}_n^2(Y,Y) = \frac{1}{n^2}\sum_{i,j=1}^{n} B_{ij}^2.$$
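A direct R translation of (3.1.3), (3.1.4), and (3.1.6), written from the definitions above; this O(n²) sketch and its function name are ours, not the dissertation's implementation.

```r
# Sample distance covariance and squared distance correlation via double centering
dcor_sample <- function(x, y) {
  a <- as.matrix(dist(x))                                          # pairwise distances a_ij
  b <- as.matrix(dist(y))                                          # pairwise distances b_ij
  A <- sweep(sweep(a, 1, rowMeans(a)), 2, colMeans(a)) + mean(a)   # eq. (3.1.3)
  B <- sweep(sweep(b, 1, rowMeans(b)), 2, colMeans(b)) + mean(b)   # eq. (3.1.4)
  V2xy <- mean(A * B)                                              # eq. (3.1.6)
  V2x  <- mean(A * A)
  V2y  <- mean(B * B)
  if (V2x * V2y > 0) V2xy / sqrt(V2x * V2y) else 0                 # squared sample distance correlation
}

set.seed(4)
x <- rnorm(100)
y <- x^2 + rnorm(100, sd = 0.5)
dcor_sample(x, y)
```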

The squared sample distance covariance $\mathcal{V}_n^2(X,Y)$ equals zero if and only if every sample observation is identical. The distance covariance statistic is nonnegative; therefore, its expected value is positive except in degenerate cases. Hence, the distance covariance is biased for the population coefficient, and this bias increases with the dimensions. The sample distance correlation coefficient $R_n^2$ is able to capture nonlinear associations because it is based on a characterization of independence. Moreover, it is sensitive to all types of departures from independence, including nonlinear or nonmonotone dependence structures that can be detected in ever more complex associations.

Since the coefficient $\mathcal{V}^2$ is defined in terms of the difference between the joint characteristic function and the product of the marginal characteristic functions, $\mathcal{V}_n^2$ is defined by replacing the characteristic functions with empirical characteristic functions. The joint empirical characteristic function of the sample $(X_1,Y_1), \ldots, (X_n,Y_n)$, for $i = \sqrt{-1}$, $t \in \mathbb{R}^p$, $s \in \mathbb{R}^q$, is

$$\hat{\phi}_{X,Y}(t,s) = \frac{1}{n}\sum_{k=1}^{n} \exp\{i\langle t, X_k\rangle + i\langle s, Y_k\rangle\},$$
and the marginal empirical characteristic functions of the sample $X$ and sample $Y$ are
$$\hat{\phi}_X(t) = \frac{1}{n}\sum_{k=1}^{n} \exp\{i\langle t, X_k\rangle\}, \qquad \hat{\phi}_Y(s) = \frac{1}{n}\sum_{k=1}^{n} \exp\{i\langle s, Y_k\rangle\},$$
respectively. An empirical version of distance covariance can be defined as
$$\mathcal{V}_n^2(X,Y) = \|\hat{\phi}_{X,Y}(t,s) - \hat{\phi}_X(t)\hat{\phi}_Y(s)\|_w^2.$$

Székely and Rizzo (2009) proved that an equivalent definition of distance covariance is

$$\mathcal{V}^2(X,Y) = E[|X - X'||Y - Y'|] + E[|X - X'|]\,E[|Y - Y'|] - 2E[|X - X'||Y - Y''|], \qquad (3.1.7)$$
where $X'$ is an independent copy of $X$; $Y', Y''$ are independent copies of $Y$; and $|X - X'|$ and $|Y - Y'|$ are Euclidean distances. Some of the properties of distance correlation and distance covariance include:

1. $\mathcal{V}_n(X,Y)$ and $R_n(X,Y)$ converge almost surely to $\mathcal{V}(X,Y)$ and $R(X,Y)$ as $n \to \infty$; that is,
$$\lim_{n\to\infty} \mathcal{V}_n(X,Y) = \mathcal{V}(X,Y), \qquad \lim_{n\to\infty} R_n(X,Y) = R(X,Y),$$
with probability 1.

2. Vn(X,Y ) ≥ 0 and Vn(X) = 0 if and only if every sample observation is identical.

3. 0 ≤ Rn(X,Y ) ≤ 1.

4. If $R_n(X,Y) = 1$, then there exist a vector $a$, a nonzero real number $b$, and an orthogonal matrix $R$ such that $Y = a + bXR$, for the data matrices $X$ and $Y$.

5. For the bivariate Gaussian distribution, $R \le |\rho|$ and
$$R^2(X,Y) = \frac{\rho \arcsin \rho + \sqrt{1-\rho^2} - \rho \arcsin(\rho/2) - \sqrt{4-\rho^2} + 1}{1 + \pi/3 - \sqrt{3}}. \qquad (3.1.8)$$

A test of independence based on $n\mathcal{V}_n^2$ or $\dfrac{n\mathcal{V}_n^2}{S_2}$, where $S_2 = \frac{1}{n^2}\sum_{k,l=1}^{n}|X_k - X_l| \cdot \frac{1}{n^2}\sum_{k,l=1}^{n}|Y_k - Y_l|$, can be used to test independence between the random vectors $X$ and $Y$. Under the independence hypothesis, the normalized test statistic $\dfrac{n\mathcal{V}_n^2}{S_2}$ converges in distribution to a quadratic form
$$Q = \sum_{j=1}^{\infty} \lambda_j Z_j^2, \qquad (3.1.9)$$
where the $Z_j$ are independent standard normal random variables and the $\lambda_j$ are nonnegative constants that depend on the distribution of $(X,Y)$. The expected value of $Q$ is equal to 1. A test that rejects the null hypothesis of independence of $X$ and $Y$ when
$$\sqrt{\frac{n\mathcal{V}_n^2}{S_2}} \ge \Phi^{-1}\left(1 - \frac{\alpha}{2}\right)$$
has an asymptotic significance level of at most $\alpha$. If $E(|X|_p + |Y|_q) < \infty$, then

1. If $X$ and $Y$ are independent, then $\dfrac{n\mathcal{V}_n^2}{S_2} \xrightarrow{d} Q$ as $n \to \infty$, where $Q$ is a nonnegative quadratic form of centered Gaussian random variables (3.1.9) and $E[Q] = 1$.

2. If $X$ and $Y$ are independent, then $n\mathcal{V}_n^2 \xrightarrow{d} Q_1$ as $n \to \infty$, where $Q_1$ is a nonnegative quadratic form of centered Gaussian random variables and $E[Q_1] = E|X - X'|\,E|Y - Y'|$.

3. If $X$ and $Y$ are dependent, then $\dfrac{n\mathcal{V}_n^2}{S_2} \xrightarrow{p} \infty$ and $n\mathcal{V}_n^2 \xrightarrow{p} \infty$ as $n \to \infty$.
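In practice, this test is available in the energy package discussed in the next paragraph; a minimal usage sketch, where the simulated data and the permutation replicate count are arbitrary illustrative choices:

```r
# Distance correlation test of independence using the energy package
library(energy)
set.seed(5)
x <- rnorm(200)
y <- x^2 + rnorm(200, sd = 0.5)     # nonlinear dependence
dcor(x, y)                          # sample distance correlation R_n
dcor.test(x, y, R = 999)            # permutation test of independence
```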

Rizzo and Székely (2013) developed the R package energy with the functions dcor and dcor.test to compute the distance correlation coefficient and test the significance of distance correlation.

3.2 Modified Distance Correlation

Székely and Rizzo (2013) considered a modified version of the squared distance covariance that results in a t-test of multivariate independence applicable in high dimension. The resulting t-test is unbiased for all significance levels and for every sample size greater than three. Under independence, a transformation of the distance correlation statistic converges to a Student's t-distribution as $p, q \to \infty$. The Student t-distributed statistic is easily interpretable for high-dimensional data.

A modified version of the statistic $\mathcal{V}_n^2(X,Y)$ avoids the problem of the bias in the sample distance correlation as $p, q \to \infty$. The modified versions $A_{i,j}^{*}$ of $A_{i,j}$ and $B_{i,j}^{*}$ of $B_{i,j}$ are defined as follows:
$$A_{ij}^{*} = \begin{cases} \frac{n}{n-1}\left(A_{i,j} - \frac{a_{ij}}{n}\right), & i \ne j; \\[1mm] \frac{n}{n-1}\left(\bar{a}_i - \bar{a}\right), & i = j, \end{cases} \qquad B_{ij}^{*} = \begin{cases} \frac{n}{n-1}\left(B_{i,j} - \frac{b_{ij}}{n}\right), & i \ne j; \\[1mm] \frac{n}{n-1}\left(\bar{b}_i - \bar{b}\right), & i = j. \end{cases}$$

Definition 3.2.1. The modified distance covariance statistic is
$$\mathcal{V}_n^{*}(X,Y) = \frac{U_n^{*}(X,Y)}{n(n-3)} = \frac{1}{n(n-3)}\left\{\sum_{i,j=1}^{n} A_{i,j}^{*} B_{i,j}^{*} - \frac{n}{n-2}\sum_{i=1}^{n} A_{i,i}^{*} B_{i,i}^{*}\right\}, \qquad (3.2.2)$$
where
$$U_n^{*}(X,Y) = \sum_{i \ne j} A_{i,j}^{*} B_{i,j}^{*} - \frac{2}{n-2}\sum_{i=1}^{n} A_{i,i}^{*} B_{i,i}^{*}.$$

$\mathcal{V}_n^{*}(X,Y)$ is an unbiased estimator of the squared population distance covariance $\mathcal{V}^2(X,Y)$.

Definition 3.2.3. The modified distance correlation statistic is
$$R_n^{*}(X,Y) = \begin{cases} \dfrac{\mathcal{V}_n^{*}(X,Y)}{\sqrt{\mathcal{V}_n^{*}(X)\,\mathcal{V}_n^{*}(Y)}}, & \mathcal{V}_n^{*}(X)\mathcal{V}_n^{*}(Y) > 0; \\[2mm] 0, & \mathcal{V}_n^{*}(X)\mathcal{V}_n^{*}(Y) = 0. \end{cases}$$

The original $R_n$ statistic is between 0 and 1, but $R_n^{*}$ can take on negative values. The $R_n^{*}$ statistic converges stochastically to the square of the population distance correlation $R^2$. The test statistic for independence in high dimension is
$$T_n = \sqrt{v - 1}\; \frac{R_n^{*}}{\sqrt{1 - (R_n^{*})^2}},$$
where $v = \frac{n(n-3)}{2}$. As $p$ and $q$ tend to infinity, under the independence hypothesis, $T_n$ converges in distribution to a Student t-distribution with $v - 1$ degrees of freedom. Székely and Rizzo (2013) obtained an asymptotic Z-test of independence in high dimension. Under independence of $X$ and $Y$, if $X$ and $Y$ are i.i.d. with positive finite variance, the limiting distribution of $(1 + R_n^{*}(X,Y))/2$ is a symmetric beta distribution with shape parameter $(v - 1)/2$. In high dimension, the large-sample distribution of $\sqrt{v-1}\,R_n^{*}(X,Y)$ is approximately standard normal. Rizzo and Székely developed the R package energy with the function dcort.test to implement the unbiased test for independence.
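A sketch of the statistic $T_n$ above in R. We use the bias-corrected distance correlation from the energy package (bcdcor) as a stand-in for $R_n^{*}$, which is an assumption on our part, and the simulated high-dimensional data are purely illustrative.

```r
# High-dimensional t-test statistic T_n built from a bias-corrected distance correlation
library(energy)
set.seed(6)
n <- 50; p <- 100; q <- 100
X <- matrix(rnorm(n * p), n, p)
Y <- matrix(rnorm(n * q), n, q)                 # independent case
Rstar <- bcdcor(X, Y)                           # assumed stand-in for R_n^*
v     <- n * (n - 3) / 2
Tn    <- sqrt(v - 1) * Rstar / sqrt(1 - Rstar^2)
pval  <- pt(Tn, df = v - 1, lower.tail = FALSE) # compare with Student t, v - 1 df
c(Tn = Tn, p.value = pval)
```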

3.3 Unbiased Distance Correlation

Székely and Rizzo (2014) considered the unbiased estimator of the squared distance covariance $\mathcal{V}^2(X,Y)$, which is based on U-centering instead of the classical method of double centering. The U-centered distance covariance is the inner product in the Hilbert space $\mathcal{H}_n$ of U-centered distance matrices for samples of size $n$, and it is unbiased for the squared population distance covariance.

The definition of distance covariance $\mathcal{V}_n^2(X,Y)$ uses double centering, with matrices $A_{ij}$ and $B_{ij}$ that have the property that all rows and columns sum to zero. Another type of centering is U-centering, with matrices denoted by $\tilde{A}_{ij}$ and $\tilde{B}_{ij}$, which have the additional property that all expectations are zero; that is, $E[\tilde{A}_{ij}] = 0$ and $E[\tilde{B}_{ij}] = 0$ for all $i, j$. Let $\tilde{A}$ be a U-centered distance matrix. Then

1. Rows and columns of A˜ sum to zero.

2. $\widetilde{(\tilde{A})} = \tilde{A}$; that is, if $B$ is the matrix obtained by U-centering an element $\tilde{A} \in \mathcal{H}_n$, then $B = \tilde{A}$.

3. A˜ is invariant to double centering. If B is the matrix obtained by double centering the matrix A˜, then B = A˜.

4. If c is a constant and B denotes the matrix obtained by adding c to the off-diagonal elements of A˜, then B˜ = A˜.

Definition 3.3.1. Let $A = (a_{ij})$ and $B = (b_{ij})$ be symmetric, real-valued $n \times n$ matrices with zero diagonal, $n > 2$. Define the U-centered matrices $\tilde{A}$ and $\tilde{B}$ as follows. Let the $(i,j)$-th entries of $\tilde{A}$ and $\tilde{B}$ be
$$\tilde{A}_{ij} = \begin{cases} a_{ij} - \frac{1}{n-2}\sum_{l=1}^{n} a_{il} - \frac{1}{n-2}\sum_{k=1}^{n} a_{kj} + \frac{1}{(n-1)(n-2)}\sum_{k,l=1}^{n} a_{kl}, & i \ne j; \\ 0, & i = j, \end{cases} \qquad (3.3.2)$$
and
$$\tilde{B}_{ij} = \begin{cases} b_{ij} - \frac{1}{n-2}\sum_{l=1}^{n} b_{il} - \frac{1}{n-2}\sum_{k=1}^{n} b_{kj} + \frac{1}{(n-1)(n-2)}\sum_{k,l=1}^{n} b_{kl}, & i \ne j; \\ 0, & i = j, \end{cases} \qquad (3.3.3)$$
respectively.

Proposition 3.3.4. Let $(x_i, y_i)$, $i = 1, \ldots, n$, denote a sample of observations from the joint distribution $(X,Y)$ of random vectors $X$ and $Y$. Let $A = (a_{ij})$ be the Euclidean distance matrix of the sample $x_1, \ldots, x_n$ from the distribution of $X$, and $B = (b_{ij})$ be the Euclidean distance matrix of the sample $y_1, \ldots, y_n$ from the distribution of $Y$. Then if $E(|X| + |Y|) < \infty$, for $n > 3$, the following

1 X U 2(X,Y ) = (A,˜ B˜) = A˜ B˜ (3.3.5) n n(n − 3) ij ij i6=j 25 is an unbiased estimator of the squared population distance covariance V2(X,Y ).

2 2 Proof. We have to show that Un(X,Y ) is an unbiased estimator of V (X,Y ) based on the proof in Szekely´ and Rizzo (2014). Since the population coefficient V2(X,Y ) statistic in (3.1.7) is linear combination of distances, we define the expected values as

0 0 α := E[aij] = E[|X − X |], β := E[bij] = E[|Y − Y |], i 6= j,

0 00 δ := E[aijbil] = E[|X − X ||Y − Y |], i, j, l distinct,

0 0 γ := E[aijbij] = E[|X − X ||Y − Y |], i 6= j,

where (X,Y ), (X0,Y 0), (X00,Y 00) are i.i.d. We can write the population coefficient V2(X,Y ) in linear combination of α, β, δ, and γ as

V2(X,Y ) = E[|X − X0||Y − Y 0|] + E[|X − X0|]E[|Y − Y 0|] − 2E[|X − X0||Y − Y 00|]

= γ + αβ − 2δ.

a := ai. a := a.j a := a.. a = Pn a Let’s denote the notation ei. n−2 , e.j n−2 , and e.. (n−1)(n−2) , where i. j=1 ij, Pn Pn a.j = j=1 aij, and a.. = i,j=1 aij. Define ebi.,eb.j, and eb.. similarly. We further have

   aijbij −aijebi. −aijeb.j +aijeb..       −ai.bij +ai.bi. +ai.b.j −ai.b..  ∗ P e e e e e e e n(n − 3) Un(X,Y ) = i6=j  −a b +a b +a b −a b   e.j ij e.jei. e.je.j e.je..       +ea..bij −ea..ebi. −ea..eb.j +ea..eb..  26 P P P = aijbij − ai.ebi. − a. jeb. j +a..eb.. i6=j i j P P P P − eai.bi. +(n − 1) eai.ebi. + eai.eb. j −(n − 1) ai.eb.. i i i6=j i P P P P − ea. jb. j + ea. jebi. +(n − 1) ea. jeb. j −(n − 1) ea. jeb.. j i6=j j j P P +ea..b.. −(n − 1) ea..ebi. −(n − 1) ea..eb. j +n(n − 1)ea..eb... i j

Let’s obtain X X T1 = aijbij,T2 = a..b..,T3 = ai.bi.. i6=j i Then

  T3 T3 T2  T1 − n−2 − n−2 + (n−1)(n−2)     (n−1)T   − T3 + 3 + T2−T3 − T2  2  n−2 (n−2)2 (n−2)2 (n−2)2  n(n − 3)Un(X,Y ) =  − T3 + T2−T3 + (n−1)T3 − T2   n−2 (n−2)2 (n−2)2 (n−2)2     T2 T2 T2 nT2   + (n−1)(n−2) − (n−2)2 − (n−2)2 + (n−1)(n−2)2 

After simplification, we further have

T 2T n(n − 3)U 2(X,Y ) = T + 2 − 3 . (3.3.6) n 1 (n − 1)(n − 2) n − 2

It is obvious that E[T1] = n(n − 1)γ. When we expand the terms of T2 and T3 and combine terms that have equal expected values, we can obtain

E[T2] = n(n − 1){(n − 2)(n − 3)αβ + 2γ + 4(n − 2)δ}, 27 and

E[T3] = n(n − 1){(n − 2)δ + γ}.

Therefore,

1  T 2T  E[U 2(X,Y )] = E T + 2 − 3 n n(n − 3) 1 (n − 1)(n − 2) n − 2 1 n3 − 5n2 + 6n  = γ + n(n − 3)αβ + (6n − 2n2)δ n(n − 3) n − 2 = γ + αβ − 2δ = V2(X,Y ).

2 The statistic Un(X,Y ) is an inner product in the Hilbert space Hn of U-centered distance matrices, and the corresponding inner product (3.3.5) defines an unbiased estimator of the squared distance covariance. Hence, A˜ = 0 if and only if the n sample observations are equally distant or

2 at least n − 1 of the n sample observations are identical. A bias-corrected Rn(X,Y ) is defined by normalizing the inner product statistic with the bias-corrected distance variance statistics.

 2  √ Un(X,Y ) , U 2(X)U 2(Y ) > 0;  2 2 n n ∗∗ Un(X)Un(Y ) Rn (X,Y ) = (3.3.7)  2 2  0, Un(X)Un(Y ) = 0,

2 where Un(X,Y ), defined in (3.3.5), is unbiased estimator of distance covariance of X and Y , and the unbiased squared sample distance variances of X, and Y, respectively, are

1 X U 2(X) = (A,˜ A˜) = A˜ A˜ , n n n(n − 3) ij ij i6=j (3.3.8) 1 X U 2(Y ) = (B,˜ B˜) = B˜ B˜ , n n n(n − 3) ij ij i6=j

∗∗ ∗∗ The Rn can take negative values; hence, we cannot define the bias-corrected Rn (X,Y ) statistic to be the square root of it. Notice that the original distance covariance, which is defined in (3.1.6), is a V -statistic, and its unbiased versions are U-statistics. 28 ∗∗ The bias-corrected distance correlation statistic, Rn , and the unbiased estimator of distance

2 covariance, Un, are implemented in the R energy package by the bcdcor and dcovU functions.

2 ∗∗ The computation of Un and Rn can be implemented directly from its definitions but that time complexity is O(n2) which is high as a constant times n2 for sample size n. A fast formula for a biased estimator of V2(X,Y ) can be derived by combining the double

centered distance matrix Aij, which is defined in (3.1.3) and Bij, which is defined in (3.1.4). After simplification, the corresponding V -statistic is

n n 1 X 2 X a..b.. V2(X,Y ) = a b − a b + , (3.3.9) n n2 ij ij n3 i. i. n4 i,j=1 i=1

where the row i sum, column j sum, and grand sum of the distance matrix (aij) and (bij) are defined as: n n X X ai. = ail, a.j = akj, l=1 k=1

n n X X bi. = bil, b.j = bkj, l=1 k=1

n n X X a.. = akl, b.. = akl. k,l=1 k,l=1

In addition, a faster computing formula for an unbiased estimator of V2(X,Y ) can be derived by ˜ ˜ combining the U-centered matrix Aij and Bij, defined in Definition 3.3.1, and can be simplified to

n 1 X 2 X a..b.. U 2(X,Y ) = a b − a b + , n n(n − 3) ij ij n(n − 2)(n − 3) i. i. n(n − 1)(n − 2)(n − 3) i6=j i=1 (3.3.10)

2 2 Huo and Szekely´ (2016) showed that Un is a U-statistic. Since the fast formula Un(X,Y ) is a P Pn linear combination, we have to compute the three terms i6=j aijbij, i=1 ai.bi., and a..b... 29

In order to compute ai., we have the following

n X X ai. = ail = |Xi − Xl| l=1 i X X = (Xi − Xl) + (Xl − Xi) X

Xl

i=1 Xi≤Xl where P X is a partial sum and r is a rank of X . We compute b similarly. Therefore, Xi≤Xl i i i i. Pn the second term is obtained by taking the summation of ai.bi.. It is clear that a.. = i=1 ai. and Pn b.. = i=1 bi.; hence, the third term is a..b... As we see above, the computing of the second and the third terms are easy and fast. But when we compute the first term in (3.3.10), it is not straightforward, but the dyadic approach can be applied using binary search algorithm. The first term follows as

X X aijbij = |Xi − Xj||Yi − Yj| i6=j i6=j n (3.3.12) X X X X X = XiYi γij − Xi Yjγij − Yj Xjγij + XjYjγij, i=1 i6=j i6=j i6=j i6=j

where $\gamma_{ij}$ is a sign function; for all $1 \le i, j \le n$,
$$\gamma_{ij} = \begin{cases} +1, & \text{if } (X_i - X_j)(Y_i - Y_j) > 0; \\ -1, & \text{otherwise.} \end{cases} \qquad (3.3.13)$$

We have implemented a fast O(n log n) algorithm for the computation of sample distance covariances in the bivariate case for both versions, the U-statistic and the V-statistic, and the function dcov2d in the R energy package can be applied to very large datasets. This implementation is fast because it does not store the distance matrices. Since the sample distance correlation is a normalized version of the sample distance covariance, the companion function dcor2d is provided as well; both dcov2d and dcor2d can compute either the unbiased sample statistic, which is a U-statistic, or the original sample statistic, which is a V-statistic.
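A usage sketch of the fast bivariate functions mentioned here; the argument names follow our reading of the energy documentation and should be checked against the installed version.

```r
# O(n log n) bivariate distance covariance/correlation on a large sample
library(energy)
set.seed(7)
n <- 1e5
x <- rnorm(n)
y <- x + rnorm(n)
dcov2d(x, y, type = "U")   # unbiased (U-statistic) sample distance covariance
dcor2d(x, y, type = "U")   # bias-corrected sample distance correlation
dcor2d(x, y, type = "V")   # original (V-statistic) sample distance correlation
```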

CHAPTER 4 CONFIDENCE INTERVAL FOR DISTANCE CORRELATION

Distance correlation measures both linear and nonlinear dependence between random vectors. Székely, Rizzo, and Bakirov (2007) proposed the sample distance correlation, which is the point estimate of the population distance correlation coefficient $R^2$, together with a distance correlation test of the hypothesis of independence. However, an alternative way to estimate distance correlation is to calculate a confidence interval, which is a range of values that is likely to contain the population distance correlation $R^2$.

Let $(X_1,Y_1), \ldots, (X_n,Y_n)$ be drawn i.i.d. from the joint distribution $F_{XY}$. Then we compute an interval that likely contains the true value of the distance correlation $R^2$. If $I(X,Y)$ is a confidence interval for $R^2$ with confidence level $\alpha$, then
$$P\big(R^2 \in I((X_1,Y_1), \ldots, (X_n,Y_n))\big) = \alpha.$$

The bootstrap confidence interval is the most common nonparametric method used for interval estimation. Bootstrap methods, first introduced by Efron (1979), approximate the sampling distribution of a statistic from the observed data. However, a bootstrap confidence interval may fail to include the true parameter value in some cases, according to Carpenter and Bithell (2000). The empirical likelihood method introduced by Owen (1990) is also used to construct confidence intervals. This method is widely used in statistics with linear functionals; however, the method of Lagrange multipliers for the optimization cannot be solved easily for a nonlinear functional. Therefore, a direct application of empirical likelihood to calculate the confidence interval fails to obtain a chi-square limit. We are not able to use this method directly, since distance correlation is a nonlinear statistical functional. A jackknife empirical likelihood method proposed by Jing, Yuan, and Zhou (2009) can be applied to a nonlinear U-statistic. This method is a modified version of the empirical likelihood method and is simple to use. In this chapter, we introduce a confidence interval for distance correlation based on the jackknife empirical likelihood method. To date, no research has been found on the construction of confidence intervals for distance correlation.

4.1 Confidence Intervals for Distance Correlation

2 Distance covariance Un which was defined in Proposition 3.3.4 is an unbiased estimator of the squared population distance covariance V2. Proposition 3.3.4 has been proven by Szekely´ and

2 Rizzo (2014), where the expected value of Un is equal to the squared population distance covari- ance V2. We also had considered the proof of Proposition 3.3.4 in Chapter 3, because we focus on an unbiased estimator of sample distance covariance. First, we show that squared distance covari-

2 ∗∗ ance Un can be represented as a U-statistic, because distance correlation Rn , which was defined

2 in (3.3.7), is the standardized version of Un. Second, we construct the jackknife pseudo-samples

∗∗ defined by Quenouille (1956) for distance correlation Rn as a sample of asymptotically indepen- dent observations; the jackknife estimator for distance correlation becomes the sample mean of jackknife pseudo-samples. Then, the empirical likelihood method can be applied to construct a confidence interval for the mean of jackknife pseudo-samples of distance correlation.

4.1.1 U-statistic Results

A U-statistic is an alternative way to construct an unbiased estimator. The basic theory of U- statistics was introduced by Hoeffding (1948a). However, we adapt a definition of a U-statistic as

stated in Serfling (2009, Ch.5) for distance covariance: Let a sample (X1,Y1), ..., (Xn,Yn) be i.i.d.

2 from FXY . The U-statistic for estimation of V on the basis of a sample (X1,Y1), ..., (Xn,Yn) is obtained by averaging a symmetric the kernel h over the observations of a sample size 4, which is

2 the smallest sample size for estimating Un. That is,

−1 n X U 2(X,Y ) = h((X ,Y ), ..., (X ,Y )), (4.1.1) n 4 k1 k1 k4 k4 1≤k1<...

n where the summation over all combinations 4 of 4 distinct elements 1 ≤ k1 < ... < k4 ≤ n is drawn without replacement from the set {1, ..., n}. 33 2 The U-statistic of V uses h((X1,Y1), ..., (X4,Y4)) for which there is an unbiased estimator, when V2 is represented as

2 V (X,Y ) = E(h((X1,Y1), ..., (X4,Y4))), (4.1.2) where h is a kernel of the estimator of distance covariance and it is symmetrical in its 4 arguments. We can obtain a characterization of a U-statistic for distance covariance. It is easy to show that the U-statistic of distance covariance is jackknife invariant. First, we will state the theorem of jackknife invariance of order m, which is due to Lenth (1983), and prove it in general.

Theorem 4.1.3. (Jackknife invariance theorem) Let Un(X1,...,Xn) be a statistic of a sample

−k X1,...,Xn. Let Un−1(X1,...,Xn), k = 1, 2, . . . , n, be a statistic of a reduced sample

−k X1,...,Xk−1,Xk+1,...,Xn; that is, Un−1(X1,...,Xn) is the statistic after removing the obser- vation Xk. This reduced statistic is called jackknife statistic. A necessary and sufficient condition for Un(X1,...,Xn) to be a U-statistic of order m is that Un(X1,...,Xn) has the jackknife invari- ance property: the identity

n X −k n · Un(X1,...,Xn) = Un−1(X1,...,Xn) (4.1.4) k=1 holds for all n > m.

Proof. Suppose n = m+1. Assume that it is true for n, and Un is a U-statistic with kernel function h, then we have

m+1 1 X U (X ,...,X ) = U −k (X ,...,X ) m+1 1 m+1 m + 1 m+1 1 m+1 k=1 m+1 1 X = U (X ,...,X ,X ,...,X ) m + 1 m+1−1 1 k−1 k+1 m+1 k=1 m+1 1 X = U (X ,...,X ,X ,...,X ), m + 1 m 1 k−1 k+1 m+1 k=1 34 where Um(X1,...,Xk−1,Xk+1,...,Xm+1) is the statistic after removing the element Xk. We must have Um+1 which equals

m+1 X (m + 1)Um+1 = h(X1,...,Xk−1,Xk+1,...,Xm+1). k=1

Suppose for any n + 1, Un+1 is a U-statistic with kernel h, then

n+1 1 X U (X ,...,X ) = U −k (X ,...,X ) n+1 1 n+1 n + 1 n+1 1 n+1 k=1 n+1 1 X = U (X ,...,X ,X ,...,X ) n + 1 n+1−1 1 k−1 k+1 n+1 k=1 n+1 1 X = U (X ,...,X ,X ,...,X ) n + 1 n 1 k−1 k+1 n+1 k=1 n+1 −1 1 X  n  X = h(X ,...,X ,X ,...,X ), n + 1 m 1 k−1 k+1 n+1 k=1 1≤k1<...

Un+1 is not contained in exactly n + 1 − m of 1 ≤ k1 < ... < km ≤ n + 1, so we have Un+1 as

−1 n+1 (n + 1 − m) n  X U (X ,...,X ) = h(X ,...,X ,X ,...,X ) n+1 1 n+1 n + 1 m 1 k−1 k+1 n+1 k=1 −1 n+1 n + 1 X = h(X ,...,X ,X ,...,X ). m 1 k−1 k+1 n+1 k=1

Applying Theorem 4.1.3 to show that distance covariance is a U-statistic. We can use a simpler

2 ˜ ˜ and faster computing formula for Un by combining the two U-centered matrices Aij and Bij, which 35 are defined in (3.3.2 ) and (3.3.3), respectively. Then

n n n ! 1 X 1 X 1 X 1 X U 2(X,Y ) = a − a − a + a n n(n − 3) ij n − 2 ij n − 2 ij (n − 1)(n − 2) ij i6=j j=1 i=1 i,j=1 n n n ! 1 X 1 X 1 X b − b − b + b , ij n − 2 ij n − 2 ij (n − 1)(n − 2) ij j=1 i=1 i,j=1 (4.1.5)

Pn Pn Pn where ai.bi. = j=1 aijbij and a..b.. = i,j=1 aij i,j=1 bij and after simplification, we obtain

n 1 X 2 X a..b.. U 2(X,Y ) = a b − a b + . (4.1.6) n n(n − 3) ij ij n(n − 2)(n − 3) i. i. n(n − 1)(n − 2)(n − 3) i6=j i=1

The unbiased squared sample distance variances of X and Y , respectively, are

n 1 X 2 X a..a.. U 2(X) = a a − a a + , (4.1.7) n n(n − 3) ij ij n(n − 2)(n − 3) i. i. n(n − 1)(n − 2)(n − 3) i6=j i=1

and

n 1 X 2 X b..b.. U 2(Y ) = b b − b b + . (4.1.8) n n(n − 3) ij ij n(n − 2)(n − 3) i. i. n(n − 1)(n − 2)(n − 3) i6=j i=1

2 We will use Un as defined in (4.1.6) to verify the jackknife invariance condition of Theorem 4.1.3. 36

When we delete the observations Xk and Yk we have

2(−k) 1 X U (X,Y ) = a b n−1 (n − 1)(n − 2) ij ij i6=j,i6=k,j6=k

1 X (−k) (−k) (−k) − 2 a (b − b ) (n − 1)(n − 2)(n − 3) ij i. ij i6=j

1 X (−k) (−k) (−k) + a (b(−k) − 4b + 2b ) (n − 1)(n − 2)(n − 3)(n − 4) ij .. i. ij i6=j (n − 3)(n − 4) + 2(n − 4) + 2 X = a b (n − 1)(n − 2)(n − 3)(n − 4) ij ij i6=j,i6=k,j6=k   n (n − 4) + 2 X (−k) (−k) − 2 a b (n − 1)(n − 2)(n − 3)(n − 4) i. i. i=1 a(−k)b(−k) + .. .. . (n − 1)(n − 2)(n − 3)(n − 4)

Therefore,

n 2(−k) 1 X 1 X (−k) (−k) U (X,Y ) = a b − 2 a b n−1 (n − 1)(n − 4) ij ij (n − 1)(n − 3)(n − 4) i. i. i6=j,i6=k,j6=k i=1 a(−k)b(−k) + .. .. . (n − 1)(n − 2)(n − 3)(n − 4) (4.1.9)

−k −k −k −k For i 6= k, we define ai. , bi. , a.. , and b.. as

−k −k ai. = ai. − aik, bi. = bi. − bik,

−k −k a.. = a.. − 2ak., b.. = b.. − 2bk..

2(−k) −k −k −k −k 2(−k) Then, we take the summation of Un−1 after we substitute ai. , bi. , a.. , and b.. into Un−1 . 37 Pn 2(−k) Therefore, k=1 Un−1 is as follows:

n Pn P ! X 2(−k) (n − 2) X (n − 3) i=1 ai.bi. + i6=k aikbik U (X,Y ) = a b − 2 n−1 (n − 1)(n − 4) ij ij (n − 1)(n − 3)(n − 4) k=1 i6=j,i6=k,j6=k (n − 4)a b + 4 Pn a b + .. .. k=1 k. k. . (n − 1)(n − 2)(n − 3)(n − 4) (4.1.10)

2 From the above calculations, we can conclude that the summation of Un after removing observation

2 Xk and Yk equals nUn; that is

n n X 2(−k) 1 X 2 X a..b.. U (X,Y ) = a b − a b + n−1 (n − 3) ij ij (n − 2)(n − 3) i. i. (n − 1)(n − 2)(n − 3) k=1 i6=j i=1

2 = n ·Un(X,Y ). (4.1.11)

2 2 Also Un(X) and Un(Y ) have the jackknife invariance property, such as

n 2 X 2(−k) n ·Un(X) = Un−1 (X), k=1

and n 2 X 2(−k) n ·Un(Y ) = Un−1 (Y ), k=1

2 2 2(−k) where Un(X) and Un(Y ) are defined in (4.1.7) and (4.1.8), respectively. In addition, Un−1 (X) 2(−k) and Un−1 (Y ) are defined as follows:

n 2(−k) 1 X 1 X (−k) (−k) U (X) = a a − 2 a a n−1 (n − 1)(n − 4) ij ij (n − 1)(n − 3)(n − 4) i. i. i6=j,i6=k,j6=k i=1 a(−k)a(−k) + .. .. , (n − 1)(n − 2)(n − 3)(n − 4) (4.1.12) 38 and

n (−k) 1 X 1 X (−k) (−k) U 2 (Y ) = b b − 2 b b n−1 (n − 1)(n − 4) ij ij (n − 1)(n − 3)(n − 4) i. i. i6=j,i6=k,j6=k i=1 (4.1.13) b(−k)b(−k) + .. .. . (n − 1)(n − 2)(n − 3)(n − 4)

2 When a distance correlation is the normalized coefficient of distance covariance Un, it is called a bias-corrected distance correlation,

n ∗∗ X ∗∗(−k) nRn := Rn−1 , (4.1.14) k=1

where bias-corrected distance correlation after removing the observations Xk and Yk is defined as

 2(−k) Un−1 (X,Y ) 2(−k) 2(−k)  q , U (X)U (Y ) > 0; ∗∗(−k)  2(−k) 2(−k) n−1 n−1 Un−1 (X)Un−1 (Y ) Rn−1 = (4.1.15)  2(−k) 2(−k)  0, Un−1 (X)Un−1 (Y ) = 0,

the unbiased squared sample covariance of X and Y when we delete the observations Xk and

2(−k) 2(−k) 2(−k) Yk, and Un−1 (X,Y ) is defined in (4.1.9). Un−1 (X) and Un−1 (Y ) are defined respectively in

∗∗ (4.1.12) and (4.1.13). We can write the bias-corrected distance correlation Rn with kernel function h(·) of order 4, as

U 2(X,Y ) R∗∗ = n n p 2 2 Un(X)Un(Y ) n−1 P h((X ,Y ), ..., (X ,Y )) (4.1.16) 4 1≤k1<...

2 ∗∗ and R = Eh ((X1,Y1), ..., (X4,Y4)), where

∗∗ h((Xk1,Yk1), ..., (Xk4,Yk4)) h ((Xk1,Yk1), ..., (Xk4,Yk4)) = p . (4.1.17) h(Xk1, ..., Xk4).h(Yk1, ..., Yk4) 39 4.1.2 Jackknife Empirical Likelihood for Distance Correlation

We construct the jackknife pseudo-values of sample distance correlation based on sample size

∗∗ n, and the jackknife pseudo-values of Rn as

∗∗ ∗∗(−k) Zk = nRn − (n − 1)Rn−1 , k = 1, ..., n,

∗∗(−k) 2 where Rn−1 is defined in (4.1.15). The jackknife estimator of R is the average of the pseudo- values n 1 X R2 = Z . n,jack n k k=1

2 ∗∗ The condition Rn,jack = Rn is equivalent to Equation (4.1.14) since we further have

n 1 X ∗∗(−k) R2 − R∗∗ (n − 1)[R∗∗ − R ], (4.1.18) n,jack n u n n n−1 k=1

∗∗ where Zk is the jackknife pseudo-value of Rn . The observations are asymptotically independent

so we are able to apply the empirical likelihood method to Zk.

Remark 4.1.19. The jackknife empirical likelihood method applies for the U-statistic estimator of squared distance covariance. Using the jackknife empirical likelihood, we can derive a confidence

interval estimate for the true population distance covariance U 2. However, the numerical values

2 2 of U and Un are affected by the scale of the random variables. We seek a confidence interval for the squared distance correlation R2, because it is normalized by the distance variances to values in [0, 1]. The bias-corrected distance correlation statistic is not a U-statistic, so the jackknife empirical likelihood method does not directly apply. However, we observe that if we normalize the data prior to computing the distance covariance, we obtain the bias-corrected distance correlation.

That is, if we transform samples X and Y by

Xi Yi Xi = , Yi = , i = 1, . . . , n, e p 2 e p 2 Un(X) Un(Y ) 40 ∗∗ ∗∗ then Un(X,e Ye) = Rn (X,Y ). Asymptotically, the denominator of Rn (X,Y ) converges to the constant pU 2(X)U 2(Y ), and the jackknife empirical likelihood method empirically performs the same for the normalized and unnormalized statistics.

Let p = (p1, ..., pn) be a probability vector with probability pk for Zk. Then the jackknife empirical likelihood for the distance correlation R2 can be defined as follows:

( n n n ) 2 Y X X 2 L(R ) = max pk : p1 ≥ ... ≥ pn, pk = 1, pkZk = R , k=1 k=1 k=1

Qn Pn 2 where k=1 pk is subject to k=1 pk = 1, pk ≥ 0, k = 1, 2, ..., n. The likelihood L(R ) attains its

−n −1 2 maximum n at pk = n . The jackknife empirical likelihood ratio at R is defined as

( n n n ) L(R2) Y X X R(R2) := = max (np ): p ≥ ... ≥ p , p = 1, p Z = R2 . (4.1.20) n−n k 1 n k k k k=1 k=1 k=1

By using the method of Lagrange multiplier to solve a constrained optimization problem (4.1.20), we find n n n ! 2 X X 2 X G(R ) = log npk − nλ pk(Zk − R ) + γ pk − 1 , (4.1.21) k=1 k=1 k=1

where the function G is called the Lagrangian; λ ∈ Rq are Lagrange multipliers for the second set of constraints, γ ∈ R is Lagrange multiplier for the third set of constraints, and n is sample size. Finding the root of ∂G = 0, k = 1, ..., n, we further have ∂pk

∂G 1 2 = − nλ(Zk − R ) + γ = 0. ∂pk npk

Pn 2 It is straightforwardly verified that the solution to (4.1.21), where the constraint k=1 pkZk = R Pn Pn ∂G and pk = 1, is pk = γ + n = 0. Hence, it can be represented in the form γ = −n. k=1 k=1 ∂pk We have that

∂G 1 2 = − nλ(Zk − R ) − n = 0. ∂pk npk 41 2 2 We maximize R(R ), when min1≤k≤n Zk < R < max1≤k≤n Zk, as

1 pk = 2 , (4.1.22) n(1 + λ(Zk − R ))

where λ is the solution to n 2 X (Zk − R ) f(λ) = 2 = 0. (4.1.23) 1 + λ(Zk − R ) k=1

2 Substituting pk in (4.1.22) back into R(R ) with solution λ from (4.1.23), the jackknife empir- ical likelihood ratio for R2 can be defined as

n n 2 Y Y 2 −1 R(R ) = (npk) = {1 + λ(Zk − R )} , k=1 k=1

and taking the logarithm of R(R2), which gives the nonparametric jackknife empirical log-likelihood ratio for R2, is n 2 X 2 −2 log R(R ) = 2 log{1 + λ(Zk − R )}. k=1 Therefore, Wilks’ theorem holds when −2 log R(R2) converges in distribution to a chi-square distribution.

Theorem 4.1.24. Define

2 h1(x, y) = Eh((x, y), (Xk2,Yk2), (Xk3,Yk3), (Xk4,Yk4)) − V (X,Y ), (4.1.25)

2 and σxy1 = V ar(h1(X1,Y1));

∗∗ ∗∗ 2 h1 (x, y) = Eh ((x, y), (Xk2,Yk2), (Xk3,Yk3), (Xk4,Yk4)) − R , (4.1.26)

2 ∗∗ and σ1 = V ar(h1 (X1,Y1)).

2 2 2 2 Assume that Eh ((X1,Y1), (X2,Y2), (X3,Y3), (X4,Y4)) < ∞, σxy1 > 0, σx1 > 0, and σy1 >

∗∗2 2 0. Also assume that Eh ((X1,Y1), (X2,Y2), (X3,Y3), (X4,Y4)) < ∞ and σ1 > 0. 42 Then −2 log R(R2) converges in distribution to a chi-square distribution with one degree of freedom as n → ∞. That is,

2 d 2 −2 log R(R ) −→ χ1.

Based on the above Theorem 4.1.24, a jackknife empirical likelihood confidence interval for R2 with level (1 − α) can be defined as

2 2 2 Iα = {R : −2 log R(R ) ≤ χ1,1−α},

2 2 where χ1,1−α is the (1 − α) quantile of χ1. To prove Theorem 4.1.24, first we provide several lemmas.

Lemma 4.1.27. Under the conditions of Theorem 4.1.24, we have

√ 2 2 d 2 2 n(Un − V ) −→ N(0, 4 σxy1),

and

∗∗ 2 d 2 (Rn − R ) −→ N(0, σ1),

2 2 ∗∗ where σxy1 = V ar(h1(X1,Y1)) and σ1 = V ar(h1 (X1,Y1)).

2 Proof. First, we have to show that unbiased estimator of the squared distance covariance Un given in Equation (4.1.6) converges in distribution as n → ∞ to a Gaussian distribution with mean

2 2 2 zero and variance 4 σxy1. The unbiased estimator of the squared distance covariance Un can be alternatively formulated using a U-statistic as given in (4.1.1). Let n 4 X U˜2 = h (X ,Y ). (4.1.28) n n 1 k k k=1

2 Since 4h1(Xk,Yk) are i.i.d. with mean zero and variance σxy1, the Central Limit Theorem implies √ ˜2 2 2 that nUn → N(0, 4 σxy1). √ √ ˜2 2 2 We have to show that nUn is asymptotically equivalent to n(Un − V ) and they have the 43 same limiting distribution. Since the mean is zero, it suffices to show that the variance converges

˜2 2 2 2 to zero. Hence, we need only to show that E(Un − (Un − V )) → 0.

˜2 2 2 2 ˜2 ˜2 2 2 E(Un − (Un − V )) = V ar(Un) − 2Cov(Un, Un) + V ar(Un), (4.1.29)

42σ2 42σ2 ˜2 xy1 xy1 where V ar(Un) = n and the last term on the right converges to n , based on Hoeffding ˜2 2 42 2 (1948a). To show that Cov(Un, Un) = n σxy1, we derive

−1 n 4 n X X Cov(U˜2, U 2) = Cov(h (X ,Y ), h((X ,Y ), ..., (X ,Y ))) n n n 4 1 k k k1 k1 k4 k4 k=1 1≤k1<...

42σ2 2 ˜2 2 xy1 and since V ar(h1(X1,Y1)) = σxy1, we further have that Cov(Un, Un) is equal to n . Therefore, √ ˜2 2 2 2 2 2 2 2 it is clear that E(Un − (Un − V )) = 0. We conclude that n(Un − V ) → N(0, 4 σxy1). A bias-

∗∗ corrected estimator of the squared distance correlation Rn is the normalized version of distance

2 ∗∗ 2 d 2 covariance Un. Hence (Rn − R ) −→ N(0, σ1).

In what follows, the notation f(x) = O(g(x)) indicates that the function f(x) is eventually bounded by a multiple of function g(x) and f(x) = o(g(x)) indicates that the function f(x)/g(x)

converges to zero. In addition, the notation f(x) = Op(g(x)) means the function f(x) is bounded

in probability by a multiple of function g(x) and f(x) = op(g(x)) means the function f(x)/g(x) converges in probability to zero.

Lemma 4.1.31. Under the conditions of Theorem 4.1.24, we have

n 1 X p S = (Z − R2)2 −→ nσ2. n n k 1 k=1 44 Proof.

n n 1 X 1 X (Z − R2)2 = (Z − R∗∗ + R∗∗ − R2)2 n k n k k=1 k=1 n 1 X = (Z − R∗∗)2 + (R∗∗ − R2)2 n k n n k=1

= I1 + I2,

1 Pn ∗∗ 2 ∗∗ 2 2 2 where I1 = n k=1(Zk − Rn ) , and I2 = (Rn − R ) . Since V arjack = V ar(Rn,jack) = 1 Pn 2 2 ∗∗ 2 n(n−1) k=1(Zk − Rn,jack) and Rn = Rn,jack, we have that

n 1 X I = (Z − R∗∗)2 = (n − 1)V ar . 1 n k n jack k=1

∗∗ It follows from Sen (1977) that V arjack is asymptotically equivalent to V ar(Rn ). Hence

∗∗ 2 I1 = (n − 1)V ar(Rn ) = nσ1 + o(1). (4.1.32)

∗∗ 2 2 ∗∗ I2 = (Rn − R ) , and by the Strong Law of Large Numbers for a U-statistic, we have Rn =

2 R + o(1). Combining I1 and I2, we obtain

2 Sn = nσ1 + o(1). a.s.

1 Pn 2 2 p 2 That is, Sn = n k=1(Zk − R ) −→ nσ1.

2 Lemma 4.1.33. Let Wn = max1≤k≤n |Zk −R |. Under the conditions of Theorem 4.1.24, we have

1 Wn = o(n 2 ) a.s.

2 2 ˜2 Proof. From Lemma 4.1.27, we have that (Un − V ) has the same asymptotic behavior as Un in 45 Equation (4.1.28). By Markov’s inequality, we derive for any  > 0,

n 1 X 1 P ( max |4h1(Xk,Yk)| ≥ n 2 ) ≤ P (|4h1(Xk,Yk)| ≥ n 2 ) 1≤k≤n k=1 1 (4.1.34) = nP (|4h1(X1,Y1)| ≥ n 2 ) 2 E|4h1(X1,Y1)| ≤ 1 . n 2 2

Then we obtain

1 lim P ( max |4h1(Xk,Yk)| ≥ n 2 ) = 0. n→∞ 1≤k≤n

1 ∗∗ This shows that max1≤k≤n |4h1(Xk,Yk)| = o(n 2 ). It follows that max1≤k≤n |h1 (Xk,Yk)| =

1 2 ∗∗ − 1 o(n 2 ). From Peng and Tan (2018), we have that Zk − R = h1 (Xk,Yk) + Op(n 2 ). Hence,

2 1 Wn = max1≤k≤n |Zk − R | = o(n 2 ) a.s.

With the help of the Lemma 4.1.27, Lemma 4.1.31, and Lemma 4.1.33, we can prove the main result of Theorem 4.1.24.

Proof of Theorem 4.1.24.

Pn 2 1 k=1(Zk − R ) |f(λ)| = 2 n 1 + λ(Zk − R )

1 n n (Z − R2)2 X 2 X k = (Zk − R ) − λ 2 n 1 + λ(Zk − R ) k=1 k=1

λ n (Z − R2)2 1 n X k X 2 ≥ 2 − (Zk − R ) n 1 + λ(Zk − R ) n k=1 k=1 n |λ| Sn 1 X 2 ≥ − (Zk − R ) , 1 + |λ| Wn n k=1

1 Pn 2 2 2 where Sn = n k=1(Zk − R ) = nσ1 + o(1) a.s. by Lemma 4.1.31 and Wn = max1≤k≤n |Zk − 46 2 1 R | = o(n 2 ) a.s. by Lemma 4.1.33. From Lemma 4.1.27, we have that

n 1 X 2 ∗∗ 2 − 1 (Zk − R ) = R − R = Op(n 2 ). n n k=1

It follows that

|λ| 1 − 2 2 −1 = Op(n )(nσ1 + o(1)) . 1 + |λ| Wn

Thus,

− 1 |λ| = Op(n 2 ). (4.1.35)

2 Let γk = λ(Zk − R ), then

− 1 1 max |γk| = |λ| Wn = Op(n 2 )o(n 2 ) = op(1). (4.1.36) 1≤k≤n

When we plug the estimator of γk back into (4.1.23), we have

n 2 1 X 2 γk 0 = f(λ) = (Zk − R )(1 − γk + ) n 1 + γk k=1 (4.1.37) n n 2 2 1 X 2 1 X (Zk − R )γk = (Zk − R ) − λSn + . n n 1 + γk k=1 k=1

The last term on the right in (4.1.37) can be bounded by

n 2 2 2 n 2 3 1 X (Zk − R )γ λ X (Zk − R ) k = n 1 + γk n 1 + γk k=1 k=1 λ2W S ≤ n n 1 + γk

1 −1 2 2 = Op(n )o(n )(nσ1 + o(1))Op(1)

− 1 = op(n 2 ).

Therefore, we find λ as ∗∗ 2 (Rn − R ) − 1 λ = + op(n 2 ). Sn 47 2 γk By Taylor expansion, we have log(1 + γk) = γk − 2 + ηk, where for a number 0 < B < ∞, we have that as n → ∞,

3 P (|ηk| ≤ B|γk| , 1 ≤ k ≤ n) → 1.

The jackknife empirical log likelihood ratio at R2 becomes

n 2 X −2 log R(R ) = −2 log(npk) k=1 n X = 2 log(1 + γk) k=1 n n n X X 2 X = 2 γk − γk + 2 ηk k=1 k=1 k=1 n n 1 X X = 2nλ (Z − R2) − nS λ2 + 2 η n k n k k=1 k=1 ∗∗ 2 2 n n(Rn − R ) −1 X = − nSnop(n ) + 2 ηk, Sn k=1

−1 2 −1 where | − nSnop(n )| = n(nσ1 + o(1))op(n ) = op(1) and

n n n X X 3 X 2 3 3 −3/2 3/2 | ηk| ≤ |ηk| ≤ B|λ| |Zk − R | ≤ B|λ| nWnSn = Op(n )o(n ) = op(1). k=1 k=1 k=1

By Lemma 4.1.27 and Lemma 4.1.31, we have that

∗∗ 2 2 n(Rn − R ) d 2 −→ χ1. Sn

d 2 Hence, from Slutsky’s theorem, we have −2 log R(θ) −→ χ1.

4.2 Simulation Study

We evaluate the jackknife empirical likelihood confidence interval for distance correlation by simulation studies and compare the results with bootstrap method. The bootstrap is commonly used to calculate confidence interval even if the underlying distribution is not normal and it is a simple resampling method that can approximate the population. We use the standard normal 48 bootstrap confidence interval with B = 999 replications. An approximate 100(1−α)% confidence interval for R2 is defined by

∗∗ R ± Z α SE, n 1− 2 c

α where SEc is the bootstrap estimate of the standard error and Z is the upper 2 critical value from a standard normal distribution. The boot.ci function in R package boot creates confidence intervals for bootstrap and we specify the type of confidence interval as “norm” to obtain the normal boot- strap interval. (This implementation first applies a bias correction to the point estimate using the bootstrap estimate of bias.) In the simulation studies, we draw 10, 000 random samples of different sample sizes n = 25, 50, 100 from a bivariate standard normal distribution with correlation ρ. According to Szekely,´ Rizzo, and Bakirov (2007), the distance correlation R2 for a bivariate standard normal distribution can be computed by using the formula

ρ arcsin ρ + p1 − ρ2 − ρ arcsin ρ/2 − p4 − ρ2 + 1 R2(X,Y ) = √ . 1 + π/3 − 3

We calculate the confidence interval for distance correlation Iα based on jackknife empirical like- lihood and standard normal bootstrap at different nominal significance levels 90%, 95%, 99% for ρ = 0, 0.2, 0.5, 0.8 which correspond to distance correlation R = 0, 0.179, 0.454, 0.755, respec- tively. In order to compare the two confidence interval methods, we report the performance of cov- erage probabilities and average interval lengths. The Monte Carlo approximation to the coverage probability and average interval length for the proposed confidence intervals are calculated and are defined respectively as B B 1 X 2 1 X Iα{R ∈ CIc j} and |CIc j|. B B j=1 j=1

The simulation results of the coverage probability and average interval length of 90%, 95%, and 99% confidence intervals are illustrated in Table 4.1, Table 4.2, and Table 4.3. From Tables 49 4.1-4.3, we observe that the jackknife empirical likelihood method confidence intervals typically have better coverage rates than the bootstrap normal confidence intervals, while average interval lengths of both methods are approximately the same. One could alternately first normalize the samples using the distance variances, then determine the jackknife empirical likelihood confidence intervals for the distance covariance. The coverage rates for the distance covariance the jackknife empirical likelihood confidence intervals were essentially equivalent to those in Tables 4.1-4.3. 50 Table 4.1 Coverage probabilities and average interval lengths of 90% confidence interval for R2

Jackknife empirical likelihood Standard normal bootstrap n ρ R Coverage probability Interval length Coverage probability Interval length 0 0 91.16 0.2729 53.32 0.3000 0.2 0.179 84.77 0.3286 55.03 0.3321 25 0.5 0.454 83.69 0.5059 67.69 0.4394 0.8 0.753 87.79 0.4847 79.21 0.4181 0 0 92.08 0.1261 51.66 0.1525 0.2 0.179 81.86 0.1851 58.28 0.1940 50 0.5 0.454 86.83 0.3382 76.58 0.3122 0.8 0.753 89.68 0.3210 84.82 0.2967 0 0 91.90 0.0617 52.19 0.0774 0.2 0.179 82.28 0.1160 65.32 0.1205 100 0.5 0.454 88.37 0.2324 81.83 0.2228 0.8 0.753 90.38 0.2198 87.47 0.2109

Table 4.2 Coverage probabilities and average interval lengths of 95% confidence interval for R2

Jackknife empirical likelihood Standard normal bootstrap n ρ R Coverage probability Interval length Coverage probability Interval length 0 0 94.30 0.3288 68.32 0.3574 0.2 0.179 88.67 0.3958 65.64 0.3957 25 0.5 0.454 87.97 0.6093 75.22 0.5236 0.8 0.753 92.41 0.5840 86.26 0.4982 0 0 95.03 0.1516 66.66 0.1817 0.2 0.179 86.23 0.2224 66.47 0.2312 50 0.5 0.454 91.52 0.4063 82.78 0.3720 0.8 0.753 94.29 0.3861 90.99 0.3535 0 0 95.02 0.0739 65.51 0.0922 0.2 0.179 85.98 0.1389 72.02 0.1436 100 0.5 0.454 93.63 0.2782 88.17 0.2654 0.8 0.753 95.19 0.2636 93.14 0.2513 51 Table 4.3 Coverage probabilities and average interval lengths of 99% confidence interval for R2

Jackknife empirical likelihood Standard normal bootstrap n ρ R Coverage probability Interval length Coverage probability Interval length 0 0 97.63 0.4419 92.50 0.4697 0.2 0.179 94.25 0.5318 84.86 0.5201 25 0.5 0.454 93.12 0.8183 85.62 0.6882 0.8 0.753 96.69 0.7849 93.74 0.6548 0 0 98.07 0.2034 91.34 0.2388 0.2 0.179 91.86 0.2979 81.14 0.3038 50 0.5 0.454 96.35 0.5442 91.32 0.4889 0.8 0.753 98.39 0.5183 96.86 0.4646 0 0 98.20 0.0985 89.64 0.1212 0.2 0.179 90.75 0.1846 82.21 0.1888 100 0.5 0.454 97.70 0.3699 95.33 0.3488 0.8 0.753 98.79 0.3516 97.83 0.3302

4.3 Real Examples

The best olive oils are available in Italy, where an olive oil contains mainly monounsaturated fats, saturated fats, and polyunsaturated fats. While monounsaturated fatty acids have a single carbon-to-carbon double bond, saturated fatty acids have no double bonds, and polyunsaturated fatty acids have many double bonds. We consider the dataset from a study of Italian olive oils in three different areas in Italy by Forina and Tiscornia (1982). This data consists of the percent- age composition of eight different fatty acids found in the lipid fraction of 572 Italian olive oils depending on geographic areas (South, North, and Sardinia). We use the olive oil data to com- pute confidence intervals for the bivariate variables. We consider only six fatty acids: palmitic, palmitoleic, stearic, oleic, linoleic, and linolenic. Oleic and palmitoleic fatty acids are the types of the monounsaturated fats, palmitic and stearic fatty acids are the types of the saturated fats, and linoleic and linolenic fatty acids are the types of the polyunsaturated fats. The scatterplot matrix contains the pairwise scatter plots of six fatty acids as shown in Figure 4.1, and we observe that there are some linear and nonlinear relations between fatty acids. Table 4.4 is the summary statistics of six fatty acids and we observe that the olive oils contain higher percentage of oleic acid and slightly higher palmitic and linoleic acids content. But the olive oils 52 contain low or no linolenic acid.

Table 4.4 Summary statistics of fatty acids

Monounsaturated Saturated Polyunsaturated fats oleic palmitoleic palmitic stearic linoleic linolenic min. 6300 15 610 152 448 0 median 7302 110 1201 223 1030 26 mean 7312 126 1232 228 980.5 31.89 max. 8410 280 1753 375 1470 74 53

Figure 4.1 Scatterplot matrix of pairwise association of six fatty acids 54 To analyze the pairwise relations between oleic and palmitoleic fatty acids, palmitic and stearic fatty acids, and linoleic and linolenic fatty acids. In statistical hypothesis testing, there exists the significant associations between two fatty acids of the monounsaturated fats, two fatty acids of the saturated fats, and two fatty acids of the polyunsaturated fats at the both significant levels of 0.05 and 0.10. The corresponding p-values determine using function dcor.test with 999 replicates in R package energy. We use the confidence interval for distance correlation to test whether there is an association between the two fatty acids at a signicance level of 95% and 90%, as displayed in Table 4.5. There are the significant associations between oleic and palmitoleic fatty acids of the monounsaturated fats and also between linoleic and linolenic fatty acids of the polyunsaturated fats at the both 0.05 and 0.10 levels. We notice that the lower intervals of the saturated fats at 0.05 and 0.10 level are close to zero. There is no significant association between palmitic and stearic fatty acids of the saturated fats at 0.05 level, which means there is a weak relation between palmitic and stearic fatty acids.

Table 4.5 The confidence intervals for bias-corrected distance correlation of bivariate variables of monounsaturated fats, saturated fats, and polyunsaturated fats

95% confidence interval 90% confidence interval fats lower interval upper interval lower interval upper interval Monounsaturated 0.6360 0.7274 0.6442 0.7207 Saturated -0.000415 0.0398 0.00284 0.0366 Polyunsaturated 0.04356 0.0878 0.04724 0.0844 55

CHAPTER 5 LOCAL DISTANCE CORRELATION

In this Chapter, we propose a new method to compute the local distance correlation which is an extension of local Gaussian correlation. We will discuss the estimation of local distance correlation after local Gaussian correlation. In addition, we will consider the visualization of local distance correlation and compare it with the visualization of local Gaussian correlation in simulation studies and real examples.

5.1 Local Gaussian Correlation

A measure of local dependence based on localizing the correlation coefficient was introduced by Jones (1996). The localization version of correlation coefficient by taking correlation around

(x0, y0) is applied by using the weight function

−1 −1 K(b1 (x − x0), b2 (y − y0)) w0(x, y) = , b1b2

where K is a kernel of a bivariate density estimator and b1, b2 are the bandwidths. The localization of Pearson correlation ρ to a neighborhood of the point (x0, y0) is defined as

M0(x, y) γ0 = p , M0(x, x)M0(y, y)

where

E(w0(x, y)X)E(w0(x, y)Y ) M0(x, y) = E(w0(x, y)XY ) − , E(w0(x, y))

M0(x, x) and M0(y, y) are defined similarly. When w0(x, y) = 1 for all (x, y), then we get the correlation coefficient ρ. Measuring dependence between X and Y is a fundamental problem in statistics. Pearson cor- relation ρ works nicely to measure linear association of two univariate random variables but it is not able to capture nonlinear dependence structures in bivariate data. Tjøstheim and Hufthammer (2013) developed a local measure of dependence derived from a local correlation function. It is 56 based on approximating a bivariate density locally from a family of bivariate Gaussian densities using local likelihood. Local likelihood was introduced by Hjort and Jones (1996) and they give a possible connection between the parametric and nonparametric worlds. Unlike linear dependence measures, local Gaussian correlation describes nonlinear structure in dependence and does not have the bias problem as the conditional correlation. It is useful to search for a local dependence measure that characterizes dependence at a point. Berentsen and Tjøstheim (2014) considered vi- sualization of local dependence by plots of the local Gaussian correlation and adapted methodology of visualization of the local Gaussian correlation from Jones and Koch (2003).

5.1.1 Estimation of Local Gaussian Correlation

At each point, a Gaussian distribution is approximated; the correlation coefficient of the ap- proximated Gaussian distribution is taken as the local correlation. Locally in a neighborhood of each point z = (x, y), the Gaussian bivariate density is defined as

1 φ(v, θ(z)) = × p 2 2πσ1(z)σ2(z) 1 − ρ (z)   2 2  1 (v1 − µ1(z)) (v1 − µ1(z))(v2 − µ2(z)) (v2 − µ2(z)) exp − 2 2 − 2ρ(z) + 2 , 2(1 − ρ (z)) σ1(z) σ1(z)σ2(z) σ2(z) (5.1.1)

T T where v = (v1, v2) is the running variable in the Gaussian distribution, and θ(z) = [µ(z), Σ(z), ρ(z)]

T with µ(z) = [µ1(z), µ2(z)] is local mean vector, Σ(z) = [σij(z)] is local covariance matrix, and ρ(z) is local correlation. The population values of θ(z) are obtained by minimizing a local penalty function, given by

Z L = Kb(v − z)[φ(z, θ(z)) − log φ(v, θ(z))f(v)]dv,

−1 −1 −1 where Kb(v − z) = (b1b2) Kb1 (b1 (v1 − x)Kb2 (b2 (v2 − y) is a product kernel with bandwidth

b = (b1, b2), and L is the penalty function used by Hjort and Jones (1996) for density estimation 57 purposes. They argued that L can be interpreted as a locally weighted Kullback-Leibler criterion for measuring the distance between f(.) and φ(., θ(z)). Then, θ(z) could be chosen to minimize L, so that it would satisfy

Z ∂ Kb(v − z) {log φ(v, θ(z))}[f(v) − φ(v, θ(z))]dv = 0, j = 1, ..., 5. ∂θj

The corresponding estimates of θ(z) are obtained by maximizing the local log-likelihood function

n 1 X Z L(Z , ..., Z , θ(z)) = K (Z − z) log φ(Z , θ(z)) − K (v − z)φ(v, θ(z))dv. (5.1.2) 1 n n b i i b i

Let ∂ w(., θ) = {log φ(., θ)}, ∂θj

so that

n ∂L 1 X Z = K (Z − z)w (Z , θ(z)) − K (v − z)w (v, θ(z))φ(v, θ(z))dv, ∂θ n b i j i b j j i

which produces an estimate for local correlation ρ(z), estimates for local means µ(z), and estimates

for local variances Σ(z). As n → ∞ for fixed b, assuming that E{Kb(Zi − z)wj(Zi, θ(z))} < ∞, and using the law of large numbers, we have almost surely that

∂L Z → Kb(Zi − z)wj(v, θ(z))[f(v) − φ(v, θ(z))]dv. ∂θj

Berentsen and Tjøstheim (2014) used a bootstrap test assuming Zi are i.i.d random variables. However, they observed that the computational time increases, as for each bootstrap realization the local likelihood function had to be optimized numerically.

5.1.2 Choice of Bandwidth for Kernel Function

The bandwidth depends largely on the purpose of the user. If the user’s goal is to investigate the local dependence structure in the data, one can compute the local correlation for several band- 58 widths. It is a way to know the dependence structure on different scales of locality. Berentsen and Tjøstheim (2014) indicated that it would be better to have a data-driven choice of bandwidth like the bandwidth choice for density kernel estimation. Tjøstheim and Hufthammer (2013) discussed the choice of the bandwidth as a compromise between optimizing the bias reduction for a density estimate and the choice of the degree of the variance for a local correlation estimate. However, their bandwidth algorithm is not really satisfactory in a general situation, according to Berentsen and Tjøstheim (2014). Likelihood cross-validation to select appropriate bandwidth has been used by Berentsen and Tjøstheim as the optimizer of

−1 X −i CV (b) = n log φ(Zi, θ(Zi, θ (Zi)), i

−i where θ (Zi) is the leave-one-out estimate of θ(Zi), and φ is defined in (5.1.1).

5.1.3 Properties of Local Gaussian Correlation

The properties of local Gaussian correlation are the following:

1. Range: It can take values between −1 and 1, then it gives an interpretation to positive and negative local correlations in terms of the approximating Gaussian.

2. Independence: If X and Y are independent, then ρb(x, y) ≡ 0. For Gaussian variables only,

ρb(x, y) ≡ 0 implies independence.

3. In the Gaussian case, ρb(x, y) = ρ is constant.

4. Functional independence: If Y = f(X) ⇔ ρ(x, y) is +1 if f 0(x, y) > 0 and ρ(x, y) is −1 if f 0(x, y) < 0.

5. Symmetry: It is assumed that µ = E(X) = 0.

• Radial symmetry: ρb(−x) = ρb(x).

• Reflection symmetry: ρb(−x, y) = −ρb(x, y) and/or ρb(x, −y) = −ρb(x, y). 59

• Exchange symmetry: ρb(x, y) = ρb(y, x).

Symmetry properties of µ(x) and Σ(x) can conceivably be used to obtain more precise estimates, they increase the power of independence tests, according to Berentsen and Tjøstheim (2014).

5.1.4 Global Gaussian Correlation

Berentsen and Tjøstheim (2014) considered the global Gaussian correlation by aggregating local Gaussian correlation on subsets of R2 to get a global measure of dependence. The local Gaussian correlation can take negative and positive values. Thus for a nonlinear dependence struc- ture, Berentsen and Tjøstheim considered ρ2(x, y) to avoid the problem that the local correlation at different points canceled out. The global measure of dependence is

Z 1/2 2 1/2 2 τ = EF ρ (X,Y ) = ρ (x, y)dF (x, y) , (5.1.3)

where F (x, y) is the joint distribution function of X and Y . The properties of the global Gaussian correlation are:

1. Range: 0 ≤ τ ≤ 1.

2. Independence: If X and Y are independent, then τ = 0.

3. Functional dependence: If Y = f(X) ⇔ τ = 1.

4. Gaussian case: If X and Y are the joint Gaussian distribution with correlation coefficient ρ, then τ ≡ |ρ|.

The sample version of the global measure of dependence τ is defined as

Z 1/2 21/2 2 τn,b = EF n(ρn,b(X,Y )) = ρn,b(x, y)dFn(x, y) , (5.1.4)

1 Pn where Fn(x, y) = n i=1 I(Xi ≤ x, Yi ≤ y) with I(.) denoting the indicator function. The sample global measure of dependence that screens outliers outside some subset S of R2 is given 60 by R 2 1/2 ρn,b(x, y)IS(x, y)dFn(x, y) τn,b(S) = R , IS(x, y)dFn(x, y) R where the scaling IS(x, y)dFn(x, y) is done to sure 0 ≤ τn,b(S) ≤ 1. The asymptotic properties as follows:

d 1. θn,b(x, y) −→ θ(x, y), when b is fixed and n → ∞.

1/2 −1/2 d 2. (nb1b2) JbMb [θn,b − θ] −→ N(0,I), where I is the identity matrix of dimension 5 and

Z T Jb = Kb(v − z)w(v, θb(z))w (v, θb(z))φ(v, θb(z))dv Z (5.1.5) − Kb(v − z)∇w(v, θb(z))(f(v) − φ(v, θb(z)))dv,

where w(z, θ) = ∇ log φ(z, θ), and

Z 2 T Mb = b1b2 Kb (v − z)w(v, θb(z))w (v, θb(z))f(v)dv Z Z (5.1.6) T − b1b2 Kb(v − z)w(v, θb(z))f(v)dv Kb(v − z)w (v, θb(z))f(v)dv.

. R 3. The test statistic Tn,b = S f(ρn,b(x, y))dFn(x, y) which estimates the functional . R 2 T = S f(ρ(x, y))dF (x, y), depends only on f(x) = x and f(x) = x . a.s Therefore, Tn,b −→ T.

1/2 d R 2 RR  4. n (Tn,b − Tb) −→ N 0, Ab(x) dF (x) − Ab(x)Ab(y)dF (x)dF (y) , where

Z 5 0 X j Ab(x) = f (ρb(v)) ab(v)Kb(x − v)wj(x, θb(v))dF (v) + f(ρb(x))IS(x), S j=1

j −1 and ab are the elements in the fifth row of Jb where Jb is defined in (5.1.5).

To test the hypothesis

Ho : Xand Y are independent, vs H1 : X and Y are not independent. 61 Berentsen and Tjøstheim (2014) used the bootstrap method and permutation test since the asymp- totic theory for functionals of type Tn,b is not accurate unless n is very large. They developed the R package localgauss (2014) to compute the local likelihood estimates θn,b(x, y) including

ρn,b(x, y).

5.2 Local Distance Correlation

Distance correlation, a nonparametric approach to measure dependence between random vec- tors, is one of the interesting topics in the statistical community. An extended concept of distance correlation, local distance correlation, presented in this research will be able to capture a local measure of dependence for nonlinear complex and nonmonotone types of dependence in certain regions.

5.2.1 Estimation of Local Distance Correlation

In order to estimate local distance correlation, one can use empirical likelihood method, in- troduced by Owen (1990), which provides a way to find efficient estimates, construct confidence

intervals, and test hypotheses by a nonparametric approach. Let X1, ..., Xn and Y1, ..., Yn be i.i.d.

from the underlying distributions FX and FY with pk = FX,Y (xk, yk), k = 1, 2, ..., n, where Pn pk ≥ 0, and k=1 pk = 1. The fundamental concept of the empirical likelihood at (p1, ..., pn) is defined as n Y L(p1, ..., pn) = pk, k=1 and the empirical log-likelihood is

n X l(p1, ..., pn) = log pk k=1

Pn where (p1, ..., pn) is subject to p1 ≥ ... ≥ pn, k=1 pk = 1, k = 1, 2, ..., n. It is clear that the 1 maximum of the log-likelihood function is attained at pk = n . Kitamura, Tripathi, and Ahn (2004) used a kernel function to calculate local empirical log-likelihood to estimate a regression model. 62 For our purpose, we define positive weights as

−1 −1 K(b1 (x0 − X), b2 (y0 − Y )) wb(x0, y0) = , (5.2.1) b1b2

Kb(./b) where Kb(.) = b is a product kernel function, and b1, b2 represent the window size of the local neighborhood. The kernel weight is used to carry out the localization for observations close

to (x0, y0). We start an estimation procedure by using the weight function wb to obtain the log- likelihood as n X l(p1, ..., pn) = wb log pk. k=1 The nonparametric maximum likelihood method shares some properties with conventional para- metric likelihood when we apply it to the mean functions. Empirical likelihood is a useful method with linear statistical functional when we use the method of Lagrange multipliers for the optimiza- tion problem, but it becomes a major problem in a nonlinear functional. Jing, Yuan, and Zhou (2009) extended empirical likelihood, called jackknife empirical likeli- hood, which can be used for nonlinear statistical functionals such as a U-statistic. The central idea of the jackknife empirical likelihood method is to use the jackknife pseudo-sample, which is de- fined by Quenouille (1956) as a sample of asymptotically independent observations; the jackknife estimator for parameter of interest becomes the sample mean of jackknife pseudo-samples. The empirical likelihood method can be easily applied for the mean of jackknife pseudo-samples, since empirical likelihood can be applied to a sample mean. In Chapter 4, we showed that we are able use directly the method of jackknife empirical likelihood for a U-statistic on distance correlation and proved that Wilks theorem holds. Our approach is to first rely on the local estimator of distance correlation via a local version of the empirical likelihood on the mean of jackknife pseudo-samples.

2 In Chapter 4, we showed that the squared distance covariance Un(X,Y ) is a U-statistic by applying the jackknife invariance of Theorem 4.1.3. That is,

n 2 X 2(−k) n ·Un(X,Y ) = Un−1 (X,Y ), (5.2.2) k=1 63 where

n 1 X 2 X a..b.. U 2(X,Y ) = a b − a b + . (5.2.3) n n(n − 3) ij ij n(n − 2)(n − 3) i. i. n(n − 1)(n − 2)(n − 3) i6=j i=1

An unbiased estimator of the squared distance covariance after removing observations Xk and Yk is defined as

n 2(−k) 1 X 2 X (−k) (−k) U (X,Y ) = a b − a b n−1 (n − 1)(n − 4) ij ij (n − 1)(n − 3)(n − 4) i. i. i6=j,i6=k,j6=k i=1 a(−k)b(−k) + .. .. . (n − 1)(n − 2)(n − 3)(n − 4) (5.2.4)

2 2 The unbiased estimators of the squared distance variances Un(X) and Un(Y ) also have the corre- sponding jackknife invariance property

n 2 X 2(−k) n ·Un(X) = Un−1 (X), k=1 and n 2 X 2(−k) n ·Un(Y ) = Un−1 (Y ), k=1

2 2 where Un(X) and Un(Y ) are defined as

n 1 X 2 X a..a.. U 2(X) = a a − a a + , (5.2.5) n n(n − 3) ij ij n(n − 2)(n − 3) i. i. n(n − 1)(n − 2)(n − 3) i6=j i=1

n 1 X 2 X b..b.. U 2(Y ) = b b − b b + . (5.2.6) n n(n − 3) ij ij n(n − 2)(n − 3) i. i. n(n − 1)(n − 2)(n − 3) i6=j i=1 64

An unbiased estimator of the squared distance variance after removing observation Xk is

n 2(−k) 1 X 2 X (−k) (−k) U (X) = a a − a a n−1 (n − 1)(n − 4) ij ij (n − 1)(n − 3)(n − 4) i. i. i6=j,i6=k,j6=k i=1 (5.2.7) a(−k)a(−k) + .. .. . (n − 1)(n − 2)(n − 3)(n − 4)

An unbiased estimator of the squared distance variance after removing observation Yk is

n 2(−k) 1 X 2 X (−k) (−k) U (Y ) = b b − b b n−1 (n − 1)(n − 4) ij ij (n − 1)(n − 3)(n − 4) i. i. i6=j,i6=k,j6=k i=1 (5.2.8) b(−k)b(−k) + .. .. . (n − 1)(n − 2)(n − 3)(n − 4)

∗∗ 2 ∗∗ Since Rn is defined by the standardized version of Un, Rn is a bias-corrected distance correlation.

∗∗ We constructed jackknife pseudo-values of sample distance correlation Rn based on sample size n as

∗∗ ∗∗(−k) Zk = nRn − (n − 1)Rn−1 , k = 1, ..., n,

where

 2  √ Un(X,Y ) , U 2(X)U 2(Y ) > 0;  2 2 n n ∗∗ Un(X)Un(Y ) Rn = (5.2.9)  2 2  0, Un(X)Un(Y ) = 0,

2 2 2 with the unbiased squared sample covariance Un(X,Y ) as defined in (5.2.3), Un(X) and Un(Y ) as ∗∗(−k) defined respectively in (5.2.5) and (5.2.6). Rn−1 is the bias-corrected distance correlation after removing the observations Xk and Yk, defined as

 2(−k) Un−1 (X,Y ) 2(−k) 2(−k)  q , U (X)U (Y ) > 0; ∗∗(−k)  2(−k) 2(−k) n−1 n−1 Un−1 (X)Un−1 (Y ) Rn−1 = (5.2.10)  2(−k) 2(−k)  0, Un−1 (X)Un−1 (Y ) = 0,

with the unbiased squared sample covariance of X and Y when we delete the observations Xk and

2(−k) 2(−k) 2(−k) Yk, Un−1 (X,Y ), defined in (5.2.4), Un−1 (X) and Un−1 (Y ) defined respectively in (5.2.7) and 65 (5.2.8). The jackknife estimator of R2 is the average of the pseudo-values

n 1 X R2 = Z , n,jack n k k=1

2 ∗∗ 1 Pn ∗∗(−k) and one can verify that the condition Rn,jack = Rn is equivalent to n k=1 Rn−1 . We further have n 1 X ∗∗(−k) R2 − R∗∗ (n − 1)[R∗∗ − R ]. (5.2.11) n,jack n u n n n−1 k=1 See Remark 4.1.19 in Section 4.1.2. Applying the idea of local empirical likelihood approach to the jackknife pseudo-sample, we concentrate the jackknife empirical likelihood locally for R2 to be defined as follows by using weight function wb in

( n n n ) 2 X X X 2 l(R ) = max wb log pk : p1 ≥ ... ≥ pn, pk = 1, pkZk = R , (5.2.12) k=1 k=1 k=1

where wb is defined in (5.2.1). It assigns the largest weights to data near point (x0, y0). To esti- mate local distance correlation, we maximize the profile likelihood l(R2) only on the observations

within a certain window around point (x0, y0). The mean functional for this window is

n 2 X R = pkZk, k=1

Pn which satisfies that k=1 pk = 1 and pk ≥ 0, for k = 1, ..., n. We use Lagrange multipliers where the Lagrangian is a way to solve a constrained optimization problem (5.2.12). That is,

n n n 2 X X 2 X L(R ) = wb log pk − Nλ pk(Zk − R ) + γ( pk − 1), (5.2.13) k=1 k=1 k=1

where the function L is called the Lagrangian; λ ∈ Rq are Lagrange multipliers for the second set of constraints, γ ∈ R is Lagrange multiplier for the third set of constraints, and N is summation of 66 weights. Finding the root of ∂L = 0, k = 1, ..., n, we further have ∂pk

∂L wb 2 = − Nλ(Zk − R ) + γ = 0. ∂pk pk

Pn 2 Pn It is easily verified that the solution to (5.2.13), with the constraint k=1 pkZk = R and k=1 pk = Pn ∂L 1, is pk = γ + N = 0. Hence, it can be represented in the form γ = −N. We have that k=1 ∂pk

∂L wb 2 = − Nλ(Zk − R ) − N = 0. ∂pk pk

We obtain the optimal pk as

1 wb pk = 2 , (5.2.14) N 1 + λ(Zk − R )

where λ is the solution to

n 2 X wb(Zk − R ) f(λ) = 2 = 0. (5.2.15) 1 + λ(Zk − R ) k=1

Using (5.2.14) and the solution λ from (5.2.15), the local jackknife empirical likelihood at R2 is defined as

n n 2 X X wb l(R ) = wb log pk = wb log 2 N{1 + λ(Zk − R )} k=1 k=1 n n (5.2.16) X wb X = w log − w log{1 + λ(Z − R2)}. b N b k k=1 k=1

The estimator of R2 is defined by the maximized profile likelihood l(R2)

2 2 Rc = argR2 max l(R ). (5.2.17)

Thus the local estimator of distance correlation can be viewed as a maximum local empirical 67 likelihood estimator based on the jackknife pseudo-sample.

5.2.2 Choice of Bandwidth for Kernel Function

Let X1,X2, ..., Xn be i.i.d observations and f(x) be the true probability density function of the sampled population. The kernel density estimate of f(x) with given bandwidth b is

n   1 X x − Xk fˆ(x) = K , (5.2.18) b nb b k=1

where K is a kernel function that satisfies:

1. K(u) > 0 and R K(u)du = 1.

2. R uK(u)du = 0.

3. 0 < R u2K(u)du < ∞.

The kernel function’s shape and width is determined by choosing bandwidth b, and here we are using a Gaussian kernel function, which is defined as

 u2  K(u) = (2π)−1/2 exp − . 2

An important problem for estimating local distance correlation Rc2 is the choice of bandwidth b, where b is a window taken in order to determine how much of the data within this window are

used to estimate each Rc2. Indeed, a large bandwidth gives more bias but less variable estimates unlike a small bandwidth which gives less bias but more variable estimates. To evaluate bandwidth selection performance, a commonly used error criteria is the mean integrated squared error (MISE). ˆ A discrepancy measure between f(x) and fb(x) at a point is the mean squared error, which can be written as ˆ 2 MSEb = E(fb(x) − f(x)) , 68 and the function expressed above can be stated in terms of the squared bias and variance as

ˆ ˆ 2 MSEb = V ar(fb(x)) + (E(fb(x)) − f(x)) . (5.2.19)

ˆ We consider an error criterion which is the average of the distance between the functions fb(x) and f(x) when we take the integral over the real line, defined by

Z  ˆ 2 MISEb = E (fb(x) − f(x)) dx . (5.2.20)

By Fubini’s theorem, MISEb is an equivalent to the following in terms of the squared bias and variance: Z Z ˆ ˆ 2 MISEb = V ar(fb(x))dx + (E(fb(x)) − f(x)) dx. (5.2.21)

ˆ If we take the expectation of fb(x), we further have

n  !    1 X x − Xk x − X E(fˆ(x)) = E K = E b−1K = E(K (x − X)), (5.2.22) b nb b b b k=1

R ˆ using the fact that E(g(x)) = g(x)f(x)dx and the definition of convolution. Thus E(fb(x)) can be written as Z ˆ E(fb(x)) = (Kb ∗ f)(x) = Kb(x − y)f(y)dy. (5.2.23)

The variance can be written as

1 V ar(fˆ(x)) = ((K2 ∗ f)(x) − (K ∗ f)2(x)), (5.2.24) b n b b

and the bias is ˆ E(fb(x)) − f(x) = (Kb ∗ f)(x) − f(x). (5.2.25)

If we combine variance in (5.2.24) and the square of bias in (5.2.25), then MISEb in (5.2.21) can 69 be written as

1 Z Z MISE = ((K2 ∗ f)(x) − (K ∗ f)2(x))dx + ((K ∗ f)(x) − f(x))2dx. (5.2.26) b n b b b

We can write MISEb as in Wand and Jones (1995), where after some straightforward manipula- tions, we have

1 Z 1 Z Z Z MISE = K2(x)dx + (K ∗ f)2(x)dx − 2 (K ∗ f)(x)f(x)dx + f 2(x)dx. b nb 1 − n−1 b b (5.2.27) We first consider the estimation of f(x) when z = (x − y)/b and the Jacobian of z is b; therefore, ˆ E(fb(x)) in (5.2.23) can be written as

Z ˆ E(fb(x)) = K(z)f(x − bz)dz. (5.2.28)

We use a second order Taylor series about x for f(x − bz), which is

1 00 f(x − bz) = f(x) − bzf 0(x) + b2z2f (x) + o(b2), (5.2.29) 2

and this leads to

Z 1 00 E(fˆ(x)) = K(z)(f(x) − bzf 0(x) + b2z2f (x) + o(b2))dz. (5.2.30) b 2

When R K(z)dz = 1, R zK(z)dz = 0, and R z2K(z)dz < ∞, we obtain

1 00 Z E(fˆ(x)) = f(x) + b2f (x) z2K(z)dz + o(b2). (5.2.31) b 2

R 2 We denote µ2 = z K(z)dz and the bias expression is

1 00 E(fˆ(x)) − f(x) = b2µ f (x) + o(b2). (5.2.32) b 2 2 70 The variance expression is

1 Z 1 Z V ar(fˆ(x)) = K2(z)f(x − bz)dz − K(z)f(x − bz)dz b nb n 1 Z 1 = K2(z)(f(x) + o(1))dz − (f(x) + o(1))2 (5.2.33) nb n 1 Z  1  = K2(z)dzf(x) + o . nb nb

We combine the variance in (5.2.33) and the square of bias in (5.2.32) to get

  1 1 00 1 MISE = R(K)f(x) + b4µ2(f (x))2 + o + b4 , (5.2.34) b nb 4 2 nb

where R(K) = R K2(z)dz. If we integrate the expression (5.2.34) where we take integral of probability density f(x) with respect to x, which is 1, and where we take integral of squared fˆ00 (x) denoted by R(f 00 (x)), then we have

  1 1 00 1 MISE = R(K) + b4µ2R(f (x)) + o + b4 b nb 4 2 nb (5.2.35)  1  = AMISE + o + b4 , b nb

where AMISEb is an asymptotic mean integrated squared error. The variance-bias trade-off is illustrated by AMISE with the terms of the squared bias and the variance. Note that the only unknown parameter in (5.2.35) is R(f 00 (x)). For a given random sample, we select a bandwidth which can be reduced to the optimization problem of finding b for a probability density f and kernel function K. We can minimize AMISE as defined in (5.2.35) as (R(K))1/5 bAMISE = 2 00 1/5 . (5.2.36) (nµ2R(f ))

Unfortunately, bAMISE depends on a probability density f which is unknown and must be es- timated. We discuss the most common bandwidth selections of the kernel density estimator as normal reference, unbiased cross-validation, and biased cross-validation. 71 1. Normal reference bandwidth selection A simple solution to estimate R(f 00 ), was suggested by Scott (1992), when we assume that f is the density of N(0, σ2) and we obtain

00 3 R(f ) = . (5.2.37) 8π1/2σ5

If we plug R(f 00 ) from (5.2.37) in (5.2.36), then the optimal value of bandwidth is

8π1/2R(K)1/5 bAMISE = 2 σ. (5.2.38) 3nµ2

Since a normal scale bandwidth selector is obtained from (5.2.38) by replacing σ with σ,ˆ we further have 8π1/2R(K)1/5 bAMISE = 2 σ,ˆ (5.2.39) 3nµ2

where the choice of σˆ here is smaller than the sample s and the sample in- terquartile range IQR of the standard normal density divided by 1.34. When we combine Silver-

−1/2 man’s (1986) rule-of-thumb with a normal kernel, where µ2 = 1 and R(K) = π /2, we have

41/5 b = n−1/5σˆ ≈ 1.06n−1/5σ,ˆ (5.2.40) nrd 3 where σˆ = min (s, IQR/1.34). The function bw.nrd in R implements the optimal bandwidth for the normal reference. 2. Unbiased cross-validation We consider the most popular method for cross-validation bandwidth selections to be Rudemo’s (1982) and Bowman’s (1984). This method is based on integrated squared error of the kernel ˆ function fb which is Z 2  ˆ  ISEb = fb(x) − f(x) dx, (5.2.41) 72 ˆ 1 Pn x−Xk where f(x) = nb k=1 K( b ). The integrated squared error can be expanded as

Z Z Z ˆ 2 ˆ 2 ISEb = (fb(x)) dx − 2 fb(x)f(x)dx + (f(x)) dx.

Since the first two terms are dependent on b and the third term is not a function of b, we can omit

it. The optimization of ISEb is

Z Z ˆ 2 ˆ ucvb = (fb(x)) dx − 2 fb(x)f(x)dx. (5.2.42)

Expression (5.2.42) is referred to as least squared cross-validation criterion. In order to find the optimal bandwidth b, Equation (5.2.42) has to be constructed from the data, then minimized with ˆ−i respect to b. Let fb (xi) be the leave-one-out density estimate

  1 X xi − Xk fˆ (x ) = K . (5.2.43) −i i (n − 1)b b i6=k

Rudemo (1982) pointed out that the second integral in (5.2.42) can be written as

 ! 1 X xi − Xk E(fˆ−i(x )) = E K b i (n − 1)b b i6=k Z n   ! 1 X x − Xk = E K f(x)dx (5.2.44) nb b k=1 Z  ˆ = E fb(x)f(x)dx .

ˆ−i When we substitute the second term in (5.2.42) with E(fb (xi)) in (5.2.44), we have the following function of the least squared cross-validation

n Z 2 X ucv = (fˆ(x))2dx − fˆ−i(x ). (5.2.45) b b n b i i=1

Scott and Terrell (1987) referred to this method as an unbiased cross-validation criterion because 73 its expectation is

 Z  Z 2 2 E(ucvb) = E ISEb − (f(x)) dx = MISEb − (f(x)) dx, (5.2.46)

where the mean integrated squared error can be written as

 1 1  MISE = AMISE + O + . b b n b5

The first term is the asymptotic mean integrated squared error given by

4 1 b Z 00 AMISE = R(K) + µ2 (f (x))2dx, (5.2.47) b nb 4 2 b

R 2 ˆ00 with µ2 = x K(x)dx, which is the second moment of the kernel K, and f (x) which is the second derivative of the kernel density. The aim of this method is to estimate b by minimizing

ucvb.

The unbiased cross-validation presents several local minima, and to select bucv we take the value that corresponds to the largest local minimum. Park and Marron (1990) found that unbiased cross-validation is the most poorly estimated density compared to other bandwidth selectors be- cause it presents a high amount of sampling variability. There is a function bw.ucv in R to find the

optimal value bucv for unbiased cross-validation. 3. Biased cross-validation Scott and Terrell (1987) introduced another method for cross-validation bandwidth selection called biased cross-validation, which is based on an asymptotic mean integrated squared error. This method is similar to unbiased cross-validation but it has a lower amount of sampling variability. Scott and Terrell improved the estimate of R (f 00 (x))2dx as

Z 00 Z 00 1 (f (x))2dx ≈ (fˆ (x))2dx − R(K), (5.2.48) b nb5 74 where fˆ00 (x) is the second derivative of the kernel density and the bias estimator is subtracted in

(5.2.48). Therefore, substituting (5.2.48) in AMISEb (5.2.47) gives a biased cross validation

4   1 b Z 00 1 bcv = R(K) + µ2 (fˆ (x))2dx − R(K) , (5.2.49) b nb 4 2 b nb5

2 where µ is the second moment of the kernel function K. To estimate bbcv, we have to mini- mize biased cross-validation. Note that the asymptotic variance of the bandwidth in biased cross- validation is lower than in unbiased cross-validation; hence, biased cross-validation tends to over- smooth a density. The function bw.bcv in R computes the biased cross-validation for bandwidth selection.

Empirical Example

We concentrate on a simulation of a mixture of normal distributions and compare three different bandwidth selections to determine which one is appropriate for the data. First, we illustrate a kernel density estimate when we generate a sample size of 1000 from a mixture of normal distributions, since we want to increase the order of estimation difficulty, as

1 1 f(x, y) = N(µ , Σ ) + N(−µ , Σ ), 2 1 1 2 1 2

      2 2 0 1 0       where µ1 =   , Σ1 =   , and Σ2 =   . 2 0 2 0 1 Contour plots of the true density and kernel estimate functions for three different bandwidth se- lections which are discussed above are presented in Figure 5.1. The purpose of looking at contour plots is to choose the bandwidth selection via minimizing the distance between the true density and estimated density. Therefore, we observe the normal reference results in a smoother estimate. 75

Figure 5.1 Contour plots of the true density and kernel estimate functions 76 5.2.3 Properties of Local Distance Correlation

In this section, we will discuss the properties of local distance correlation. While properties of distance correlation considered in Chapter 3, the properties of local distance correlation should remain unchanged. Here are the properties of the local distance correlation when p = q = 1:

2 1. Local distance correlation Rcb (X,Y ) is defined for any pair of random variables X and Y with finite expected values in each neighborhood of point.

2 2. The bias-corrected estimation of local distance correlation Rcb can be negative in lower tail, so we do not take the square root.

2 3. Rcb (X,Y ) = 0 if and only if X and Y are independent.

4. If Y is linear transformation of X, that is, Y = cX + a, then b2 = cb1, where b1, b2 are bandwidth values.

5. Symmetry:

2 2 • Exchange symmetry: Rcb (X,Y ) = Rcb (Y,X).

• Reflection symmetry:

2 2 Rcb (−X,Y ) = Rcb (X,Y ),

2 2 Rcb (X, −Y ) = Rcb (X,Y ).

• Radial symmetry:

2 2 Rcb (−X, −Y ) = Rcb (X,Y ).

2 • Rotations: Rcb (X,Y ) is a rotation invariant.

2 We know that distance correlation is between 0 and 1, but the U-statistic Un(X,Y ) can be negative

2 in the lower tail so we cannot take the square root of the U-statistic. Therefore, Rcb can take negative values so we cannot take the square root as well. Local distance correlation equals zero 77 in the neighborhood of each point if and only if X and Y are independent in that neighborhood.

For example, Y = 2X + 1 is a linear transformation of X. If X1, ..., X500 are sampled from a standard uniform distribution, the bandwidth for X is b1 = 0.087, and the bandwidth for Y is b2 = 2 × b1 = 0.174. To check the properties of symmetry, we illustrate them with an example. We consider the relationship between X and Y for a sample of size n = 500, where X was generated from 0 to 4 in equal steps and Y = sin(X) + , where  has a standard uniform distribution. A scatter plot in Figure 5.2 displays the relationship between two variables X and Y.

Figure 5.2 Scatter plot of X and Y

Figure 5.3 shows that the local distance correlation estimates remain the same under exchange symmetry. The local distance correlation has reflection and radial symmetry, as illustrated in Figure 5.4 and Figure 5.5.

Figure 5.3 Illustration of exchange symmetry

Figure 5.4 Illustration of reflection symmetry

Figure 5.5 Illustration of radial symmetry

The rotation matrix in the bivariate case is

\[
R = \begin{pmatrix}\cos\theta & -\sin\theta\\ \sin\theta & \cos\theta\end{pmatrix},
\]
and $R\,(X, Y)^{\top}$ gives the data rotated counter-clockwise through an angle $\theta$. Figure 5.6 shows the data rotated by 90° and 180°. Figure 5.7 shows that the local distance correlation estimates remain the same when we rotate the data by angles θ = π and θ = π/2.
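A minimal sketch of this rotation in R is given below; the angle and the data used are placeholders for illustration only, and the rotated data can then be passed to the same local correlation routines.

rotate_xy <- function(x, y, theta) {
  # Rotation matrix for a counter-clockwise rotation through angle theta
  R <- matrix(c(cos(theta), sin(theta),
                -sin(theta), cos(theta)), nrow = 2)
  rotated <- t(R %*% rbind(x, y))
  list(x = rotated[, 1], y = rotated[, 2])
}

# Example: rotate the sine-curve data by 90 degrees (theta = pi / 2)
x <- seq(0, 4, length = 500)
y <- sin(x) + runif(500)
xy90 <- rotate_xy(x, y, pi / 2)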

Figure 5.6 Scatter plots of the data rotated by 90° and 180°

Figure 5.7 Illustration of rotation symmetry

5.3 Simulation Study

The simulation study is conducted over different nonlinear dependence structures with n = 1000 observations. We consider six cases of nonlinear dependence structure to explore the local distance correlation between two variables X and Y, and we compare the visualization of local distance correlation with local Gaussian correlation for each of them. Figure 5.8 shows the scatter plot of X and Y for each of the six dependence structures. In addition, Figure 5.9 displays contour plots of a density estimate over the (X, Y) plane, which indicate the regions where the variables are related. The nonlinear relation between X and Y is evident from the scatter plots and contour plots. The Pearson correlation sometimes gives a negative or zero value, even though the dependence is visible in the plots. Since the choice of bandwidth for local Gaussian correlation depends on the user's goals, bandwidth selection is a challenge for users. In this simulation, we use the normal reference rule, discussed in Section 5.2.2, to select the bandwidths of X and Y for both local distance correlation and local Gaussian correlation.

Figure 5.8 Scatter plots of different bivariate dependence structures

Figure 5.9 Contour plots of different bivariate dependence structures

1. Biquadratic curve

We generate X from a uniform distribution on (−1, 1), and
\[
Y_i = 4\Big(X_i^2 - \frac{1}{2}\Big)^2 + \frac{\epsilon_i}{10},
\]
where the $\epsilon_i$ are i.i.d. uniform on (−1, 1) and independent of X. The scatter plot of the biquadratic curve in Figure 5.8(1) and the contour plot in Figure 5.9(1) show that most observations are concentrated in the middle.

The visualization of local Gaussian correlation and local distance correlation is shown in Figure 5.10 with bandwidths b1 = 0.15 and b2 = 0.11. The plot of local Gaussian correlation indicates a pattern of positive and negative dependence, while the plot of local distance correlation shows local dependence in the center of the biquadratic curve.

Figure 5.10 The visualization of local Gaussian correlation and local distance correlation

2. Sine curve

We generate X from 0 to 4 in equal steps and $Y_i = \sin(X_i) + \epsilon_i$, where the $\epsilon_i$ are i.i.d. standard uniform and independent of X. Figure 5.8(2) displays the scatter plot, and Figure 5.9(2) presents the contour plot of the sine curve, showing that the variables are related to each other.

The visualization of local Gaussian correlation with bandwidth values b1 = 0.31 and b2 = 0.16, shown in Figure 5.11, starts with positive dependence on the left side and turns into negative dependence. The visualization of local distance correlation in Figure 5.11, with the same bandwidth values, shows the local dependence between the variables.

Figure 5.11 The visualization of local Gaussian correlation and local distance correlation

3. Circle problem

The circle problem is simulated as follows:

\[
X_i = \sin(\pi U_i) + \frac{\epsilon_i}{8}, \qquad Y_i = \cos(\pi U_i) + \frac{\epsilon_i'}{8},
\]

where the $U_i$ are uniformly distributed on (−1, 1). From the scatter plot in Figure 5.8(3) and the contour plot in Figure 5.9(3), X and Y are clearly related. The normal reference rule gives the bandwidth values b1 = b2 = 0.19. The visualization of local Gaussian correlation in Figure 5.12 shows positive and negative dependence between the variables, even though the scatter and contour plots exhibit a nonlinear relationship between the observations. Figure 5.12 also displays the visualization of local distance correlation, which accurately shows the local dependence where it exists and independence otherwise.

Figure 5.12 The visualization of local Gaussian correlation and local distance correlation

4. The X curve

The X curve data is taken from Newton (2009), which provides a good example of a nonlinear relationship. We generate X from −1 to 1 in equal steps and
\[
Y_i = U_i\Big(X_i^2 + \frac{\epsilon_i}{2}\Big),
\]
where $U_i$ is a random sample of −1 or 1, and $\epsilon_i$ has the standard uniform distribution and is independent of X and U. See Figure 5.8(4) for the scatter plot of the X curve and Figure 5.9(4) for the contour plot, which show how the variables are related.

The normal reference rule gives the bandwidth values b1 = 0.15 and b2 = 0.18. The visualization of local Gaussian correlation in Figure 5.13 shows positive and negative dependence in the tails. In contrast, the visualization of local distance correlation in Figure 5.13 shows the local dependence between the two variables where it exists.

Figure 5.13 The visualization of local Gaussian correlation and local distance correlation

5. Bivariate mixed model

We generate a 50 percent mixture of bivariate normal components as

\[
f(x, y) = \frac{1}{2}N(\mu_1, \Sigma_1) + \frac{1}{2}N(-\mu_1, \Sigma_2),
\]

      2 2 0 1.5 0       where µ1 =   , Σ1 =   , and Σ2 =   . 2 0 2 0 1.5 The scatter plot in Figure 5.8(5) and the contour plot in Figure 5.9(5) appear to be bimodel. The Pearson correlation and bias-corrected distance correlation show that the two variables are dependent. The visualization of local Gaussian correlation in Figure 5.14 does not show local de- pendence clearly while the visualization of local distance correlation in Figure 5.14 shows clearly where the local dependence exists in the two regions. The normal reference rule results in the 89 bandwidths values as b1 = 0.64 and b2 = 0.63, but these bandwidth values do not smooth out the plot of the local Gaussian correlation.

Figure 5.14 The visualization of local Gaussian correlation and local distance correlation

6. Fan shape

We consider an interesting example of a fan shape, where we generate X from 0 to 4 in equal steps and $Y_i = X_i\,\epsilon_i$, with the $\epsilon_i$ i.i.d. normal with mean 2 and standard deviation 2 and independent of X. The fan shape is visible in the scatter plot in Figure 5.8(6) and the contour plot in Figure 5.9(6). The plots show that the two variables are closely related on the leftmost side of the fan shape, while there is less relationship on the right side. The visualization of local Gaussian correlation and local distance correlation is shown in Figure 5.15 with bandwidths b1 = 0.31 and b2 = 1.22. While the plot of local Gaussian correlation shows positive dependence on the leftmost side and negative dependence in some regions, the visualization of local distance correlation shows that the two variables are dependent on the leftmost side and less dependent on the right side.

Figure 5.15 The visualization of local Gaussian correlation and local distance correlation

5.4 Real Examples

In this section, we consider applications of local distance correlation and compare it with local Gaussian correlation. The first example is the aircraft data, which shows how local Gaussian correlation and local distance correlation perform on nonlinear data. The second example, the Wage dataset, has a larger number of observations. The third example is the PRIM7 dataset, which has a nonlinear relationship. The fourth example is the olive oils data discussed in Section 4.3.

5.4.1 Example 1: Aircraft

We consider the aircraft data from Bowman and Azzalini (1997). Székely and Rizzo (2009) and Jones and Koch (2003) considered the logarithms of wing span (in meters) and maximum speed (in km/h) for 230 aircraft built in the third period, 1956 to 1984. The scatter plot in Figure 5.16 displays a nonlinear relation between wing span and speed. A contour plot of a nonparametric density estimate, with wing span and maximum speed on the axes, is also displayed in Figure 5.16.

Figure 5.16 Scatter and contour plots for aircraft dataset

Berentsen and Tjøstheim (2014) considered this example and used their likelihood cross-validation algorithm to find bandwidths. This algorithm gave the small bandwidths b1 = 0.21 and b2 = 0.19. Because small bandwidths have large variability, they increased the bandwidths to b1 = 0.25 and b2 = 0.30 in order to obtain a smoother version of the visualization of local Gaussian correlation. We use the normal reference bandwidth selection, which gives the bandwidth values b1 = 0.20 and b2 = 0.26. In Figure 5.17, we present the visualization of local Gaussian correlation, but it is not a smooth plot. The visualization of local distance correlation in Figure 5.17 shows that local dependence between wing span and maximum speed exists in two regions.
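A sketch of how the transformed data and the normal reference bandwidths could be obtained is given below; it assumes the aircraft data frame distributed with the sm package accompanying Bowman and Azzalini (1997), with columns Period, Span, and Speed, which is an assumption of this illustration rather than a statement of the original computation.

library(sm)                                    # assumed source of the aircraft data

dat    <- subset(aircraft, Period == 3)        # third period, 1956-1984
lspan  <- log(dat$Span)                        # log wing span
lspeed <- log(dat$Speed)                       # log maximum speed

# Normal reference bandwidths for the two variables
c(b1 = bw.nrd(lspan), b2 = bw.nrd(lspeed))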

5.4.2 Example 2: Wage

We consider another example, the Wage data in the R package ISLR (2017), which has 3,000 observations on male workers in the Mid-Atlantic region of the United States. We study the association between wages (in thousands of dollars) and employees' age.

Figure 5.17 The visualization of local Gaussian correlation and local distance correlation for the aircraft dataset

The smoothed scatter plot in Figure 5.18 shows that wages increase with age between 20 and 60 and then decrease after age 60. The bias-corrected squared distance correlation is 0.058, and the squared Pearson correlation is 0.047.
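These summary statistics could be reproduced along the following lines; the sketch assumes the Wage data frame from the ISLR package and the bcdcor function from the energy package for the bias-corrected distance correlation statistic.

library(ISLR)       # provides the Wage data frame
library(energy)     # provides bcdcor, the bias-corrected distance correlation statistic

age  <- Wage$age
wage <- Wage$wage

bcdcor(age, wage)   # bias-corrected (squared) distance correlation
cor(age, wage)^2    # squared Pearson correlation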

Figure 5.18 Smooth scatter plot for Wage dataset

The bandwidth values from the normal reference method used in this example are b1 = 2.47 and b2 = 0.0654. We present the visualization of local Gaussian correlation in Figure 5.19 and see that the plot does not give much information beyond distinguishing between positive and negative local dependence. We obtain a clearer graphical interpretation of the dependence between the variables using local distance correlation in Figure 5.19.

Figure 5.19 The visualization of local Gaussian correlation and local distance correlation for Wage dataset

5.4.3 Example 3: PRIM7

The PRIM7 dataset presented by Friedman and Tukey (1974) contains 7 variables with 500 observations each. The PRIM7 data come from a high-energy particle physics scattering experiment. We consider the scatter plot of two of the seven variables, X6 and X5, in Figure 5.20. From the plots, we see that the observations concentrate on the right side and exhibit nonlinear dependence. The bias-corrected squared distance correlation is 0.3049, and the Pearson correlation shows negative dependence between the variables, $\hat{\rho} = -0.435$.

The normal reference method is used to find the bandwidths b1 = 0.414 and b2 = 2.167. For this example, we examine whether the local Gaussian correlation can work for this type of observations. The visualization of local Gaussian correlation in Figure 5.21 shows that the plot is not smooth. In Figure 5.21, the visualization of local distance correlation clearly shows the strong local dependence on the right side.

Figure 5.20 Scatter and smooth scatter plots for PRIM7 dataset

Figure 5.21 The visualization of local Gaussian correlation and local distance correlation for PRIM7 dataset

5.4.4 Example 4: Olive Oils

In Chapter 4, we considered the olive oils example with 6 fatty acids measured on 572 observations. We divide the fatty acids into pairs as follows: oleic and palmitoleic among the monounsaturated fats, palmitic and stearic among the saturated fats, and linoleic and linolenic among the polyunsaturated fats. We first study the local relation between oleic and palmitoleic of the monounsaturated fats. The smooth scatter plot is displayed in Figure 5.22 and shows a linear relation between the two fatty acids. The bias-corrected squared distance correlation is 0.6838, and the Pearson correlation shows a strong negative correlation, $\hat{\rho} = -0.852$.
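As with the Wage example, these summary statistics could be computed as sketched below; the olive oil data frame used here (for instance, the olive data shipped with the tourr package, with columns oleic and palmitoleic) is an assumption of this sketch rather than the data source used in Chapter 4.

library(energy)     # bcdcor: bias-corrected distance correlation statistic
library(tourr)      # assumed source of the Italian olive oil data (572 rows)

x <- olive$oleic
y <- olive$palmitoleic

bcdcor(x, y)        # bias-corrected (squared) distance correlation
cor(x, y)           # Pearson correlation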

Figure 5.22 Smooth scatter plot for oleic and palmitoleic fatty acids

The normal reference rule gives the bandwidths b1 = 0.1562 and b2 = 1.208. In Figure 5.23, the visualization of local distance correlation clearly shows the linear local dependence, and the visualization of local Gaussian correlation shows a negative linear relation between the oleic and palmitoleic fatty acids. Therefore, both the local distance correlation and the local Gaussian correlation show a strong relation between the oleic and palmitoleic fatty acids.

Figure 5.23 The visualization of local Gaussian correlation and local distance correlation for oleic and palmitoleic fatty acids

We also study the local relation between the palmitic and stearic fatty acids of the saturated fats. The smooth scatter plot in Figure 5.24 shows a nonlinear relation between the two fatty acids. The bias-corrected squared distance correlation is 0.0198, and the Pearson correlation is negative, $\hat{\rho} = -0.1703$.

Figure 5.24 Smooth scatter plot for palmitic and stearic fatty acids

The normal reference rule gives b1 = 0.50, but we increase the bandwidth b2 from 0.098 to 0.50 to smooth out the plots. In Figure 5.25, the visualization of local distance correlation shows that local dependence occurs in small regions. The visualization of local Gaussian correlation in Figure 5.25 shows local independence even though there is clearly a relation between the variables.

Figure 5.25 The visualization of local Gaussian correlation and local distance correlation for palmitic and stearic fatty acids

To study the local relation between linoleic and linolenic of the polyunsaturated fats, we consider the smooth scatter plot of the two fatty acids in Figure 5.26. We observe no strong linear relation between the linoleic and linolenic fatty acids. The bias-corrected squared distance correlation is 0.066, and the Pearson correlation gives a low negative correlation, $\hat{\rho} = -0.057$.

Figure 5.26 Smooth scatter plot for linoleic and linolenic fatty acids

The normal reference bandwidths are b1 = 0.723 and b2 = 0.031. The visualization of local Gaussian correlation in Figure 5.27 shows negative and positive dependence but does not carry enough information about the local dependence. In Figure 5.27, the visualization of local distance correlation shows where the local relation between the two fatty acids exists.

Figure 5.27 The visualization of local Gaussian correlation and local distance correlation for linoleic and linolenic fatty acids

CHAPTER 6 SUMMARY AND FUTURE WORK

In this dissertation, we apply the jackknife empirical likelihood method for a U-statistic to construct a confidence interval for distance correlation. The unbiased version of distance covariance is defined as an inner product in the Hilbert space H_n of U-centered distance matrices, and it is a U-statistic. The normalized coefficient of the unbiased distance covariance is called the bias-corrected distance correlation. The jackknife pseudo-sample of distance correlation becomes a sample of asymptotically independent observations, so the jackknife empirical likelihood method can be applied to construct a confidence interval for the mean of the jackknife pseudo-sample of distance correlation. We prove that a Wilks' theorem for jackknife empirical likelihood still holds for distance correlation: minus two times the jackknife empirical log-likelihood ratio converges to a chi-squared distribution.

The simulation study reports the coverage probability and average length of the confidence interval for distance correlation based on jackknife empirical likelihood and the standard normal bootstrap method for the three nominal confidence levels of 90%, 95%, and 99%. We observe that the coverage probabilities of the jackknife empirical likelihood for distance correlation are quite accurate compared with the bootstrap method, and the average lengths of the two methods are close. The jackknife empirical likelihood confidence intervals for distance correlation are also illustrated with real data examples.

We show that the jackknife empirical likelihood method for a U-statistic can be computed locally to estimate and visualize the local distance correlation between two univariate random variables. The local distance correlation estimates are computed in small regions, which better describe the dependence structure. We use a Gaussian kernel estimator in the jackknife empirical log-likelihood to estimate distance correlation locally. The choice of bandwidth is important because a large bandwidth gives more biased but less variable estimates, unlike a small bandwidth, which gives less biased but more variable estimates. Three common bandwidth selection methods for the kernel function, the normal reference rule, unbiased cross-validation, and biased cross-validation, are discussed in detail. We consider the example of generating data from a mixture of normal distributions to help determine how to choose the bandwidth, and we examine the contour plots of the true density and the three kernel density estimates. We use the normal reference bandwidth selection in this dissertation because it minimizes the distance between the true density and its kernel estimate in the normal mixture example. Since many distributions can be approximated by normal mixture models, the normal reference rule may perform well as a default; the two cross-validation methods did not perform as well on the normal mixture example.

The estimate of local distance correlation can be negative in the lower tail, so we do not take the square root. The population local distance correlation equals zero in the neighborhood of each point if and only if the two univariate variables are independent in that neighborhood. When we translate, rotate, exchange, or reflect the data points, the local distance correlations do not change. Therefore, local distance correlation has the same properties as distance correlation.
In comparing the visualization of local distance correlation and local Gaussian correlation for six different nonlinear dependence structures, we observe that local distance correlation performs well in capturing local dependence. The four real data examples show that local distance correlation describes the local measure of dependence better than local Gaussian correlation. It is difficult to determine optimal bandwidth values for local Gaussian correlation; in our comparison we applied the normal reference rule for both methods.

We have implemented confidence intervals for distance correlation and local distance correlation via jackknife empirical likelihood in R, and the methods can also be implemented in other languages such as Python or Matlab.

Further research could compute the local distance correlation for multivariate data and improve the computational complexity of distance correlation for multivariate variables to handle large datasets. However, in high dimension we cannot reduce the $O(n^2)$ computational complexity of the statistics. Because for bivariate data we use a fast $O(n \log n)$ algorithm for the computation of bias-corrected distance correlation, we have been able to build a fast visualization tool for local distance correlation. Further research could also construct the confidence interval for distance correlation when the distance covariance is a V-statistic, as in the original definition of distance covariance given in Székely, Rizzo, and Bakirov (2007).

BIBLIOGRAPHY

Aspiras-Paler, M. (2015). On modern measures and tests of multivariate independence. Dissertation, Bowling Green State University.

Berentsen, G., B. Støve, D. Tjøstheim, and T. Nordbø (2014). Recognizing and visualizing copulas: An approach using local Gaussian approximation. Insurance: Mathematics and Economics, 57 90–103.

Berentsen, G. and D. Tjøstheim (2014). Recognizing and visualizing departures from independence in bivariate data using local Gaussian correlation. Statistics and Computing, 24(5) 785–801.

Bhuchongkul, S. K. (1964). A class of nonparametric tests for independence in bivariate populations. Annals of Mathematical Statistics, 35 138–149.

Bjerve, S. and K. Doksum (1993). Correlation curves: Measures of association as functions of covariate values. The Annals of Statistics, 21 890–902.

Blomqvist, N. (1950). On a measure of dependence between two random variables. Annals of Mathematical Statistics, 21 593–600.

Blum, J., J. Kiefer, and M. Rosenblatt (1961). Distribution free tests of independence based on the sample distribution function. Annals of Mathematical Statistics, 32 485–498.

Bowman, A. (1984). An alternative method of cross-validation for the smoothing of density estimates. Biometrika, 71 353–360.

Bowman, A. and A. Azzalini (1997). Applied smoothing techniques for data analysis: the kernel approach with S-Plus illustrations. Oxford: Oxford University Press.

Bravais, A. (1844). Analyse mathématique sur les probabilités des erreurs de situation d'un point. Impr. Royale.

Carpenter, J. and J. Bithell (2000). Bootstrap confidence intervals: When, which, what? A practical guide for medical statisticians. Statistics in Medicine, 19 1141–1164.

Csörgő, S. (1985). Testing for independence by the empirical characteristic function. Journal of Multivariate Analysis, 16 290–299.

Delicado, P. and M. Smrekar (2009). Measuring non-linear dependence for two random variables distributed along a curve. Statistics and Computing, 19 255–269.

Doksum, K., S. Blyth, E. Bradlow, X. Meng, and H. Zhao (1994). Correlation curves as local measures of variance explained by regression. Journal of the American Statistical Association, 89 571–582.

Efron, B. (1979). Bootstrap methods: Another look at the jackknife. Annals of Statistics, 7 1–26.

Feuerverger, A. (1993). A consistent test for bivariate dependence. International Statistical Review, 61 419–433.

Feuerverger, A. and R. Mureika (1977). The empirical characteristic function and its applications. The Annals of Statistics, 5 88–97.

Fisher, R. (1941). Statistical methods for research workers (8th ed.). Y. E. Stechert.

Forina, M. and E. Tiscornia (1982). Pattern recognition methods in the prediction of Italian olive oil origin by their fatty acid content. Annali di Chimica, 72 143–155.

Friedman, J. and J. Tukey (1974). A projection pursuit algorithm for exploratory data analysis. IEEE Transactions on Computers, C-23 881–890.

Galton, F. (1888). Co-relations and their measurement, chiefly from anthropometric data. Proceedings of the Royal Society of London, 45 135–145.

Galton, F. (1890). Kinship and correlation. North American Review, 150 419–431.

Gareth, J., D. Witten, T. Hastie, and R. Tibshirani (2017). ISLR: Data for an Introduction to Statistical Learning with Applications in R. R package version 1.2.

Gebelein, H. (1941). Das statistische Problem der Korrelation als Variations- und Eigenwertproblem und sein Zusammenhang mit der Ausgleichsrechnung. Z. Angew. Math. Mech., 21 364–379.

Geiser, P. and R. Randles (1997). A nonparametric test of independence between two vectors. Journal of American Statistical Association, 92 561–567.

Hjort, N. and M. Jones (1996). Locally parametric nonparametric density estimation. Annals of Statistics, 24(4) 1619–1647.

Hoeffding, W. (1948a). A class of statistics with asymptotically normal distribution. The Annals of Mathematical Statistics, 19(3) 293–325.

Hoeffding, W. (1948b). A nonparametric test of independence. The Annals of Mathematical Statistics, 19(4) 546–557.

Hotelling, H. and M. Pabst (1936). Rank correlation and tests of significance involving no assumption of normality. The Annals of Mathematical Statistics, 7 29–43.

Huo, X. and G. J. Székely (2016). Fast computing for distance covariance. Technometrics, 58 435–447.

Jing, B., J. Yuan, and W. Zhou (2009). Jackknife empirical likelihood. Journal of the American Statistical Association, 104 1224–1232.

Jones, C. and I. Koch (2003). Dependence maps: Local dependence in practice. Statistics and Computing, 13 241–255.

Jones, M. (1996). The local dependence function. Biometrika, 83 899–904.

Kendall, M. (1938). A new measure of rank correlation. Biometrika, 30 81–93.

Kitamura, Y., G. Tripathi, and H. Ahn (2004). Empirical likelihood based inference in conditional moment restriction models. Econometrica, 72 1667–1714.

Kruskal, W. (1958). Ordinal measures of dependence. Journal of American Statistical Association, 53 814–861.

Lenth, R. (1983). Some properties of U statistics. The American Statistician, 37 311–313.

Li, H. (2015). On nonsymmetric nonparametric measures of dependence. arXiv, 1–18.

Linfoot, E. H. (1957). An informational measure of correlation. Information and Control, 1 85–89.

Lyons, R. (2013). Distance covariance in metric spaces. Annals of Probability, 41 3284–3305.

Móri, T. and G. Székely (2019). Four simple axioms of dependence measures. Metrika, 82 1–16.

Newton, M. (2009). Introducing the discussion paper by Székely and Rizzo. The Annals of Applied Statistics, 3 1233–1235.

Owen, A. (1990). Empirical likelihood ratio confidence regions. The Annals of Statistics, 18 90–120.

Park, B. and J. Marron (1990). Comparison of data-driven bandwidth selectors. Journal of the American Statistical Association, 85 66–72.

Pearson, K. (1896). Mathematical contributions to the theory of evolution. III. Regression, heredity, and panmixia. Philosophical Transactions of the Royal Society A, 187 253–318.

Peng, H. and F. Tan (2018). Jackknife empirical likelihood goodness-of-fit tests for U-statistics based general estimating equations. Bernoulli, 24 449–464.

Quenouille, M. H. (1956). Notes on bias in estimation. Biometrika, 43 353–360.

R Core Team (2014). R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing.

Rényi, A. (1959). On measures of dependence. Acta Mathematica Academiae Scientiarum Hungaricae, 10 441–451.

Rizzo, M. and G. Székely (2013). energy: E-Statistics: Multivariate Inference via the Energy of Data. R package version 1.7.

Rudemo, M. (1982). Empirical choice of histograms and kernel density estimators. Scandinavian Journal of Statistics, 9 65–78.

Scott, D. (1992). Multivariate Density Estimation: Theory, Practice, and Visualization. New York: John Wiley and Sons.

Scott, D. and G. Terrell (1987). Biased and unbiased cross-validation in density estimation. Journal of the American Statistical Association, 82 1131–1146.

Sen, P. K. (1977). Some invariance principles relating to jackknifing and their role in sequential analysis. The Annals of Statistics, 5 316–329.

Serfling, R. J. (2009). Approximation Theorems of Mathematical Statistics. New York, NY: John Wiley and Sons.

Shannon, C. (1948). A mathematical theory of communication. Bell System Technical Journal, 27 379–423.

Shen, C., C. Priebe, and J. Vogelstein (2018). From distance correlation to multiscale graph correlation. Journal of the American Statistical Association, 1–22.

Silverman, B. (1986). Density Estimation for Statistics and Data Analysis. New York: Chapman and Hall/CRC.

Sinha, B. and H. Wieand (1977). Multivariate nonparametric tests for independence. Journal of Multivariate Analysis, 7(4) 572–583.

Spearman, C. (1904). The proof and measurement of association between two things. American Journal of Psychology, 15 72–101.

Székely, G. and M. Rizzo (2009). Brownian distance covariance. The Annals of Applied Statistics, 3 1236–1265.

Székely, G. and M. Rizzo (2013). The distance correlation t-test of independence in high dimension. Journal of Multivariate Analysis, 117 193–213.

Székely, G. J. and M. L. Rizzo (2014). Partial distance correlation with methods for dissimilarities. The Annals of Statistics, 42 2382–2412.

Székely, G. J., M. L. Rizzo, and N. K. Bakirov (2007). Measuring and testing dependence by correlation of distances. Annals of Statistics, 35 2769–2794.

Taskinen, S., A. Kankainen, and H. Oja (2003). Sign test of independence between two random vectors. Statistics and Probability Letters, 62 9–21.

Taskinen, S., H. Oja, and R. Randles (2005). Multivariate nonparametric tests of independence. Journal of the American Statistical Association, 100 916–925.

Tjøstheim, D. and K. Hufthammer (2013). Local Gaussian correlation: a new measure of depen- dence. Journal of Econometrics, 172(1) 33–48.

Wand, M. and M. Jones (1995). Kernel Smoothing. London: Chapman and Hall/CRC.

Wilks, S. (1935). On the independence of k sets of normally distributed statistical variables. Econometrica, 3 309–326.

Zhu, L., K. Xu, R. Li, and W. Zhong (2017). Projection correlation between two random vectors. Biometrika, 104 829–843.

APPENDIX A SELECTED R PROGRAMS

The following are used to simulate six different dependence structures:
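The snippets below fill the i-th columns of pre-allocated matrices x and y; a minimal setup, assuming n = 1000 and the six dependence structures of Section 5.3 (this setup is not shown in the original listing), would be:

# Assumed setup for the simulation snippets below
n <- 1000                     # sample size used in Section 5.3
x <- matrix(0, n, 6)          # one column per dependence structure
y <- matrix(0, n, 6)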

• Biquadratic curve

i <- 1
xx <- runif(n, -1, 1)
yy <- 4 * ((xx^2 - 1/2)^2 + runif(n, -1, 1) / 10)
x[, i] <- xx
y[, i] <- yy

• Sine Curve

i <- 2
xx <- seq(0, 4, length = n)
yy <- sin(xx) + runif(n, 0, 1)
x[, i] <- xx
y[, i] <- yy

• Circle Problem

i <- 3
u <- runif(n, -1, 1)
xx <- sin(u * pi) + rnorm(n) / 8
yy <- cos(u * pi) + rnorm(n) / 8
x[, i] <- xx
y[, i] <- yy

• The X curve

i <- 4
xx <- seq(-1, 1, length = n)
yy <- (xx^2 + runif(n) / 2) * sample(c(-1, 1), size = n, replace = TRUE)
x[, i] <- xx
y[, i] <- yy

• Bivariate Mixed Normal

i <- 5
mu <- c(2, 2)
Sigma1 <- diag(c(2, 2))
Sigma2 <- diag(c(1.5, 1.5))
samp <- rbind(mvtnorm::rmvnorm(n = n / 2, mean = mu, sigma = Sigma1),
              mvtnorm::rmvnorm(n = n / 2, mean = -mu, sigma = Sigma2))
x[, i] <- samp[, 1]
y[, i] <- samp[, 2]

• Fan Shape

i <- 6
xx <- seq(0, 4, length = n)
yy <- xx * rnorm(n, 2, 2)
x[, i] <- xx
y[, i] <- yy

The following are used to compute local Gaussian correlation and local distance correlation:

• R function to compute local distance correlation

# Note: myfun is the kernel weight function (a Gaussian kernel) defined elsewhere;
# dcor2(x, y, type = "U") returns the bias-corrected squared distance correlation.
local.dcor <- function(x, y, h1, h2) {
  library(energy)
  if (!is.vector(x) || !is.vector(y)) {
    if (NCOL(x) > 1 || NCOL(y) > 1)
      stop("This method is only for univariate x and y")
  }
  n <- length(x)
  # Grid of evaluation points over the range of the data
  x0 <- seq(min(x), max(x), length = 15)
  y0 <- seq(min(y), max(y), length = 15)
  xy.mat <- data.frame(x0, y0)
  # Kernel weights of each observation at each grid point
  ax <- outer(xy.mat[, 1], x, "-") / h1
  ay <- outer(xy.mat[, 2], y, "-") / h2
  wt <- tcrossprod(matrix(myfun(ax), , n),
                   matrix(myfun(ay), , n)) / (n * h1 * h2)
  wt[wt < 0.001] <- 0          # truncate negligible weights to zero
  nw <- nrow(wt)
  # Jackknife pseudo-values of the bias-corrected squared distance correlation
  z <- numeric(n)
  Un <- dcor2(x, y, type = "U")
  for (i in 1:n)
    z[i] <- n * Un - (n - 1) * dcor2(x[-i], y[-i], type = "U")
  # Weighted jackknife empirical log-likelihood evaluated at a candidate value theta
  thetaF <- function(theta, z, n, wt) {
    wtt <- wt
    u <- z - theta
    B <- 0.02 / max(abs(u))
    lamF <- function(lam, u, wtt) { sum(wtt * u / (1 + lam * u)) }
    # Bracket and solve for the Lagrange multiplier lambda
    if (lamF(0, u, wtt) == 0) lam0 <- 0
    else {
      if (lamF(0, u, wtt) > 0) {
        lo <- 0
        up <- B
        while (lamF(up, u, wtt) > 0)
          up <- up + B
      }
      else {
        up <- 0
        lo <- -B
        while (lamF(lo, u, wtt) < 0)
          lo <- lo - B
      }
      lam0 <- optimize(lamF, lower = lo, upper = up,
                       tol = .Machine$double.eps^0.25,
                       maximum = TRUE, u = u, wtt = wtt)$maximum
    }
    pk <- (1 + lam0 * u)
    logpk <- sum(wtt * log(pk))
    return(logpk)
  }
  # Local estimate at each grid point via optimization of the weighted log-likelihood
  est.wt <- matrix(0, nw, nw)
  for (j in 1:nw) {
    for (i in 1:nw) {
      if (wt[i, j] == 0) {
        est.wt[i, j] <- 0
      }
      else {
        est.wt[i, j] <- -suppressWarnings(optimize(thetaF,
          lower = min(z * wt[i, j]), upper = max(z * wt[i, j]),
          tol = .Machine$double.eps^0.5, maximum = TRUE, n = n,
          z = z, wt = wt[i, j])$maximum)
      }
    }
  }
  # Return the grid coordinates and the local estimates in long format
  ldcor <- data.frame(expand.grid(x = xy.mat[, 1], y = xy.mat[, 2]),
                      z = as.vector(est.wt))
  return(ldcor)
}

• The visualization of local Gaussian correlation and local distance correlation

library(localgauss)
library(ggplot2)

# x and y are the vectors for one dependence structure (e.g., x[, i] and y[, i] above);
# data1 is a data frame of the observed (x, y) points, and low_color / high_color
# are the user's chosen fill colors.
lg.out <- localgauss(x, y, b1 = bw.nrd(x), b2 = bw.nrd(y))
g <- data.frame(lg.out$par.est)
lg.out1 <- data.frame(cbind(x = lg.out$xy.mat[, 1], y = lg.out$xy.mat[, 2],
                            rho = g$rho))

plot_xy1 <- ggplot() +
  layer(data = lg.out1, mapping = aes(x = x, y = y, fill = rho),
        geom = "tile", stat = "identity", position = "identity") +
  scale_fill_gradient2(midpoint = 0, low = low_color, high = high_color,
                       space = "Lab", limits = c(-1, 1),
                       breaks = seq(-1, 1, by = .2)) +
  geom_point(data = data1, mapping = aes(x, y)) +
  ggtitle("Local Gaussian correlation")

# Local distance correlation on the same grid;
# low and high are the limits chosen for the local distance correlation scale.
ldcor <- local.dcor(x, y, h1 = bw.nrd(x), h2 = bw.nrd(y))
dcor <- data.frame(cbind(x = ldcor[, 1], y = ldcor[, 2],
                         dcor1 = round(ldcor[, 3], 2)))

plot_xy2 <- ggplot() +
  layer(data = dcor, mapping = aes(x = x, y = y, fill = dcor1),
        geom = "tile", stat = "identity", position = "identity") +
  scale_fill_gradient2(space = "Lab", limits = c(low, high),
                       breaks = seq(low, high, by = .1)) +
  geom_point(data = data1, mapping = aes(x, y)) +
  ggtitle("Local distance correlation")