Package ‘GiniDistance’

June 28, 2019

Type Package
Title A New Gini Correlation Between Quantitative and Qualitative Variables
Version 0.1.0
Author Dao Nguyen and Xin Dang
Maintainer Dao Nguyen
Description An implementation of a new Gini covariance and correlation to measure dependence between categorical and numerical variables. Dang, X., Nguyen, D., Chen, Y. and Zhang, J. (2018).
Depends R (>= 3.0.0)
Imports Rcpp (>= 1.0.0), energy, readxl, randomForest
LinkingTo Rcpp, RcppArmadillo
License GPL (>= 2)
Encoding UTF-8
LazyData true
Collate GiniDistance-package.R RcppExports.R gmd.R gCov.R gCor.R dCov.R dCor.R Kgmd.R KgCov.R KgCor.R KdCov.R KdCor.R ConfidenceInterval.R PermutationTest.R utils.R
RoxygenNote 6.1.0
NeedsCompilation yes
Repository CRAN
Date/Publication 2019-06-28 12:30:03 UTC

R topics documented:

GiniDistance-package
ConfidenceInterval
CriticalValue
dCor
dCov
gCor
gCov
gmd
KdCor
KdCov
KgCor
KgCov
Kgmd
PermutationTest
RcppgCor
RcppgCov
RcppGmd
RcppKgCor
RcppKgCov
RcppKGmd

GiniDistance-package GiniDistance

Description

A new Gini correlation to measure dependence between categorical and numerical variables is implemented. Analogous to Pearson $R^2$ in the ANOVA model, the Gini correlation is interpreted as the ratio of the between-group variation to the total variation, but it also characterizes independence: the Gini correlation is zero if and only if the variables are independent. Closely related to the distance correlation, the Gini correlation has a simpler formulation because it exploits the nature of the categorical variable. As a result, it has a lower computational cost than the distance correlation and makes inference more straightforward. A dependence test and a confidence interval are implemented, together with the corresponding kernelized dependence measures.

Details

The details are described in the papers "A new Gini correlation between quantitative and qualitative variables" and "Estimating feature-label dependence using Gini distance statistics".

Author(s)

Dao Nguyen and Xin Dang

References

Dang, X., Nguyen, D., Chen, Y. and Zhang, J. (2019). A new Gini correlation between quantitative and qualitative variables. Journal of the American Statistical Association (submitted). https://arxiv.org/pdf/1809.09793.pdf

Zhang, S., Dang, X., Nguyen, D. and Chen, Y. (2019). Estimating feature-label dependence using Gini distance statistics. IEEE Transactions on Pattern Analysis and Machine Intelligence (submitted). https://arXiv.org/pdf/1906.02171.pdf

ConfidenceInterval Confidence Interval of Dependence measure

Description

Finds confidence intervals for dependence measures, in which X is quantitative and Y is categorical, using the jackknife method.

Usage

ConfidenceInterval(x, y, sigma, alpha, level, method)

Arguments

x        data
y        label of data or univariate response variable
sigma    kernel parameter
alpha    exponent on Euclidean distance, in (0,2]
level    level of confidence, in [0,1]
method   name of the dependence measure, chosen from "gCor", "gCov", "dCor", "dCov", "KgCor", "KgCov", "KdCor" and "KdCov"

Details

ConfidenceInterval computes a confidence interval for the selected dependence statistic. It is a self-contained R function. The sample size (number of rows) of the data must agree with the length of the label vector, and samples must not contain missing values. Arguments x, y are treated as data and labels. If alpha is missing it defaults to 1; otherwise it is the exponent on the Euclidean distance.

Suppose a sample $D = \{(x_i, y_i)\}$, $i = 1, \ldots, n$, is available. The confidence interval is built upon the asymptotic normality of the sample dependence statistic. The asymptotic variance is estimated by the jackknife method. For more details, refer to Shao and Tu (1996).
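The construction described above can be sketched in base R. This is an illustration only and not the package's internal code: jackknife_ci, stat_fun and mean_gap are hypothetical names, and the variance estimate is the standard delete-one jackknife formula combined with a normal quantile.

```r
# Hypothetical helper: normal-approximation confidence interval whose
# standard error comes from the delete-one jackknife.
# stat_fun(x, y) must return a single number.
jackknife_ci <- function(x, y, stat_fun, level = 0.95) {
  x <- as.matrix(x)
  n <- nrow(x)
  theta_hat <- stat_fun(x, y)
  # Leave-one-out replicates of the statistic
  theta_loo <- vapply(seq_len(n), function(i) {
    stat_fun(x[-i, , drop = FALSE], y[-i])
  }, numeric(1))
  # Jackknife variance estimate: (n - 1) / n * sum((theta_i - mean)^2)
  v_jack <- (n - 1) / n * sum((theta_loo - mean(theta_loo))^2)
  z <- qnorm(1 - (1 - level) / 2)
  c(lower = theta_hat - z * sqrt(v_jack),
    estimate = theta_hat,
    upper = theta_hat + z * sqrt(v_jack))
}

# Toy statistic (an assumption for illustration): gap between group means
mean_gap <- function(x, y) mean(x[y == 2, 1]) - mean(x[y == 1, 1])

set.seed(1)
x <- c(rnorm(20), rnorm(20, mean = 1))
y <- rep(1:2, each = 20)
jackknife_ci(x, y, mean_gap, level = 0.95)
```

Any function of (x, y) that returns one number can be passed as stat_fun, so the same sketch applies to a Gini or distance correlation estimate.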

Value

ConfidenceInterval returns the confidence interval of the selected dependence measure.

References

Dang, X., Nguyen, D., Chen, Y. and Zhang, J. (2019). A new Gini correlation between quantitative and qualitative variables. Submitted.

Shao, J. and Tu, D. (1996). The Jackknife and Bootstrap. Springer, New York.

Examples

x <- iris[,1:4]
y <- unclass(iris[,5])
ConfidenceInterval(x, y, alpha=1, level=0.95, method='gCor')

CriticalValue Find a critical value by permutation test of dependence between X and Y using kernel (Gini) distance covariance or correlation statistics

Description

Finds a critical value by permutation test using kernel (Gini) distance covariance or correlation statistics, in which X is quantitative, Y is categorical, sigma is the kernel standard deviation and alpha is an exponent on the Euclidean distance. Returns the critical value of the measure of dependence.

Usage

CriticalValue(x, y, sigma, alpha, level, M = 1000, method)

Arguments

x        data
y        label of data or univariate response variable
sigma    kernel standard deviation
alpha    exponent on Euclidean distance, in (0,2]
level    significance level of the test, the default value = 0.05
M        number of permutations
method   string name of the method for the permutation test, e.g. gCov

Details

CriticalValue computes the critical value of a dependence test based on kernel (Gini) distance covariance or correlation statistics. It is a self-contained R function returning the critical value of the dependence statistic.

The critical value of the test at significance level $\gamma$ is obtained by a permutation procedure. Let $\nu = 1:n$ be the vector of original sample indices of the $Y$ labels and $\hat\rho_g(\alpha) = \hat\rho_g(\nu; \alpha)$. Let $\pi(\nu)$ denote a permutation of the elements of $\nu$, for which the corresponding $\hat\rho_g(\pi; \alpha)$ is computed. Under $H_0$, $\hat\rho_g(\nu; \alpha)$ and $\hat\rho_g(\pi; \alpha)$ are identically distributed for every permutation $\pi$ of $\nu$. Hence, based on $M$ permutations, the critical value $q_\gamma$ is estimated by the $(1 - \gamma)100\%$ sample quantile of $\hat\rho_g(\pi_m; \alpha)$, $m = 1, \ldots, M$. Usually $100 \le M \le 1000$ is sufficient for a good estimate of the critical value.

See PermutationTest for a test of multivariate independence based on the (Gini) distance statistic.
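The permutation procedure can be sketched in a few lines of base R. The names perm_critical_value, stat_fun and abs_mean_gap are hypothetical and chosen for this illustration; the toy statistic stands in for any of the dependence measures in this package.

```r
# Hypothetical helper: permutation estimate of the critical value.
# stat_fun(x, y) returns one number; labels y are permuted M times and the
# (1 - level) sample quantile of the permuted statistics is returned.
perm_critical_value <- function(x, y, stat_fun, level = 0.05, M = 1000) {
  x <- as.matrix(x)
  perm_stats <- replicate(M, stat_fun(x, sample(y)))
  unname(quantile(perm_stats, probs = 1 - level))
}

# Toy statistic (an assumption for illustration): absolute gap of group means
abs_mean_gap <- function(x, y) abs(mean(x[y == 1, 1]) - mean(x[y == 2, 1]))

set.seed(1)
n <- 50
x <- runif(n)
y <- c(rep(1, n / 2), rep(2, n / 2))
perm_critical_value(x, y, abs_mean_gap, level = 0.04, M = 1000)
```

Only the labels are shuffled, which keeps the marginal distributions of x and y fixed while breaking any dependence between them.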

Value

CriticalValue returns the critical value of the permutation test for the specified dependence measure.

See Also

PermutationTest

Examples

n = 50
x <- runif(n)
y <- c(rep(1,n/2),rep(2,n/2))
CriticalValue(x, y, sigma=1, alpha=2, level=0.04, M = 1000, method='KgCov')

dCor Distance Covariance and Correlation Statistics

Description

Computes distance covariance and correlation statistics, in which X is quantitative and Y is categorical, and returns the measures of dependence.

Usage

dCor(x, y, alpha)

Arguments

x       data
y       label of data or univariate response variable
alpha   exponent on Euclidean distance, in (0,2]

Details

The sample size (number of rows) of the data must agree with the length of the label vector, and samples must not contain missing values. Arguments x, y are treated as data and labels.

dCor calls the dcor function from the energy package, which computes the distance correlation between X and Y where both are numerical variables. If Y is categorical, the set difference metric on the support of Y is used, that is,

$$d(y, y') = |y - y'| := I(y \neq y'),$$

where $I(\cdot)$ is the indicator function. The sample distance correlation between data and labels is then computed as follows.

Let $A = (A_{ij})$ be the symmetric, $n \times n$, centered distance matrix of the sample $x_1, \ldots, x_n$. The $(i, j)$-th entry of $A$ is

$$A_{ij} = a_{ij} - \frac{1}{n-2} a_{i\cdot} - \frac{1}{n-2} a_{\cdot j} + \frac{1}{(n-1)(n-2)} a_{\cdot\cdot}$$

if $i \neq j$ and $0$ if $i = j$, where $a_{ij} = \|x_i - x_j\|^\alpha$, $a_{i\cdot} = \sum_{j=1}^n a_{ij}$, $a_{\cdot j} = \sum_{i=1}^n a_{ij}$ and $a_{\cdot\cdot} = \sum_{i,j=1}^n a_{ij}$. Similarly, using the set difference metric, a symmetric, $n \times n$, centered distance matrix is calculated for the sample $y_1, \ldots, y_n$ and denoted by $B = (B_{ij})$. Unbiased estimators of $\mathrm{dCov}(X, Y; \alpha)$, $\mathrm{dCov}(X, X; \alpha)$ and $\mathrm{dCov}(Y, Y; \alpha)$ are given respectively by

$$\frac{1}{n(n-3)} \sum_{i \neq j} A_{ij} B_{ij}, \qquad \frac{1}{n(n-3)} \sum_{i \neq j} A_{ij}^2, \qquad \frac{1}{n(n-3)} \sum_{i \neq j} B_{ij}^2.$$

Then the distance correlation is

$$\mathrm{dCor}(X, Y; \alpha) = \frac{\mathrm{dCov}(X, Y; \alpha)}{\sqrt{\mathrm{dCov}(X, X; \alpha)}\,\sqrt{\mathrm{dCov}(Y, Y; \alpha)}}.$$
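The centering and the three estimators can be transcribed directly into base R. The sketch below follows the formulas as written here and does not call the energy package; u_center and dcor_sketch are hypothetical names, and the result is not claimed to match dCor() digit for digit.

```r
# Hypothetical helper: centered version of a symmetric distance matrix,
# with zero diagonal, following the entry formula in the text.
u_center <- function(d) {
  n <- nrow(d)
  row_s <- rowSums(d)
  col_s <- colSums(d)
  A <- d -
    outer(row_s, rep(1, n)) / (n - 2) -   # a_i. / (n - 2)
    outer(rep(1, n), col_s) / (n - 2) +   # a_.j / (n - 2)
    sum(d) / ((n - 1) * (n - 2))          # a_.. / ((n - 1)(n - 2))
  diag(A) <- 0
  A
}

# Hypothetical helper: distance covariance and correlation between numeric
# data x and categorical labels y (requires n > 3).
dcor_sketch <- function(x, y, alpha = 1) {
  x <- as.matrix(x)
  n <- nrow(x)
  a <- as.matrix(dist(x))^alpha        # a_ij = ||x_i - x_j||^alpha
  b <- outer(y, y, FUN = "!=") * 1     # set difference metric I(y_i != y_j)
  A <- u_center(a)
  B <- u_center(b)
  # Diagonals are zero, so summing all entries equals summing over i != j
  cov_xy <- sum(A * B) / (n * (n - 3))
  var_x <- sum(A * A) / (n * (n - 3))
  var_y <- sum(B * B) / (n * (n - 3))
  c(dCov = cov_xy, dCor = cov_xy / sqrt(var_x * var_y))
}

set.seed(1)
x <- c(rnorm(15), rnorm(15, mean = 2))
y <- rep(1:2, each = 15)
dcor_sketch(x, y, alpha = 1)
```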

Value

dCor returns the sample distance variance of x, distance variance of y, distance covariance of x and y, and distance correlation of x and y.

References

Lyons, R. (2013). Distance covariance in metric spaces. The Annals of Probability, 41 (5), 3284-3305.

Szekely, G. J., Rizzo, M. L. and Bakirov, N. (2007). Measuring and testing dependence by correlation of distances. Annals of Statistics, 35 (6), 2769-2794.

Rizzo, M. L. and Szekely, G. J. (2017). Energy: E-Statistics: Multivariate Inference via the Energy of Data (R Package), Version 1.7-0.

See Also

dCov KdCov KdCor

Examples

x <- iris[,1:4]
y <- unclass(iris[,5])
dCor(x, y, alpha = 1)

dCov Distance Covariance Statistic

Description

Computes the distance covariance statistic, in which X is quantitative and Y is categorical, and returns the measure of dependence.

Usage

dCov(x, y, alpha)

Arguments

x       data
y       label of data or response variable
alpha   exponent on Euclidean distance, in (0,2]

Details

dCov calls the dcov function from the energy package to compute the distance covariance statistic. The sample size (number of rows) of the data must agree with the length of the label vector, and samples must not contain missing values. Arguments x, y are treated as data and labels.

The distance covariance (Szekely et al., 2007) is extended from Euclidean space to general metric spaces by Lyons (2013). Based on that idea, we define the discrete metric

$$d(y, y') = |y - y'| := I(y \neq y'),$$

where $I(\cdot)$ is the indicator function. Equipped with this set difference metric on the support of Y and Euclidean distance on the support of X, the corresponding distance covariance and distance correlation for numerical X and categorical Y variables are as follows.

Let $A = (A_{ij})$ be the symmetric, $n \times n$, centered distance matrix of the sample $x_1, \ldots, x_n$. The $(i, j)$-th entry of $A$ is

$$A_{ij} = a_{ij} - \frac{1}{n-2} a_{i\cdot} - \frac{1}{n-2} a_{\cdot j} + \frac{1}{(n-1)(n-2)} a_{\cdot\cdot}$$

if $i \neq j$ and $0$ if $i = j$, where $a_{ij} = \|x_i - x_j\|^\alpha$, $a_{i\cdot} = \sum_{j=1}^n a_{ij}$, $a_{\cdot j} = \sum_{i=1}^n a_{ij}$ and $a_{\cdot\cdot} = \sum_{i,j=1}^n a_{ij}$. Similarly, using the set difference metric, a symmetric, $n \times n$, centered distance matrix is calculated for the sample $y_1, \ldots, y_n$ and denoted by $B = (B_{ij})$. An unbiased estimator of $\mathrm{dCov}(X, Y; \alpha)$ is

$$\frac{1}{n(n-3)} \sum_{i \neq j} A_{ij} B_{ij}.$$

Value

dCov returns the sample distance covariance between data x and label y.

References

Lyons, R. (2013). Distance covariance in metric spaces. The Annals of Probability, 41 (5), 3284-3305.

Rizzo, M. L. and Szekely, G. J. (2017). Energy: E-Statistics: Multivariate Inference via the Energy of Data (R Package), Version 1.7-0.

Szekely, G. J., Rizzo, M. L. and Bakirov, N. (2007). Measuring and testing dependence by correlation of distances. Annals of Statistics, 35 (6), 2769-2794.

See Also

dCor KdCov KdCor

Examples

x <- iris[,1:4]
y <- unclass(iris[,5])
dCov(x, y, alpha = 1)

gCor Gini Distance Covariance and Correlation Statistics

Description

Computes Gini distance covariance and correlation statistics, in which X is quantitative, Y is categorical and alpha is the exponent on the Euclidean distance, and returns the measures of dependence.

Usage

gCor(x, y, alpha)

Arguments

x       data
y       label of data or univariate response variable
alpha   exponent on Euclidean distance, in (0,2)

Details

gCor computes Gini distance correlation statistics. It is a self-contained R function returning a measure of dependence. The sample size (number of rows) of the data must agree with the length of the label vector, and samples must not contain missing values. Arguments x, y are treated as data and labels. If alpha is missing it defaults to 1; otherwise it is the exponent on the Euclidean distance.

Suppose a sample $D = \{(x_i, y_i)\}$, $i = 1, \ldots, n$, is available. The sample counterparts can be easily computed. Let $I_k$ be the index set of sample points with $y_i = L_k$; then $p_k$ is estimated by the sample proportion of that category, that is, $\hat p_k = n_k / n$, where $n_k$ is the number of elements in $I_k$. With a given $\alpha \in (0, 2)$, a point estimator of $\rho_g(\alpha)$ is given as follows:

$$\hat\Delta_k(\alpha) = \binom{n_k}{2}^{-1} \sum_{i < j \in I_k} \|x_i - x_j\|^\alpha, \qquad \hat\Delta(\alpha) = \binom{n}{2}^{-1} \sum_{1 \le i < j \le n} \|x_i - x_j\|^\alpha,$$

$$\hat\rho_g(\alpha) = 1 - \frac{\sum_{k=1}^K \hat p_k \hat\Delta_k(\alpha)}{\hat\Delta(\alpha)}.$$

Value

gCor returns the sample Gini distance covariance and correlation between x and y.

References

Dang, X., Nguyen, D., Chen, Y. and Zhang, J. (2019). A new Gini correlation between quantitative and qualitative variables. Submitted to Journal of the American Statistical Association.

See Also

gmd gCov KgCov KgCor

Examples

x <- iris[,1:4]
y <- unclass(iris[,5])
gCor(x, y, alpha = 1)

gCov Gini Distance Covariance Statistics

Description

Computes Gini distance covariance statistics, in which X is quantitative, Y is categorical and alpha is an exponent on the Euclidean distance, and returns the measure of dependence.

Usage

gCov(x, y, alpha)

Arguments

x       data
y       label of data or univariate response variable
alpha   exponent on Euclidean distance, in (0,2]

Details

gCov computes Gini distance covariance statistics. It is a self-contained R function returning a measure of dependence. The sample size (number of rows) of the data must agree with the length of the label vector, and samples must not contain missing values. Arguments x, y are treated as data and labels. If alpha is missing it defaults to 1; otherwise it is the exponent on the Euclidean distance.

Gini distance covariance is a new measure of dependence between a random vector and its label. For all distributions with finite first moments, the Gini distance covariance gCov has the following fundamental properties:

(1) gCov(X,Y) is defined for a quantitative variable X of arbitrary dimension and a univariate categorical variable Y.

(2) gCov(X,Y) = 0 characterizes independence of X and Y.

Gini distance covariance satisfies $0 \le \mathrm{gCov}(X, Y)$, and gCov = 0 only if X and Y are independent. Gini distance covariance provides a new approach to the problem of testing the joint independence of random vectors. The formal definition of the population coefficient gCov is given in Dang et al. (2019). The empirical Gini distance covariance $\mathrm{gCov}_n(X, Y; \alpha)$ is the nonnegative number computed as follows.

Suppose a sample $D = \{(x_i, y_i)\}$, $i = 1, \ldots, n$, is available. The sample counterparts can be easily computed. Let $I_k$ be the index set of sample points with $y_i = L_k$; then $p_k$ is estimated by the sample proportion of that category, that is, $\hat p_k = n_k / n$, where $n_k$ is the number of elements in $I_k$. With a given $\alpha \in (0, 2)$, the point estimator of the Gini distance covariance is given as follows:

$$\hat\Delta_k(\alpha) = \binom{n_k}{2}^{-1} \sum_{i < j \in I_k} \|x_i - x_j\|^\alpha, \qquad \hat\Delta(\alpha) = \binom{n}{2}^{-1} \sum_{1 \le i < j \le n} \|x_i - x_j\|^\alpha,$$

$$\mathrm{gCov} = \hat\Delta(\alpha) - \sum_{k=1}^K \hat p_k \hat\Delta_k(\alpha).$$
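The estimator above amounts to an overall mean pairwise distance minus a weighted average of within-group mean pairwise distances. A minimal base R sketch follows; gini_md and gcov_sketch are hypothetical names, the ratio gCov / overall Gini mean difference is returned as gCor on the assumption that the correlation is the covariance normalized by the total variation, and a group with a single observation is given a within-group value of 0 by convention.

```r
# Hypothetical helper: mean of ||x_i - x_j||^alpha over all pairs i < j.
gini_md <- function(x, alpha = 1) {
  x <- as.matrix(x)
  if (nrow(x) < 2) return(0)          # convention for singleton groups
  mean(as.vector(dist(x))^alpha)      # dist() yields the choose(n, 2) pairs
}

# Hypothetical helper: sample Gini distance covariance and its ratio to the
# overall Gini mean difference.
gcov_sketch <- function(x, y, alpha = 1) {
  x <- as.matrix(x)
  y <- as.vector(y)
  delta <- gini_md(x, alpha)                      # overall variation
  p_hat <- table(y) / length(y)                   # sample proportions
  delta_k <- vapply(names(p_hat), function(k) {
    gini_md(x[y == k, , drop = FALSE], alpha)     # within-group variation
  }, numeric(1))
  g_cov <- delta - sum(p_hat * delta_k)
  c(gCov = g_cov, gCor = g_cov / delta)
}

x <- iris[, 1:4]
y <- unclass(iris[, 5])
gcov_sketch(x, y, alpha = 1)
```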

Value

gCov returns the sample Gini distance covariance.

References

Dang, X., Nguyen, D., Chen, Y. and Zhang, J. (2019). A new Gini correlation between quantitative and qualitative variables. Journal of the American Statistical Association (submitted). https://arxiv.org/pdf/1809.09793.pdf

See Also

gCor gmd KgCov KgCor

Examples

x <- iris[,1:4]
y <- unclass(iris[,5])
gCov(x, y, alpha = 1)

gmd Gini Mean Difference

Description

Computes the Gini mean difference of x, where alpha is an exponent on the Euclidean distance, and returns the Gini mean difference. The default value for alpha is 1.

Usage

gmd(x, alpha)

Arguments

x       data
alpha   exponent on Euclidean distance, in (0,2)

Details

gmd computes the Gini mean difference of data. It is a self-contained R function dealing with both univariate and multivariate data. The samples must not contain missing values. If alpha is missing it defaults to 1; otherwise it is the exponent on the Euclidean distance.

The Gini mean difference (GMD) was originally introduced as an alternative measure of variability to the usual standard deviation (Gini, 1914; Yitzhaki and Schechtman, 2013). Let $X$ and $X'$ be independent random variables from a univariate distribution $F$ in $\mathbb{R}$ with finite first moment. The GMD of $F$ is

$$\Delta = \Delta(X) = \Delta(F) = E|X - X'|,$$

the expected distance between two independent random variables. If the sample data $x = \{x_1, x_2, \ldots, x_n\}$ is available, the sample Gini mean difference is calculated by

$$\hat\Delta = \binom{n}{2}^{-1} \sum_{1 \le i < j \le n} |x_i - x_j| = \binom{n}{2}^{-1} \sum_{i=1}^n (2i - n - 1)\, x_{(i)},$$

where $x_{(1)} \le x_{(2)} \le \cdots \le x_{(n)}$ are the order statistics of $x$ (Schechtman and Yitzhaki, 1987). The computational complexity of the univariate Gini mean difference is $O(n \log n)$.

The Gini mean difference has been generalized to multivariate distributions (Koshevoy and Mosler, 1997). That is, the Gini mean difference of a distribution $F$ in $\mathbb{R}^d$ is $\Delta = E\|X - X'\|$, or even more generally, for some $\alpha \in (0, 2)$, $\Delta(\alpha) = E\|X - X'\|^\alpha$, where $\|x\|$ is the Euclidean norm. The sample Gini mean difference is computed by

$$\hat\Delta(\alpha) = \binom{n}{2}^{-1} \sum_{1 \le i < j \le n} \|x_i - x_j\|^\alpha.$$
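The two univariate forms can be checked against each other in base R. This is a sketch under stated assumptions rather than the package's code: gmd_sorted and gmd_pairwise are hypothetical names, the first uses the order-statistics identity and the second enumerates all pairs.

```r
# Hypothetical helper: univariate Gini mean difference via order statistics.
# Sorting dominates the cost, giving O(n log n).
gmd_sorted <- function(x) {
  n <- length(x)
  xs <- sort(x)
  sum((2 * seq_len(n) - n - 1) * xs) / choose(n, 2)
}

# Hypothetical helper: pairwise form, valid for univariate or multivariate
# data and any exponent alpha; cost is O(n^2).
gmd_pairwise <- function(x, alpha = 1) {
  x <- as.matrix(x)
  mean(as.vector(dist(x))^alpha)
}

set.seed(1)
x <- runif(100)
gmd_sorted(x)
gmd_pairwise(x, alpha = 1)

# Multivariate data with a different exponent
gmd_pairwise(matrix(runif(100), 50, 2), alpha = 0.5)
```

For alpha = 1 on univariate data both functions agree up to floating-point error; the sorted form is preferable for large n.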

Value

gmd returns the sample Gini mean difference.

References

Gini, C. (1914). Sulla misura della concentrazione e della variabilita dei caratteri. Atti del Reale Istituto Veneto di Scienze, Lettere ed Arti, 62, 1203-1248. English translation: On the measurement of concentration and variability of characters (2005). Metron, LXIII(1), 3-38.

Koshevoy, G. and Mosler, K. (1997). Multivariate Gini indices. Journal of Multivariate Analysis, 60, 252-276.

Schechtman, E. and Yitzhaki, S. (1987). A measure of association based on Gini's mean difference. Communications in Statistics - Theory and Methods, 16 (1), 207-231.

Yitzhaki, S. and Schechtman, E. (2013). The Gini Methodology. Springer, New York.

See Also

RcppGmd gCov gCor

Examples

n = 100
x <- runif(n)

t0 = proc.time()
gmd(x, alpha=1)
proc.time() - t0

t1 = proc.time()
gmd(x, alpha=0.5)
proc.time() - t1

x <- matrix(runif(n), n/2, 2)
gmd(x, alpha=1)

KdCor Kernel Distance Correlation Statistics

Description

Computes kernel distance correlation statistics, in which X is quantitative, Y is categorical and sigma is the kernel standard deviation, and returns the measure of dependence.

Usage

KdCor(x, y, sigma)

Arguments

x       data
y       label of data or univariate response variable
sigma   kernel standard deviation

Details

KdCor computes kernel distance correlation statistics. The sample size (number of rows) of the data must agree with the length of the label vector, and samples must not contain missing values. Arguments x, y are treated as data and labels.

The kernel distance correlation is defined as follows:

$$\mathrm{dCor}_{\kappa_X,\kappa_Y}(X, Y) = \frac{\mathrm{dCov}_{\kappa_X,\kappa_Y}(X, Y)}{\sqrt{\mathrm{dCov}_{\kappa_X,\kappa_X}(X, X)}\,\sqrt{\mathrm{dCov}_{\kappa_Y,\kappa_Y}(Y, Y)}},$$

where

$$\mathrm{dCov}_{\kappa_X,\kappa_Y}(X, Y) = E\, d_{\kappa_X}(X, X')\, d_{\kappa_Y}(Y, Y') + E\, d_{\kappa_X}(X, X')\, E\, d_{\kappa_Y}(Y, Y') - 2 E\left[ E_{X'} d_{\kappa_X}(X, X')\, E_{Y'} d_{\kappa_Y}(Y, Y') \right].$$

Value

KdCor returns the sample kernel distance correlation.

References

Sejdinovic, D., Sriperumbudur, B., Gretton, A. and Fukumizu, K. (2013). Equivalence of distance-based and RKHS-based statistics in hypothesis testing. The Annals of Statistics, 41 (5), 2263-2291.

Zhang, S., Dang, X., Nguyen, D. and Chen, Y. (2019). Estimating feature-label dependence using Gini distance statistics. IEEE Transactions on Pattern Analysis and Machine Intelligence (submitted).

See Also

KgCov KgCor dCor

Examples

x <- iris[,1:4]
y <- unclass(iris[,5])
KdCor(x, y, sigma=1)

KdCov Kernel Distance Covariance Statistics

Description

Computes kernel distance covariance statistics, in which X is quantitative, Y is categorical and sigma is the kernel standard deviation, and returns the measure of dependence.

Usage

KdCov(x, y, sigma)

Arguments

x       data
y       label of data or univariate response variable
sigma   kernel standard deviation

Details

KdCov computes kernel distance covariance statistics. The sample size (number of rows) of the data must agree with the length of the label vector, and samples must not contain missing values. Arguments x, y are treated as data and labels.

Distance covariance was introduced in Szekely et al. (2007) as a dependence measure between random variables $X \in \mathbb{R}^p$ and $Y \in \mathbb{R}^q$. If $X$ and $Y$ are embedded into RKHSs induced by $\kappa_X$ and $\kappa_Y$, respectively, the generalized distance covariance of $X$ and $Y$ is (Sejdinovic et al., 2013):

$$\mathrm{dCov}_{\kappa_X,\kappa_Y}(X, Y) = E\, d_{\kappa_X}(X, X')\, d_{\kappa_Y}(Y, Y') + E\, d_{\kappa_X}(X, X')\, E\, d_{\kappa_Y}(Y, Y') - 2 E\left[ E_{X'} d_{\kappa_X}(X, X')\, E_{Y'} d_{\kappa_Y}(Y, Y') \right].$$

In the case of $Y$ being categorical, one may embed it using a set difference kernel $\kappa_Y$,

$$\kappa_Y(y, y') = \begin{cases} \frac{1}{2} & \text{if } y = y', \\ 0 & \text{otherwise.} \end{cases}$$

This is equivalent to embedding $Y$ as a simplex with edges of unit length (Lyons, 2013), i.e., $L_k$ is represented by a $K$-dimensional vector of all zeros except its $k$-th dimension, which has the value $\sqrt{2}/2$. The distance induced by $\kappa_Y$ is called the set distance, i.e., $d_{\kappa_Y}(y, y') = 0$ if $y = y'$ and $1$ otherwise. Using the set distance, we have the following result on the generalized distance covariance between a numerical and a categorical variable:

$$\mathrm{dCov}_{\kappa_X,\kappa_Y}(X, Y) := \mathrm{dCov}_{\kappa_X}(X, Y) = \sum_{k=1}^K p_k^2 \left[ 2 E\, d_{\kappa_X}(X_k, X) - E\, d_{\kappa_X}(X_k, X_k') - E\, d_{\kappa_X}(X, X') \right].$$

Value

KdCov returns the sample kernel distance covariance.

References

Sejdinovic, D., Sriperumbudur, B., Gretton, A. and Fukumizu, K. (2013). Equivalence of distance-based and RKHS-based statistics in hypothesis testing. The Annals of Statistics, 41 (5), 2263-2291.

Zhang, S., Dang, X., Nguyen, D. and Chen, Y. (2019). Estimating feature-label dependence using Gini distance statistics. IEEE Transactions on Pattern Analysis and Machine Intelligence (submitted).

See Also

KgCov KgCor dCov

Examples

x <- iris[,1:4]
y <- unclass(iris[,5])
KdCov(x, y, sigma=1)

KgCor Kernel Gini Distance Correlation Statistics

Description

Computes kernel Gini distance correlation statistics, in which X is quantitative, Y is categorical and sigma is the kernel standard deviation, and returns the kernel Gini distance correlation.

Usage

KgCor(x, y, sigma)

Arguments

x       data
y       label of data or univariate response variable
sigma   kernel standard deviation

Details

KgCor computes kernel Gini distance correlation statistics for data. It is a self-contained R function dealing with both univariate and multivariate data. The sample size (number of rows) of the data must agree with the length of the label vector, and samples must not contain missing values. Arguments x, y are treated as data and labels.

Gini distance correlation is generalized to an RKHS, $H_\kappa$, as

$$\mathrm{gCor}_\kappa(X, Y) = \frac{\sum_{k=1}^K p_k \left[ 2 E\, d_\kappa(X_k, X) - E\, d_\kappa(X_k, X_k') - E\, d_\kappa(X, X') \right]}{E\, d_\kappa(X, X')}.$$

In this case, we use the default Gaussian distance function

$$d_\kappa(x, x') = \sqrt{1 - e^{-|x - x'|_q^2 / \sigma^2}},$$

induced by a weighted Gaussian kernel, $\kappa(x, x') = \frac{1}{2} e^{-|x - x'|_q^2 / \sigma^2}$.
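A plug-in version of these population formulas can be sketched in base R. Since $E\, d_\kappa(X_k, X) = \sum_l p_l E\, d_\kappa(X_k, X_l)$, the numerator simplifies to $E\, d_\kappa(X, X') - \sum_k p_k E\, d_\kappa(X_k, X_k')$, which is what the code estimates with sample means. The names gauss_dist and kernel_gini_sketch are hypothetical, and this sample estimator is an assumption made for illustration; it is not claimed to be the exact computation inside KgCor or KgCov.

```r
# Hypothetical helper: Gaussian distances sqrt(1 - exp(-||x - x'||^2 / sigma^2))
# for all pairs i < j of rows of x.
gauss_dist <- function(x, sigma = 1) {
  sq <- as.vector(dist(as.matrix(x)))^2
  sqrt(1 - exp(-sq / sigma^2))
}

# Hypothetical helper: plug-in kernel Gini covariance and correlation.
kernel_gini_sketch <- function(x, y, sigma = 1) {
  x <- as.matrix(x)
  y <- as.vector(y)
  # Mean kernel distance over pairs; 0 by convention for a single row
  mean_d <- function(z) if (nrow(z) < 2) 0 else mean(gauss_dist(z, sigma))
  delta <- mean_d(x)
  p_hat <- table(y) / length(y)
  delta_k <- vapply(names(p_hat), function(k) {
    mean_d(x[y == k, , drop = FALSE])
  }, numeric(1))
  kg_cov <- delta - sum(p_hat * delta_k)
  c(KgCov = kg_cov, KgCor = kg_cov / delta)
}

x <- iris[, 1:4]
y <- unclass(iris[, 5])
kernel_gini_sketch(x, y, sigma = 1)
```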

Value

KgCor returns the sample kernel Gini distance correlation between x and y.

References

Zhang, S., Dang, X., Nguyen, D. and Chen, Y. (2019). Estimating feature-label dependence using Gini distance statistics. IEEE Transactions on Pattern Analysis and Machine Intelligence (submitted). https://arXiv.org/pdf/1906.02171.pdf

See Also

gCov gCor dCor

Examples

x <- iris[,1:4]
y <- unclass(iris[,5])
KgCor(x, y, sigma=1)

KgCov Kernel Gini Distance Covariance Statistics

Description

Computes kernel Gini distance covariance statistics, in which X is quantitative, Y is categorical and sigma is the kernel standard deviation, and returns the kernel Gini covariance.

Usage

KgCov(x, y, sigma)

Arguments

x       data
y       label of data or univariate response variable
sigma   kernel standard deviation

Details

KgCov computes kernel Gini distance covariance statistics for data. It is a self-contained R function dealing with both univariate and multivariate data. The sample size (number of rows) of the data must agree with the length of the label vector, and samples must not contain missing values. Arguments x, y are treated as data and labels.

Gini distance covariance is generalized to a reproducing kernel Hilbert space (RKHS), $H_\kappa$, as

$$\mathrm{gCov}_\kappa(X, Y) = \sum_{k=1}^K p_k \left[ 2 E\, d_\kappa(X_k, X) - E\, d_\kappa(X_k, X_k') - E\, d_\kappa(X, X') \right].$$

In this case, we use the default Gaussian distance function

$$d_\kappa(x, x') = \sqrt{1 - e^{-|x - x'|_q^2 / \sigma^2}},$$

induced by a weighted Gaussian kernel, $\kappa(x, x') = \frac{1}{2} e^{-|x - x'|_q^2 / \sigma^2}$.

Value

KgCov returns the sample kernel Gini distance covariance of x and y.

References

Zhang, S., Dang, X., Nguyen, D. and Chen, Y. (2019). Estimating feature-label dependence using Gini distance statistics. IEEE Transactions on Pattern Analysis and Machine Intelligence (submitted). https://arXiv.org/pdf/1906.02171.pdf

See Also

gCov gCor dCor

Examples

x <- iris[,1:4]
y <- unclass(iris[,5])
KgCov(x, y, sigma=1)

Kgmd Kernel Gini Mean Difference Statistics

Description

Computes kernel Gini mean difference statistics, in which X is quantitative and sigma is the kernel standard deviation, and returns the kernel Gini mean difference.

Usage

Kgmd(x, sigma)

Arguments

x       data
sigma   kernel standard deviation

Details

Kgmd computes kernel Gini mean difference statistics for data. It is a self-contained R function dealing with both univariate and multivariate data. Samples must not contain missing values. Argument x is treated as data.

Energy distance based statistics naturally generalize from a Euclidean space to metric spaces (Lyons, 2013). By using a positive definite kernel (Mercer kernel) (Mercer, 1909), distributions are mapped into an RKHS (Smola et al., 2007) with a kernel induced distance. Hence one can extend energy distances to a much richer family of statistics defined in an RKHS (Sejdinovic et al., 2013). Let $\kappa : \mathbb{R}^q \times \mathbb{R}^q \to \mathbb{R}$ be a Mercer kernel (Mercer, 1909). There is an associated RKHS $H_\kappa$ of real functions on $\mathbb{R}^q$ with reproducing kernel $\kappa$, where the function $d_\kappa : \mathbb{R}^q \times \mathbb{R}^q \to \mathbb{R}$ defines a distance in $H_\kappa$,

$$d_\kappa(x, x') = \sqrt{\kappa(x, x) + \kappa(x', x') - 2 \kappa(x, x')}.$$

Here Kgcov is defined as Gini distance covariance between x and rank(x).
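The kernel-induced distance can be written generically for any kernel function. The sketch below uses hypothetical names (weighted_gauss_kernel, kernel_distance, kernel_gmd_sketch) and assumes, by analogy with gmd, that the kernel Gini mean difference is the average kernel-induced distance over all pairs of observations; the weighted Gaussian kernel is the one described for KgCor and KgCov.

```r
# Weighted Gaussian kernel: 0.5 * exp(-||u - v||^2 / sigma^2)
weighted_gauss_kernel <- function(u, v, sigma = 1) {
  0.5 * exp(-sum((u - v)^2) / sigma^2)
}

# Hypothetical helper: distance induced by a kernel,
# sqrt(k(u, u) + k(v, v) - 2 k(u, v)).
kernel_distance <- function(u, v, kern = weighted_gauss_kernel, ...) {
  d2 <- kern(u, u, ...) + kern(v, v, ...) - 2 * kern(u, v, ...)
  sqrt(max(d2, 0))   # guard against tiny negative rounding errors
}

# Hypothetical helper: average kernel-induced distance over all pairs
# (assumed form of the kernel Gini mean difference).
kernel_gmd_sketch <- function(x, kern = weighted_gauss_kernel, ...) {
  x <- as.matrix(x)
  pairs <- combn(nrow(x), 2)   # 2 x choose(n, 2) matrix of index pairs
  mean(apply(pairs, 2, function(ij) {
    kernel_distance(x[ij[1], ], x[ij[2], ], kern, ...)
  }))
}

x <- iris[, 1]
kernel_gmd_sketch(x, sigma = 1)
```

Passing a different kern function changes the geometry without touching the rest of the code, which is the flexibility the RKHS formulation is meant to provide.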

Value

Kgmd returns the sample kernel Gini mean difference.

References

Lyons, R. (2013). Distance covariance in metric spaces. The Annals of Probability, 41 (5), 3284-3305.

See Also

gCov gCor dCor

Examples

x <- iris[,1]
Kgmd(x, sigma=1)

PermutationTest Permutation test of dependence between X and Y using (Gini) distance covariance or correlation statistics

Description

Performs a permutation test using various dependence measures, in which X is quantitative, Y is categorical, alpha is an exponent on the Euclidean distance and sigma is the kernel parameter in kernel methods. Returns the test statistic, critical value, p-value and decision of the test.

Usage

PermutationTest(x, y, method, sigma, alpha, M = 200, level = 0.05)

Arguments

x        data
y        label of data or univariate response variable
method   name of the permutation test method, chosen from the list: dCov, dCor, KdCov, KdCor, gCov, gCor, KgCov, KgCor
sigma    kernel parameter for kernel methods
alpha    exponent on Euclidean distance, in (0,2), the default value = 1
M        number of permutations
level    significance level of the test, the default value = 0.05

Details

$$H_0: X \text{ and } Y \text{ are independent} \iff H_0: F(x \mid Y = 1) = F(x \mid Y = 2) = \cdots = F(x \mid Y = K)$$

PermutationTest computes the p-value of a permutation test based on (Gini) distance covariance or correlation statistics. It is a self-contained R function returning the measure of dependence together with the test results.

The p-value is obtained by a permutation procedure. Let $\hat\rho(\nu)$ be the sample dependence measure based on the original sample indexed by $\nu = \{1, 2, \ldots, n\}$. Let $\pi(\nu)$ denote a permutation of the elements of $\nu$; the corresponding $\hat\rho(\pi)$ is computed on the data with permuted y labels. Under $H_0$, $\hat\rho(\nu)$ and $\hat\rho(\pi)$ are identically distributed for every permutation $\pi$ of $\nu$. Hence, based on $M$ permutations, the critical value $q_\gamma$ is estimated by the $(1 - \gamma)100\%$ sample quantile of $\hat\rho(\pi_m)$, $m = 1, \ldots, M$, and the p-value is estimated by the proportion of $\hat\rho(\pi_m)$ greater than $\hat\rho(\nu)$. Usually $100 \le M \le 1000$ is sufficient for a good estimate of the critical value or p-value. The default value is $M = 200$.
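The whole test, from statistic to decision, fits in a short base R sketch. The names perm_test_sketch, stat_fun and abs_mean_gap are hypothetical; the toy statistic is only a stand-in for the dependence measures listed under method, and the layout of the returned list is an assumption rather than the package's actual return value.

```r
# Hypothetical helper: permutation test for dependence between x and labels y.
perm_test_sketch <- function(x, y, stat_fun, M = 200, level = 0.05) {
  x <- as.matrix(x)
  observed <- stat_fun(x, y)
  # Recompute the statistic on M random permutations of the labels
  perm_stats <- replicate(M, stat_fun(x, sample(y)))
  crit <- unname(quantile(perm_stats, probs = 1 - level))
  list(statistic = observed,
       critical_value = crit,
       # proportion of permuted statistics greater than the observed one
       p_value = mean(perm_stats > observed),
       reject = observed > crit)
}

# Toy statistic (an assumption for illustration): absolute gap of group means
abs_mean_gap <- function(x, y) abs(mean(x[y == 1, 1]) - mean(x[y == 2, 1]))

set.seed(1)
n <- 50
x <- runif(n)
y <- c(rep(1, n / 2), rep(2, n / 2))
perm_test_sketch(x, y, abs_mean_gap, M = 50)
```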

Value

PermutationTest returns the p-value, critical value and decision of the permutation test for the specified method.

See Also

gCor gCov dCor dCov KgCov KgCov KdCov

Examples

n = 50
x <- runif(n)
y <- c(rep(1,n/2),rep(2,n/2))
PermutationTest(x, y, method = "gCor", alpha = 2, M = 50)

RcppgCor Gini Distance Correlation Statistics

Description

Computes Gini distance correlation statistics, in which X is quantitative, Y is categorical and alpha is the exponent on the Euclidean distance, and returns the measure of dependence.

Usage

RcppgCor(x, y, alpha)

Arguments

x       data
y       label of data or univariate response variable
alpha   exponent on Euclidean distance, in (0,2]

Details

RcppgCor computes the Gini distance correlation statistic between x and y. It is an Rcpp version of gCor.

Value

RcppgCor returns the sample Gini distance correlation.

See Also

RcppKgCov RcppKgCor RcppgCov

Examples

x <- iris[,1:4]
y <- unclass(iris[,5])
RcppgCor(x, y, alpha=2)

RcppgCov Gini Distance Covariance Statistics

Description

Computes Gini distance covariance statistics, in which X is quantitative, Y is categorical and alpha is an exponent on the Euclidean distance, and returns the measure of dependence.

Usage

RcppgCov(x, y, alpha)

Arguments

x       data
y       label of data or univariate response variable
alpha   exponent on Euclidean distance, in (0,2]

Details

RcppgCov computes Gini distance covariance statistics. It is an Rcpp version of gCov.

Value

RcppgCov returns the sample Gini distance covariance.

See Also

RcppgCor RcppKgCov RcppKgCor

Examples

x <- iris[,1:4]
y <- unclass(iris[,5])
RcppgCov(x, y, alpha=2)

RcppGmd Gini Mean Difference Statistics

Description

Computes the Gini mean difference of x, where alpha is an exponent on the Euclidean distance, and returns the Gini mean difference. The default value for alpha is 1.

Usage

RcppGmd(x, alpha)

Arguments

x       data
alpha   exponent on Euclidean distance, in (0,2]

Details

RcppGmd computes Gini mean difference statistics for data. It is an Rcpp version of gmd.

Value

RcppGmd returns the sample Gini mean difference of x.

See Also

RcppKgCov RcppgCor gCov gCor

Examples

n = 1000
x <- runif(n)
RcppGmd(x, alpha=1)

RcppKgCor Kernel Gini Distance Correlation Statistics

Description

Computes kernel Gini distance correlation statistics, in which X is quantitative, Y is categorical and sigma is the kernel standard deviation, and returns the kernel Gini distance correlation.

Usage

RcppKgCor(x, y, sigma)

Arguments

x       data
y       label of data or univariate response variable
sigma   kernel standard deviation

Details

RcppKgCor computes kernel Gini distance correlation statistics for data. It is an Rcpp version of KgCor.

Value

RcppKgCor returns the sample kernel Gini distance correlation.

See Also

gCov gCor dCor

Examples

n = 100
x <- runif(n)
y <- c(rep(1,n/2),rep(2,n/2))
RcppKgCor(x, y, sigma=1)

RcppKgCov Kernel Gini Distance Covariance Statistics

Description

Computes kernel Gini distance covariance statistics, in which X is quantitative, Y is categorical and sigma is the kernel standard deviation, and returns the kernel Gini distance covariance.

Usage

RcppKgCov(x, y, sigma)

Arguments

x       data
y       label of data or univariate response variable
sigma   kernel standard deviation

Details

RcppKgCov computes kernel Gini distance covariance statistics for data. It is an Rcpp version of KgCov.

Value

RcppKgCov returns the sample kernel Gini distance covariance.

See Also

gCov gCor dCor

Examples

n = 100
x <- runif(n)
y <- c(rep(1,n/2),rep(2,n/2))
RcppKgCov(x, y, sigma=1)

RcppKGmd Kernel Gini Mean Difference Statistics

Description

Computes the kernel Gini mean difference of X, where sigma is the kernel parameter, and returns the kernel Gini mean difference.

Usage

RcppKGmd(x, sigma)

Arguments

x       data
sigma   kernel parameter for Gaussian kernel

Details

RcppKGmd computes the kernel Gini mean difference for data. It is an Rcpp version of Kgmd.

Value

RcppKGmd returns the sample kernel Gini mean difference.

See Also

gmd Kgmd

Examples

x <- iris[,1]
RcppKGmd(x, sigma=1)
