Estimating Feature-Label Dependence Using Gini Distance Statistics

Silu Zhang, Xin Dang, Member, IEEE, Dao Nguyen, Dawn Wilkins, Yixin Chen, Member, IEEE

Abstract—Identifying statistical dependence between the features and the label is a fundamental problem in supervised learning. This paper presents a framework for estimating dependence between numerical features and a categorical label using generalized Gini distance, an energy distance in reproducing kernel Hilbert spaces (RKHS). Two Gini distance based dependence measures are explored: Gini distance covariance and Gini distance correlation. Unlike Pearson covariance and correlation, which do not characterize independence, the above Gini distance based measures define dependence as well as independence of random variables. The test statistics are simple to calculate and do not require probability density estimation. Uniform convergence bounds and asymptotic bounds are derived for the test statistics. Comparisons with distance covariance statistics are provided. It is shown that Gini distance statistics converge faster than distance covariance statistics in the uniform convergence bounds, hence offering tighter upper bounds on both Type I and Type II errors. Moreover, the probability of the Gini distance covariance statistic under-performing the distance covariance statistic in Type II error decreases to 0 exponentially with the increase of the sample size. Extensive experimental results are presented to demonstrate the performance of the proposed method.

Index Terms—Energy distance, feature selection, Gini distance covariance, Gini distance correlation, distance covariance, reproducing kernel Hilbert space, dependence test, supervised learning.

1 INTRODUCTION

BUILDING a prediction model from observations of features and responses (or labels) is a well-studied problem in machine learning and statistics. The problem becomes particularly challenging in a high dimensional feature space. A common practice in tackling this challenge is to reduce the number of features under consideration, which is in general achieved via feature combination or feature selection.

Feature combination refers to combining high dimensional inputs into a smaller set of features via a linear or nonlinear transformation, e.g. principal component analysis (PCA) [32], independent component analysis (ICA) [16], curvilinear components analysis [20], multidimensional scaling (MDS) [76], nonnegative matrix factorization (NMF) [43], Isomap [75], locally linear embedding (LLE) [58], Laplacian eigenmaps [6], stochastic neighbor embedding (SNE) [30], etc. Feature selection, also known as variable selection, aims at choosing a subset of features that is "relevant" to the response variable [9], [37], [38]. The work proposed in this article belongs to feature selection. Next, we review work most related to ours. For a more comprehensive survey of this subject, the reader is referred to [27], [28], [46], [85].

1.1 Related Work

With a common goal of improving the generalization performance of the prediction model and providing a better interpretation of the underlying process, all feature selection methods are built around the concepts of feature relevance and feature redundancy.

1.1.1 Feature Relevance

The concept of relevance has been studied in many fields outside machine learning and statistics [31]. In the context of feature selection, John et al. [37] defined feature relevance via a probabilistic interpretation where a feature and the response variable are irrelevant if and only if they are conditionally independent given any subset of features. Following this definition, Nilsson et al. [53] investigated distributions under which an optimal classifier can be trained over a minimal number of features. Although the above definition of relevance characterizes the statistical dependence, testing the conditional dependence is in general a challenge for continuous random variables.

Significant effort has been devoted to finding a good trade-off between theoretical rigor and practical feasibility in defining dependence measures. Pearson correlation [55] based methods are among the most popular approaches. Stoppiglia et al. [67] proposed a method for linear models using the squared cosine of the angle between a feature and the response variable. Wei and Billings [83] used a squared correlation function to measure dependence between features. Fan and Lv [23] proposed a correlation method for variable selection within the context of linear models. The method was extended in [24] to handle feature selection in generalized linear models and binary classification. Pearson correlation and its variations are in general straightforward to implement. However, they are sensitive only to linear dependence between two variables. Specifically, Pearson correlation can be zero for dependent random variables. Other correlation measures have been developed to address the above limitation, among which divergence based

• S. Zhang, D. Wilkins, and Y. Chen are with the Department of Computer and Information Science. X. Dang and D. Nguyen are with the Department of Mathematics, University of Mississippi, University, MS 38677, USA.
E-mail: szhang6,xdang,dxnguyen,dwilkins,yixin @olemiss.edu { } approaches have been investigated extensively [5]. In [35], IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. , NO. , 2019 2 Kullback-Leibler (KL) divergence was used to define the into clusters of high within-group dependence and low distance between class conditional density functions. A between-group dependence. Redundant features are then feature subset is chosen to maximize the distance. In [54], removed iteratively from each cluster. Wang et al. [82] features were selected to maximize class separability, which adopted a two-step procedure where important (but re- was defined using a variation of the KL divergence. Javed dundant) features were chosen according to a relevance et al. [36] proposed a feature importance function based on measure, a forward selection search was then applied to KL distance between the joint distribution of a feature and reduce redundancy. In [7], feature redundancy is defined the response variable and their product distribution. In [78], via mutual information between two features. A maximum the dependence between a feature and the target variable is spanning tree is constructed to eliminate redundancy. Tuv defined as the ratio of the mutual information between them et al. [77] augmented with artificial features that were to the entropy of the target variable. Zhai et al. [91] applied independent of the response variable. All features were normalized mutual information to filter out noncontributing ranked by an extended random forest method [10]. Redun- features. Maji and Pal [48] introduced a mutual information dant features were then removed based on the ranking of measure for two fuzzy attribute sets. In [62], Sindhwani et al. artificial features. Wu et al. [84] applied a Markov blanket to presented a method that maximizes the mutual information remove redundant features from streaming data. between the class labels and classifier output. Naghibi et There are many methods that formulate feature selection al. [52] considered mutual information between features and as an optimization problem where redundancy reduction class labels in a parallel search algorithm. is implicitly achieved via optimizing an objective function. As mutual information is hard to evaluate, several al- For example, Shah et al. [61] integrated redundancy control ternatives have been suggested. Kwak and Choi discussed into learning a conjunction of decision stumps, for which the limitations of mutual information and proposed two theoretical performance bounds were derived. Chakraborty approximation methods [41], [42]. Chow and Huang intro- and Pal [13] introduced a learning algorithm for multi- duced an estimator for mutual information using pruned layer neural networks where the error function included a Parzen window estimator and quadratic mutual informa- penalty term associated with feature dependence. Feature tion [15]. Peng et al. presented a max-relevance criterion that redundancy is controlled by minimizing the error function. approximates mutual information between a feature vector Yang and Hu [86] proposed a method that selects a subset of and the class label with the sum of the mutual information feature by minimizing an estimated Bayes error. In [40], fea- between individual features and the label [56]. This avoided ture selection was combined with classifier learning under estimating high-dimensional joint probability. Lefakis and a maximum a posteriori (MAP) framework. 
A heavy-tailed Fleuret proposed a method to maximize bounds on and prior was adopted to promote feature sparsity. He et al. [29] approximations to the mutual information between the se- developed an algorithm based on Laplacian regularized lected features and the class label [44]. In [21], Ding et al. model. Features were chosen to minimize the proposed a robust copula dependence (RCD) measure based “size” of the and the expected prediction on the L1 distance of the coupla density from uniform. error. In [17], the parameters of a mixture model, the number Theoretical supports were provided to show that RCD is of components, and saliency of features were optimized easier to estimate than mutual information. simultaneously. Zhao et al. [90] proposed a method to select The above divergence based approaches treat linear and features that preserve sample similarity. The method is able nonlinear dependence under the same framework. There to remove redundant features. has been researched on tackling nonlinear dependence via Class separation has been widely used as an objective linearization. For example, Song et al. used Hilbert-Schmidt function in redundancy reduction. Bressan and Vitria` [11] Independence Criterion (HISC) to characterize the feature- showed that under the assumption of class-conditional inde- label dependence. HISC defines cross-covariance in a re- pendence, class separability was simplified to a sum of one producing kernel Hilbert space (RKHS) [66]. Sun et al. [68] dimensional KL divergence. Feature selection is performed transformed a nonlinear problem into a set of locally linear via ICA. Wang [81] introduced a class separability criterion problems, for which feature weights were learned via mar- in RKHS. Feature selection is achieved by maximizing the gin maximization. Armanfard et al. [1] proposed a method criterion. Zhou et al. [92] showed that maximizing scatter- to associate an “optimal” feature subset to regions of the matrix based class separability may select discriminative sample space. Yao et al. [87] introduced a feature sparsity yet redundant features. They proposed a redundancy con- score that measures the local structure of each feature. strained optimization to address the issue. Cheng et al. [14] proposed to select the most discriminative feature for classi- 1.1.2 Feature Redundancy fication problems. This led to an optimization problem with Although one may argue that all features dependent on the maximizing the class separations as the objective. response variable are informative, redundant features un- Optimal feature subset selection was investigated under necessarily increase the dimensionality of the learning prob- certain optimization formulations. For example, in [65] a lem, hence may reduce the generalization performance [88]. special class of monotonic feature selection criterion func- Eliminating feature redundancy is, therefore, an essential tions was considered. Branch and Bound algorithms were step in feature selection [56]. developed to search for an optimal feature subset. Under Several methods were proposed to reduce redundancy general conditions, however, a popular practice is to include explicitly via a feature dependence measure. Mitra et al. [51] a regularization term in the optimization to control the formulated feature selection as a feature clustering problem. sparsity of the solution. Mao and Tsang [49] proposed a Through a dependence measure, features are partitioned weighted 1-norm regularizer that favors sparse solutions. 
Damodaran et al. [18] used HISC in a LASSO optimization framework. Wang et al. [79] investigated sparsity regularization for feature selection under an online learning scenario where a small and fixed number of features were maintained. Feature selection for multimodal data was studied in [80]. Multimodal data was projected to a common subspace. The projection and feature selection were obtained simultaneously by solving an optimization problem. Zeng and Cheung [89] applied a sparsity-promoting penalty in learning feature weights in an unsupervised setting. Ren et al. [57] investigated feature selection using a generalized LASSO model. In [3] the idea of annealing was combined with feature selection using a sparsity favoring loss function. The method gradually removes the most irrelevant features.

1.2 An Overview of the Proposed Approach

For problems of large scale (large sample size and/or high feature dimension), feature selection is commonly performed in two steps. A subset of candidate features is first identified via a screening [23] (or a filtering [28]) process based upon a predefined "importance" measure that can be calculated efficiently. The final collection of features is then chosen from the candidate set by solving an optimization problem. Usually, this second step is computationally more expensive than the first step. Hence for problems with very high feature dimension, identifying a subset of "good" candidate features, thus reducing the computational cost of the subsequent optimization algorithm, is essential.

The work presented in this article aims at improving the feature screening process via a new dependence measure. Székely et al. [71], [72] introduced distance covariance and distance correlation, which extended the classical bivariate product-moment covariance and correlation to random vectors of arbitrary dimension. Distance covariance (and distance correlation) characterizes independence: it is zero if and only if the two random vectors are independent. Moreover, the corresponding statistics are simple to calculate and do not require estimating the distribution function of the random vectors. These properties make distance covariance and distance correlation particularly appealing to the dependence test, which is a crucial component in feature selection [8], [45].

Although distance covariance and distance correlation can be extended to handle categorical variables using a metric space embedding [47], Gini distance covariance and Gini distance correlation [19] provide a natural alternative to measuring dependence between a numerical random vector and a categorical random variable. In this article, we investigate selecting informative features for supervised learning problems with numerical features and a categorical response variable using Gini distance covariance and Gini distance correlation. The contributions of this paper are given as follows:

• Generalized Gini distance covariance and Gini distance correlation. We extend Gini distance covariance and Gini distance correlation to RKHS via positive definite kernels. The choice of kernel not only brings flexibility to the dependence tests, but also makes it easier to derive theoretical performance bounds on the tests.
• Simple dependence tests. Gini distance statistics are simple to calculate. We prove that when there is dependence between the feature vector and the response variable, the probability of the Gini distance covariance statistic under-performing the distance covariance statistic approaches 0 with the growth of the sample size.
• Uniform convergence bounds and asymptotic analysis. Under the bounded kernel assumption, we derive uniform convergence bounds for both Type I and Type II errors. Compared with distance covariance and distance correlation statistics, the bounds for Gini distance statistics are tighter. Asymptotic analysis is also presented.

1.3 Outline of the Paper

The remainder of the paper is organized as follows: Section 2 motivates Gini distance covariance and Gini distance correlation from energy distance. We then extend them to RKHS and present a connection between generalized Gini distance covariance and generalized distance covariance. Section 3 provides estimators of Gini distance covariance and Gini distance correlation. Dependence tests are developed using these estimators. We derive uniform convergence bounds for both Type I and Type II errors of the dependence tests. In Section 3.3 we present connections with dependence tests using distance covariance. Asymptotic results are given in Section 3.4. We discuss several algorithmic issues in Section 4. In Section 5, we explain the extensive experimental studies conducted and demonstrate the results. We conclude and discuss possible future work in Section 6.

2 GINI DISTANCE COVARIANCE AND CORRELATION

In this section, we first present a brief review of the energy distance. As an instance of the energy distance, Gini distance covariance is introduced to measure dependence between numerical and categorical random variables. Gini distance covariance and correlation are then generalized to reproducing kernel Hilbert spaces (RKHS) to facilitate convergence analysis in Section 3. Connections with distance covariance are also discussed.

2.1 Energy Distance

Energy distance was first introduced in [4], [69], [70] as a measure of statistical distance between two probability distributions with finite first order moments. The energy distance between the q-dimensional independent random variables X and Y is defined as [73]

\mathcal{E}(X, Y) = 2\,\mathbb{E}|X - Y|_q - \mathbb{E}|X - X'|_q - \mathbb{E}|Y - Y'|_q, \qquad (1)

where |·|_q is the Euclidean norm in R^q, E|X|_q + E|Y|_q < ∞, X' is an iid copy of X, and Y' is an iid copy of Y.

Energy distance has many interesting properties. It is scale equivariant: for any a ∈ R,

\mathcal{E}(aX, aY) = |a|\,\mathcal{E}(X, Y).

q q It is rotation invariant: for any rotation matrix R R × The resulting measure is called Gini distance correlation, ∈ denoted by gCor(X,Y ), which is defined as (RX, RY ) = (X,Y ). E E gCor(X,Y ) Test statistics of an energy distance are in general relatively K pk 2E Xk X q E Xk Xk0 q E X X0 q simple to calculate and do not require density estimation = k=1 | − | − | − | − | − | . (Section 3). Most importantly, as shown in [70], if ϕX and E X X0 q P  | − |  ϕY are the characteristic functions of X and Y , respectively, (4) the energy distance (1) can be equivalently written as provided that E X q + E Xk q < and F is not a de- 2 | | | | ∞ [ϕX (x) ϕY (x)] generate distribution. Gini distance correlation satisfies the (X,Y ) = c(q) −q+1 dx, (2) following properties [19] . E q x q ZR | | 1) 0 gCor(X,Y ) 1. where c(q) > 0 is a constant only depending on q. Thus ≤ ≤ 0 with equality to zero if and only if X and Y are 2) gCor(X,Y ) = 0 if and only if X and Y are inde- identicallyE ≥ distributed. The above properties make energy pendent. distance especially appealing to testing identical distribu- 3) gCor(X,Y ) = 1 if and only if Fk is a single point tions (or dependence). mass distribution almost surely for all k = 1, ..., K. 4) gCor(aRX + b, Y ) = gCor(X,Y ) for all a = 0, q q q6 b R , and any orthonormal matrix R R × . 2.2 Gini Distance Covariance and Gini Distance Corre- ∈ ∈ lation Property 2 are especially useful in testing dependence. Gini distance covariance was proposed in [19] to measure dependence between a numerical random variable X Rq 2.3 Gini Distance Statistics in RKHS ∈ from function F (cumulative distribution function, CDF) Energy distance based statistics naturally generalizes from and a Y with K values L1, ..., LK . If a Euclidean space to metric spaces [47]. By using a pos- we assume the categorical distribution PY of Y is Pr(Y = itive definite kernel (Mercer kernel) [50], distributions are Lk) = pk and the conditional distribution of X given mapped into a RKHS [64] with a kernel induced distance. Y = Lk is Fk, the marginal distribution of X is Hence one can extend energy to a much richer q q K family of statistics defined in RKHS [59]. Let κ : R R × → F (x) = pkFk(x). R be a Mercer kernel [50]. There is an associated RKHS κ q H k=1 of real functions on R with reproducing kernel κ, where X q q the function d : R R R defines a distance in κ, When the conditional distribution of X given Y is the same × → H as the marginal distribution of X, X and Y are independent, dκ(x, x0) = κ(x, x) + κ(x0, x0) 2κ(x, x0). (5) i.e., there is no correlation between them. However, when − q they are dependent, i.e., F = Fk for some k, the depen- Hence Gini distance covariance and Gini distance correla- dence can be measured through6 the difference between the tion are generalized to RKHS, , as Hκ marginal distribution F and conditional distribution F . k gCov (X,Y ) This difference is measured by Gini distance covariance, κ K gCov(X,Y ), which is defined as the expected weighted L2 = pk 2Edκ(Xk,X) Edκ(Xk,Xk0) Edκ(X,X0) , distance between characteristic functions of the conditional − − k=1 and marginal distribution (if the expectation is finite): X   (6) K 2 [ϕk(x) ϕ(x)] gCorκ(X,Y ) gCov(X,Y ) := c(q) pk −q+1 dx, K q R x q k=1 pk 2Edκ(Xk,X) Edκ(Xk,Xk0) Edκ(X,X0) kX=1 Z | | = − − . where c(q) is the same constant as in (2), ϕ and ϕ are P  Edκ(X,X0)  k (7) the characteristic functions for the conditional distribution Fk and marginal distribution F , respectively. 
It follows The choice of kernels allows one to design various tests. In immediately that gCov(X,Y ) = 0 mutually implies inde- this paper, we focus on bounded translation and rotation pendence between X and Y . Based on (1) and (2), the Gini invariant kernels. Our choice is based on the following distance covariance is clearly a weighted energy distance, considerations: hence can be equivalently defined as 1) The boundedness of a positive definite kernel im- gCov(X,Y ) plies the boundedness of the distance in RKHS, K which makes it easier to derive strong (exponential) = pk 2E Xk X q E Xk Xk0 q E X X0 q , convergence inequalities based on bounded devia- | − | − | − | − | − | tions (discussed in Section 3); kX=1  (3) 2) Translation and rotation invariance is an important property to have for testing of dependence. where (Xk,Xk0) and (X,X0) are independent pair variables q from Fk and F , respectively. Same as in R , Gini distance covariance and Gini dis- Gini distance covariance can be standardized to have a tance correlation in RKHS also characterize independence, of [0, 1], a desired property for a correlation measure. i.e., gCovκ(X,Y ) = 0 and gCorκ(X,Y ) = 0 if and only if IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. , NO. , 2019 5 X and Y are independent. This is derived as the following Let and be the RKHS induced by κ and the from the connection between Gini distance covariance and set differenceX kernelY (9), respectively, with the associated distance covariance in RKHS. Distance covariance was in- distance metrics defined according to (5). and are troduced in [71] as a dependence measure between random both separable Hilbert spaces [2], [12] as theyX each haveY a variables X Rp and Y Rq. If X and Y are embedded countable set of orthonormal basis [50]. Hence and are ∈ ∈ X Y into RKHS’s induced by κX and κY , respectively, the gen- of strong negative type (Theorem 3.16 in [47]). Because the eralized distance covariance of X and Y is [59]: metrics on and are bounded, the marginals of (X,Y ) on Xhave finiteY first moment in the sense defined dCovκX ,κY (X,Y ) X × Y in [47]. Therefore, dCovκ(X,Y ) = 0 implies that X and = EdκX (X,X0)dκY (Y,Y 0) + EdκX (X,X0)EdκY (Y,Y 0) Y are independent (Theorem 3.11 [47]).
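To make the default choice of Section 2.3 concrete, the short sketch below computes the pairwise kernel-induced distance matrix under our reading of the (garbled) display: d_κ(x, x') = (1 − exp(−|x − x'|²_q/σ²))^{1/2}, induced by the weighted Gaussian kernel κ(x, x') = ½ exp(−|x − x'|²_q/σ²). This is a minimal NumPy rendering, not the authors' code; the function name kernel_distance_matrix is ours, and any other bounded, translation and rotation invariant kernel could be substituted without changing the analysis.

```python
import numpy as np

def kernel_distance_matrix(X, sigma2=10.0):
    """Pairwise kernel-induced distance for the weighted Gaussian kernel.

    d_kappa(x, x')^2 = kappa(x, x) + kappa(x', x') - 2 kappa(x, x')
                     = 1 - exp(-|x - x'|^2 / sigma^2),
    with kappa(x, x') = 0.5 * exp(-|x - x'|^2 / sigma^2).  All values lie in
    [0, 1), i.e. the distance is bounded, which the convergence analysis in
    Section 3 relies on.
    """
    sq_norms = (X ** 2).sum(axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    np.maximum(sq_dists, 0.0, out=sq_dists)   # guard against tiny negative round-off
    return np.sqrt(1.0 - np.exp(-sq_dists / sigma2))
```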

2E [EX0 dκX (X,X0)EY 0 dκY (Y,Y 0)] . (8) Finally, the proof for gCorκ(X,Y ) follows from the − above and the condition that X is not degenerate. In the case of Y being categorical, one may embed it using a set difference kernel κY , In the remainder of the paper, unless noted otherwise, 2 1 we use the default distance function 2 if y = y0, κY (y, y0) = (9) x x 2 | − 0|q 0 otherwise. 2  dκ(x, x0) = 1 e− σ , This is equivalent to embedding Y as a simplex with edges q − induced by a weighted Gaussian kernel, κ(x, x0) = of unit length [47], i.e., Lk is represented by a K dimensional 2 x x0 1 | − |q vector of all zeros except its k-th dimension, which has the e− σ2 . It is immediate that the above distance function √2 2 value 2 . The distance induced by κY is called the set is translation and rotation invariant and is bounded with distance, i.e., dκY (y, y0) = 0 if y = y0 and 1 otherwise. the range [0, 1). Moreover, using Taylor expansion, it is not Using the set distance, we have the following results on the difficult to show that gCorκ approaches gCor when the generalized distance covariance between a numerical and a kernel parameter σ approaches . categorical random variable. ∞ q Lemma 1. Suppose that X R is from distribution F and Y is 3 DEPENDENCE TESTS ∈ a categorical variable with K values L1, ..., LK . The categorical We first present an unbiased estimator of the generalized distribution PY of Y is P (Y = Lk) = pk and the conditional Gini distance covariance. Probabilistic bounds for large de- distribution of X given Y = Lk is Fk, the marginal distribution viations of the empirical generalized Gini distance covari- K q q of X is F (x) = pkFk(x). Let κX : R R R be k=1 × → ance are then derived. These bounds lead directly to two a Mercer kernel and κY a set difference kernel. The generalized P dependence tests. We also provide discussions on connec- distance covariance dCovκX ,κY (X,Y ) is equivalent to tions with the dependence test using generalized distance covariance. Finally, asymptotic analysis of the test statistics dCovκX ,κY (X,Y ) := dCovκX (X,Y ) K is presented. 2 = p 2EdκX (Xk,X) EdκX (Xk,Xk0) EdκX (X,X0) . k − − k=1 3.1 Estimation X  (10) In Section 2.2, Gini distance covariance and Gini distance From (6) and (10), it is clear that the generalized Gini correlation were introduced from an energy distance point covariance is always larger than or equal to the generalized of view. An alternative interpretation based on Gini distance covariance under the set difference kernel and the difference was given in [19]. This definition yields simple 1 same κX , i.e., point estimators. Let X Rq be a random variable from distribution gCovκX (X,Y ) dCovκX (X,Y ) (11) ∈ ≥ F . Let Y Y = L1, ,LK be a categorical random where they are equal if and only if both are 0, i.e., X and variable with∈ K values{ ··· and Pr(}Y = L ) = p (0, 1). k k ∈ Y are independent. This yields the following theorem. The The conditional distribution of X given Y = Lk is Fk. Let proof of Lemma 1 is given in Appendix ??. (X,X0) and (Xk,Xk0) be independent pair variables from F and F , respectively. The Gini distance covariance (3) and Theorem 2. For any bounded Mercer kernel κ : Rq Rq R, k × → Gini distance correlation (4) can be equivalently written as gCovκ(X,Y ) = 0 if and only if X and Y are independent. The K same result holds for gCorκ(X,Y ) assuming that the marginal distribution of X is not degenerate. gCov(X,Y ) = ∆ pk∆k, (12) − kX=1 Proof. 
The proof of the sufficient part for gCovκ(X,Y ) is K ∆ pk∆k immediate from the definition (6). The inequality (11) sug- gCor(X,Y ) = − k=1 , (13) ∆ gests that dCovκ = 0 when gCovκ = 0. Hence the proof of P the necessary part is complete if we show that dCovκ = 0 where ∆ = E X X0 q and ∆k = E Xk Xk0 q are the | − | | − q | implies independence. This is proven as the following. Gini mean difference (GMD) of F and Fk in R [25], [26],

1. The inequality holds for Gini covariance and distance covariance as well, i.e., gCov(X,Y) ≥ dCov(X,Y) where X ∈ R^d and Y is categorical. The notations gCov_{κX}(X,Y) and gCov_κ(X,Y) are used interchangeably, with both κX and κ representing a Mercer kernel.
2. Since any bounded translation and rotation invariant kernels can be normalized to define a distance function with the maximum value no greater than 1, the results in Section 2.3 and Section 3 hold for these kernels as well.

[39], respectively. This suggests that Gini distance covariance is a measure of between-group variation and Gini distance correlation is the ratio of between-group variation and the total Gini variation. Replacing |·|_q with d_κ(·,·) in (12) and (13) yields the GMD version of (6) and (7).

Given an iid sample D = {(x_i, y_i) ∈ R^q × Y : i = 1, ..., n}, let I_k be the index set of sample points with y_i = L_k. The probability p_k is estimated by the sample proportion of category k, i.e., p̂_k = n_k/n where n_k = |I_k| ≥ 2. The point estimators of the generalized Gini distance covariance and Gini distance correlation for a given kernel κ are

\mathrm{gCov}^n_\kappa := \hat{\Delta} - \sum_{k=1}^{K} \hat{p}_k \hat{\Delta}_k, \qquad (14)

\mathrm{gCor}^n_\kappa := \frac{\hat{\Delta} - \sum_{k=1}^{K} \hat{p}_k \hat{\Delta}_k}{\hat{\Delta}}, \qquad (15)

where

\hat{\Delta}_k = \binom{n_k}{2}^{-1} \sum_{i<j;\ i,j \in \mathcal{I}_k} d_\kappa(x_i, x_j), \qquad (16)

\hat{\Delta} = \binom{n}{2}^{-1} \sum_{i<j} d_\kappa(x_i, x_j). \qquad (17)

Theorem 3. gCov^n_κ is an unbiased estimator of gCov_κ(X,Y).

Proof. Clearly, Δ̂_k and Δ̂ are unbiased because they are U-statistics of size 2. Also p̂_k Δ̂_k is unbiased since E[p̂_k Δ̂_k] = E E[p̂_k Δ̂_k | n_k] = E[(n_k/n) Δ_k] = p_k Δ_k. This leads to the unbiasedness of gCov^n_κ.

3.2 Uniform Convergence Bounds

We derive two probabilistic inequalities, from which dependence tests using the point estimators (14) and (15) are established.

Theorem 4. Let D = {(x_i, y_i) ∈ R^q × Y : i = 1, ..., n} be an iid sample of (X,Y) and κ a Mercer kernel over R^q × R^q that induces a distance function d_κ(·,·) with bounded range [0, 1). For every ε > 0,

\Pr\left[\mathrm{gCov}^n_\kappa - \mathrm{gCov}_\kappa(X,Y) \ge \epsilon\right] \le \exp\left(-\frac{n\epsilon^2}{12.5}\right), and

\Pr\left[\mathrm{gCov}_\kappa(X,Y) - \mathrm{gCov}^n_\kappa \ge \epsilon\right] \le \exp\left(-\frac{n\epsilon^2}{12.5}\right).

Theorem 5. Under the condition of Theorem 4, for every ε > 0,

\Pr\left[\hat{\Delta} - \Delta \ge \epsilon\right] \le \exp\left(-\frac{n\epsilon^2}{2}\right), and

\Pr\left[\Delta - \hat{\Delta} \ge \epsilon\right] \le \exp\left(-\frac{n\epsilon^2}{2}\right).

Proofs of Theorem 4 and Theorem 5 are given in Appendix ??. Next we consider a dependence test based on gCov^n_κ. Theorem 2 shows that gCov_κ(X,Y) = 0 mutually implies that X and Y are independent. This suggests the following null and alternative hypotheses:

H_0: gCov_κ(X,Y) = 0,
H_1: gCov_κ(X,Y) ≥ 2cn^{-t}, c > 0 and t > 0.

The null hypothesis is rejected when gCov^n_κ ≥ cn^{-t} where c > 0 and t ∈ (0, 1/2]. Next we establish upper bounds for the Type I and Type II errors of the above dependence test.

Corollary 6. Under the conditions of Theorem 4, the following inequalities hold for any c > 0 and t ∈ (0, 1/2]:

Type I: \Pr\left[\mathrm{gCov}^n_\kappa \ge c n^{-t} \mid H_0\right] \le \exp\left(-\frac{c^2 n^{1-2t}}{12.5}\right), \qquad (18)

Type II: \Pr\left[\mathrm{gCov}^n_\kappa \le c n^{-t} \mid H_1\right] \le \exp\left(-\frac{c^2 n^{1-2t}}{12.5}\right). \qquad (19)

Proof. Let ε = cn^{-t}. The Type I bound is immediate from Theorem 4. The Type II bound is derived from the following inequality and Theorem 4:

\Pr\left[\mathrm{gCov}^n_\kappa \le c n^{-t} \mid H_1\right]
\le \Pr\left[c n^{-t} - \mathrm{gCov}^n_\kappa + \mathrm{gCov}_\kappa(X,Y) - 2c n^{-t} \ge 0 \mid H_1\right]
= \Pr\left[\mathrm{gCov}_\kappa(X,Y) - \mathrm{gCov}^n_\kappa \ge c n^{-t} \mid H_1\right].

A dependence test using gCor^n_κ is constructed similarly with the following null and alternative hypotheses:

H_0: gCor_κ(X,Y) = 0,
H_1: gCor_κ(X,Y) ≥ 2cn^{-t},

where c > 0 and t > 0. The null hypothesis is rejected when gCor^n_κ ≥ cn^{-t} where c > 0 and t ∈ (0, 1/4]. Type I and Type II bounds are presented below.

Corollary 7. Under the conditions of Theorem 4 and Theorem 5 where additionally Δ ≥ 2n^{-t}, the following inequalities hold for any c > 0 and t ∈ (0, 1/4]:

Type I: \Pr\left[\mathrm{gCor}^n_\kappa \ge c n^{-t} \mid H_0\right] \le \exp\left(-\frac{c^2 n^{1-4t}}{12.5}\right) + \exp\left(-\frac{n^{1-2t}}{2}\right), \qquad (20)

Type II: \Pr\left[\mathrm{gCor}^n_\kappa \le c n^{-t} \mid H_1\right] \le \exp\left(-\frac{c^2 n^{1-2t}}{12.5}\right). \qquad (21)

Proof. From (15), we have

\Pr\left[\mathrm{gCor}^n_\kappa \ge c n^{-t} \mid H_0\right]
\le \Pr\left[\mathrm{gCov}^n_\kappa \ge c n^{-2t} \ \text{OR}\ \hat{\Delta} \le n^{-t} \mid H_0\right]
\le \Pr\left[\mathrm{gCov}^n_\kappa \ge c n^{-2t} \mid H_0\right] + \Pr\left[\hat{\Delta} \le n^{-t} \mid H_0\right]
\le \Pr\left[\mathrm{gCov}^n_\kappa \ge c n^{-2t} \mid H_0\right] + \Pr\left[\Delta - \hat{\Delta} \ge n^{-t} \mid H_0\right].

Let ε_1 = cn^{-2t} and ε_2 = n^{-t}. The Type I bound is derived from Theorem 4 and Theorem 5. The boundedness of d_κ(·,·) implies that Δ̂ < 1. Therefore,

\Pr\left[\mathrm{gCor}^n_\kappa \le c n^{-t} \mid H_1\right]
\le \Pr\left[\mathrm{gCov}^n_\kappa \le c n^{-t} \mid H_1\right]
\le \Pr\left[\mathrm{gCov}_\kappa(X,Y) - \mathrm{gCov}^n_\kappa \ge c n^{-t} \mid H_1\right].

Hence the Type II bound is given by Theorem 4 with ε = cn^{-t}.
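As a concrete illustration of the estimators above, the following sketch computes gCov^n_κ (14) and gCor^n_κ (15) from a sample using the bounded Gaussian-induced distance of Section 2.3. It is a minimal NumPy rendering under our reading of the formulas, not the authors' implementation; the function name gini_distance_stats and the default σ² = 10 (the value used later in the experiments) are our own choices.

```python
import numpy as np

def gini_distance_stats(X, y, sigma2=10.0):
    """Point estimators gCov_kappa^n (14) and gCor_kappa^n (15).

    X : (n, q) array of numerical features; y : (n,) array of class labels.
    Delta_hat (17) and Delta_hat_k (16) are averages of d_kappa over all
    unordered pairs, i.e. U-statistics of size 2, so gCov_kappa^n is
    unbiased (Theorem 3).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    n = len(y)
    # d_kappa(x, x') = sqrt(1 - exp(-|x - x'|^2 / sigma^2)), bounded in [0, 1)
    sq = (X ** 2).sum(axis=1)
    sq_dists = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    D = np.sqrt(1.0 - np.exp(-sq_dists / sigma2))

    iu = np.triu_indices(n, k=1)
    delta_hat = D[iu].mean()                       # (17): average over C(n, 2) pairs

    gcov = delta_hat
    for label in np.unique(y):
        idx = np.flatnonzero(y == label)
        n_k = len(idx)
        if n_k < 2:
            raise ValueError("each label needs at least two samples")
        Dk = D[np.ix_(idx, idx)]
        delta_hat_k = Dk[np.triu_indices(n_k, k=1)].mean()   # (16)
        gcov -= (n_k / n) * delta_hat_k            # subtract p_hat_k * Delta_hat_k
    return gcov, gcov / delta_hat                  # (14), (15)

# Example: X dependent on a 3-class label y.
rng = np.random.default_rng(0)
y = rng.integers(0, 3, size=300)
X = rng.normal(loc=y[:, None], scale=1.0, size=(300, 2))
print(gini_distance_stats(X, y))   # both statistics noticeably above 0
```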

3.3 Connections to Generalized Distance Covariance are tighter than those on the generalized distance covariance In Section 2.3, generalized Gini distance covariance is re- based dependence test. lated to generalized distance covariance through (11). Under To further compare the two dependence tests, we con- sider the following null and alternative hypotheses: the conditions of Lemma 1, dCovκX ,κY (X,Y ) = 0 if and only if X and Y are independent. Hence dependence tests H0 : S(X,Y ) = 0, similar to those in Section 3.2 can be developed using H1 : S(X,Y ) , > 0, empirical estimates of dCovκX ,κY (X,Y ). Next, we estab- ≥ T T lish a result similar to Theorem 4 for generalized distance where S(X,Y ) = gCovκX (X,Y ) or dCovκX (X,Y ) with n n covariance. We demonstrate that generalized Gini distance the corresponding test statistics Sn = gCovκX or dCovκX , covariance has a tighter probabilistic bound for large devia- respectively. The null hypothesis is rejected when Sn τ where 0 < τ . Note that this test is more general≥ tions than its generalized distance covariance counterpart. ≤ T Using the unbiased estimator for distance covariance than the dependence test discussed in Section 3.2, which is t t developed in [74], we generalize it to an unbiased es- a special case with = 2cn− and τ = cn− . Upper bounds on Type I errors followT immediately from (18) by replacing timator for dCovκX ,κY (X,Y ) defined in (8). Let = q p D cn t with τ. Type II error bounds, however, are more (xi, yi) R R : i = 1, , n be an iid sample from − { ∈ × ··· } difficult to derive due to the fact that τ = would make the joint distribution of X and Y . Let A = (aij) be a T symmetric, n n, centered kernel distance matrix of sample nonexistent. Next, we take a different approach by × establishing which one of gCovn and dCovn is less likely x1, , xn. The (i, j)-th entry of A is κX κX ··· to underperform in terms of Type II errors. 1 1 1 aij n 2 ai n 2 a j + (n 1)(n 2) a , i = j; Under the Aij = − − · − − · − − ·· 6 (0, i = j, H 0 : dCov (X,Y ) , > 0, 1 κX ≥ T T n n where aij = dκX (xi, xj), ai = j=1 aij, a j = i=1 aij, we compare two dependence tests: · · and a = n a . Similarly, using d (y , y ), a symmet- n i,j=1 ij κY i j accepting H10 when gCovκ τ, 0 < τ ; ·· P P • nX ≥ ≤ T ric, n n, centered kernel distance matrix is calculated for accepting H 0 when dCov τ, 0 < τ . × P • 1 κX samples y , , y and denoted by B = (b ). An unbiased ≥ n ≤ T 1 ··· n ij We call that “gCovn underperforms dCov ” if and only estimator of dCov (X,Y ) is given as κX κX κX ,κY if 1 gCovn < τ dCovn , dCovn = A B . (22) κX ≤ κX κX ,κY n(n 3) ij ij i=j i.e., the dependence between X and Y is detected by − X6 n n dCovκ but not by gCovκ . The following theorem demon- We have the following result on the concentration of X X n n strates an upper bound on the probability that gCovκX dCovκX ,κY around dCovκX ,κY (X,Y ). n underperforms dCovκX . q p Theorem 8. Let = (xi, yi) R R : i = 1, , n D { ∈ ×q q ··· } Theorem 9. Under H10 and conditions of Theorem 8, there exists be an iid sample of (X,Y ). Let κX : R R R and p p × → γ > 0 such that the following inequality holds for any > 0 and κY : R R R be Mercer kernels. dκX ( , ) and dκY ( , ) T × → · · · · 0 < τ : are distance functions induced by κX and κY , respectively. Both ≤ T n n distance functions have a bounded range [0, 1). For every  > 0, Pr gCovκX underperforms dCovκX H10 2 2 | 2  nγ nγ  n n exp − + exp − . Pr dCov dCovκX ,κY (X,Y )  exp − , ≤ 12.5 512 κX ,κY − ≥ ≤ 512       Proof. 
Lemma 1 implies that gCov (X,Y ) and  κX ≥ dCovκ (X,Y ) where the equality holds if and only if n2 X Pr dCov (X,Y ) dCovn  exp − . both are 0, i.e., X and Y are independent. Therefore, under κX ,κY − κX ,κY ≥ ≤ 512   H10, for any > 0 and 0 < τ , we define   T ≤ T The proof is provided in Appendix ??. Note that the gCov (X,Y ) dCovκ (X,Y ) γ = κX − X > 0. above result is established for both X and Y being numer- 2 K ical. When Y is categorical, it can be embedded into R It follows that using the set difference kernel (9). Therefore, in the follow- n n 0 ing discussion, we use the simpler notation introduced in Pr gCovκX underperforms dCovκX H1 n n | Lemma 1 where dCovκ ,κ is denoted by dCovκ . = Pr gCov < τ dCov H 0 X Y X  κX κX 1  n ≤ n | The upper bounds for generalized Gini distance covari- Pr gCov < dCov H 0  κX κX 1  ance is clearly tighter than those for generalized distance ≤ | Pr gCov (X,Y ) gCovn γ OR covariance. Replacing gCov (X,Y ) in H and H with  κX κX κ 0 1 ≤ n − ≥ dCov dCovκ (X,Y ) γ H10 dCovκX (X,Y ), one may develop dependence tests parallel  κX X − n≥ | to those in Section 3.2: reject the null hypothesis when Pr gCovκ (X,Y ) gCovκ γ H10 n X X  t 1 ≤ n − ≥ | dCovκX cn− where c > 0 and t 0, 2 . Upper bounds + Pr dCovκ dCovκX (X,Y ) γ H10 on Type≥ I and Type II errors can be∈ established in a result  X − ≥ |   nγ2 nγ2 similar to Corollary 6 with the only difference being replac- exp − + exp −  ≤ 12.5 512 ing the constant 12.5 with 512. Hence the bounds on the     generalized Gini distance covariance based dependence test where the last step is from Theorem 4 and Theorem 8. IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. , NO. , 2019 8 2 ˆ 3.4 Asymptotic Analysis with 0 mean and variance σv. With the result of b = ( pˆ , ..., pˆ , 1)T being a consistent estimator of b and by We now present asymptotic distributions for the proposed − 1 − K Gini covariance and Gini correlation. the Slutsky’s theorem, we have the same limiting normal n ˆT ˆ T ˆ 2 distribution for gCovκX = b δ as that of b δ. Therefore, Theorem 10. Assume E(dκ(X,X0)) < and pk > 0 for k = ∞ n n the result of (23) is proved. 1, ..., K. Under dependence of X and Y , gCov and gCor κX κX However, under the independence assumption, σ2 = 0 have the asymptotic normality property. That is, v because D √n(gCovn gCov (X,Y )) (0, σ2), (23) κX κX v h(x) − −→N 2 n D σv √n(gCor gCor (X,Y )) (0, ), (24) = 2 pk(Edκ(xk,X) Edκ(xk,Xk)) 2(∆ pk∆k) κX κX 2 − − − − −→N ∆ k k 2 X X where σv is given in the proof. 0, n n ≡ Under independence of X and Y , gCovκX and gCorκX converge in distribution, respectively, according to resulting from the same distribution of X and Xk. This corresponds to the degenerate case of U-statistics and bT δˆ 2 D ∞ has a mixture of χ distributions [60]. Hence the result of n(gCovn ) λ (χ2 1), (25) κX −→ l 1l − (25) holds. Xl=1 n D 1 ∞ 2 One way to use the results of (23) and (24) is to test n(gCor ) λl(χ 1), (26) κX −→ ∆ 1l − H0 based on the confidence interval approach. More specifi- Xl=1 cally, an asymptotically (1 α)100% confidence interval for − where λ1, ... are non-negative constants dependent on F and gCovκ (X,Y ) is 2 2 2 X χ11, χ12, ..., are independent χ1 variates. 2 n σˆv gCovκ (X,Y ) Z1 α/2 , Note that the boundedness of the positive definite kernel X − √ 2 ± n κ implies the condition of E(d (X,X0)) < . κ ∞ 2 2 where σˆv is a consistent estimator of σv and Z1 α/2 is the Proof. 
We focus on a proof for the generalized Gini distance 1 α/2 quantile of the standard normal random− variable. covariance and results for the correlation follow immedi- If− this interval does not contain 0, we can reject H at ˆ 0 ately from Slutsky’s theorem [63] and the fact that ∆ is a significance level α/2. This test controls Type II error to be consistent estimator of ∆. α/2. Let g(x) = Edκ(x, X0) Edκ(X,X0). With the U-statistic − On the other hand, if a test to control Type I error is theorem, we have preferred, we usually need to rely on a permutation test √n(∆ˆ ∆) D N(0, v2), rather than the results of (25) and (26) since λ’s depend − −→ on the distribution F , which is unknown. Details of the 2 2 2 where v = 4Eg (X) = 4 k pkEg (Xk). Similarly, let permutation test are in the next section. gk(x) = Edκ(x, Xk0 ) Edκ(Xk,Xk0 ) for k = 1, 2, ..., K and 2 2 − P w = 4Egk(Xk) . We have k 4 AN ALGORITHMIC VIEW D 2 √nk(∆ˆ k ∆k) N(0, w ), for k = 1, 2, ..., K. Although the uniform convergence bounds for generalized − −→ k Gini distance covariance and generalized Gini distance Since n = npˆ np in probability, by Slutsky’s theorem k k k correlation in Section 3.2 and Section 3.3 are established we are able to write→ this result as upon the bounded kernel assumption, all the results also ˆ D 2 2 2 √n(∆k ∆k) N(0, vk), where vk = wk/pk. hold for Gini distance covariance and Gini distance corre- − −→ lation if the features are bounded. This is because when Let Σ be the variance and covariance matrix for T the features are bounded, they can be normalized so that g˜ = 2(g1(X1), ..., gK (XK ), g(X)) , where X = Xk with sup x x = 1 T x,x0 0 q . probability pk. In other words, Σ = Eg˜g˜ . Denote The| calculation− | of test statistics (14) and (15) requires ˆ ˆ ˆ T ˆ T (∆1, ..., ∆K , ∆) as δ and (∆1, ..., ∆K , ∆) as δ. From the evaluating distances between all unique pairs of samples. D U-statistic theorem [33], we have √n(δˆ δ) N(0, Σ). Its time complexity is therefore Θ(n2), where n is the 2 − 2 −→2 Obviously, Σ has diagonal elements v1, ..., vK , v . Let b = sample size. In the one dimension case, i.e., q = 1, Gini T 3 ( p1, ..., pK , 1) be the gradient vector of gCovκX (X,Y ) distance statistics can be calculated in Θ(n log n) time [19] . − − 2 T with respect to δ. Then σv = b Σb > 0 under the assump- Note that distance covariance and distance correlation can tion of dependence of X and Y , since also be calculated in Θ(n log n) time [34]. Nevertheless, the implementation for Gini distance statistics is much simpler h(x) := bT g˜(x) = 2 p (g(x ) g (x )) k k − k k as it does not require the centering process. Xk Generalized Gini distance statistics are functions of the n n = 2 pk(Edκ(xk,X) Edκ(xk,Xk)) 2(∆ pk∆k) kernel parameter σ. Figure 1(a) shows gCov and gCor − − − κ κ Xk Xk of X1 and Y1 for n = 2000. The numerical random vari- = 0 able X is generate from a mixture of two dimensional 6 1 2 2 and σv = k pkE[h(Xk) ]. In this case, by the Delta 3. When the inner produce kernel κ(x, x ) = xT x is chosen, gener- T 0 0 method, √nb (δˆ δ) is asymptotically normally distributed alized Gini distance statistics reduces to Gini distance statistics. P − IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. , NO. , 2019 9

T normal distributions: N1 [1, 2] , diag[2,.5] ,N2 Repeating the random permutation many times, we may [ 3, 5]T , diag[1, 1] , and∼N N [ 1, 2]T , diag[2, 2]∼. estimate the critical value for a given significance level 3  TheN − three− components have equal∼ mixing N − proportions. The α based on the statistics of the permuted data. A simple   categorical variable Y y , y , y is independent of X . approach is to use the defined by α, e.g., when 1 ∈ { 1 2 3} 1 The results in Figure 1(b) are calculated from X2 and Y2 for α = 0.05 the critical value is 95-th percentile of the test n = 2000. The numerical random variable X2 is generated statistic of the permuted data. The null hypothesis H0 is by Ni if and only if Y2 = yi, i = 1, 2, 3. The categorical rejected when the test statistic is larger than the critical 1 distribution of Y2 is Pr(Y2 = yi) = 3 . It is clear that X2 value. and Y2 are dependent on each other. Figure 1 shows the Figure 2 shows the permutation test results of data impact of kernel parameter σ on the estimated generalized generated from the same distributions used in Figure 1. n Gini distance covariance and Gini distance correlation. As The plots on the top are gCovκ and the critical values. n a result, this affects the Type I and Type II error bounds The plots at the bottom are gCorκ and the critical values. given in Section 3.2. In this example, under H0 (or H1), Test statistics are calculated from 500 samples. The critical n 2 the minimum (or maximum) gCovκ is achieved at σ = 50 values are estimated from 5000 random permutations at (or σ2 = 29). These extremes yield tightest bounds in (18) α = 0.05. As illustrated in Figure 2(a), when X and Y 4 n and (19) . Note that gCovκ is an unbiased estimate of are independent, the permutation tests do not reject H0 at n gCovκ. Although gCovκ can never be negative, gCovκ can significance level 0.05. Figure 2(b) shows that when X and be negative, especially under H0. Y are dependent, H0 is rejected at significance level 0.05 This example also suggests that in addition to the theo- by the permutation tests. It is interesting to note that the retical importance, the inequalities in (18) and (19) may be decision to reject or accept H0 is not influenced by the value directly applied to dependence tests. Given a desired bound of the kernel parameter σ. 5 (or significance level), α, on Type I and Type II errors, we call the value that determines whether H0 should be rejected 5 (hence to accept H1) the critical value of the test statistic. n We first compare Gini distance statistics with distance Based on (18) and (19), the critical value for gCovκ, cv(α, n), which is a function of α and the sample size n, is calculated statistics on artificial datasets where the dependent features as are known. We then provide comparisons on the MNIST 12.5 log 1 dataset, a breast cancer dataset and 15 UCI datasets. For cv(α, n) = α . these real datasets, we also include Pearson R2 as a method s n under comparison. The three horizontal dashed lines in Figure 1(b) illustrate the critical values for α = 0.01, α = 0.05, and α = 0.15, 5.1 Simulation Results respectively. The population Gini distance covariance esti- In this , we compare dependence tests using four mated using 20, 000 iid samples is not included in the figure n n n n n statistics, dCov , dCor , gCov , and gCor , on artificial because of its closeness to gCovκ. With a proper choice of κ κ κ κ datasets. 
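The two decision rules described above can be sketched as follows: the distribution-free critical value cv(α, n) = sqrt(12.5 log(1/α)/n) obtained from the bounds (18)–(19), and a permutation estimate of the critical value as the (1 − α) percentile of the statistic computed on label-shuffled data. This is our own sketch, not the authors' code; it reuses the hypothetical gini_distance_stats helper from the Section 3.1 sketch, and 5000 permutations simply mirrors the number used for Figure 2.

```python
import numpy as np

def distribution_free_cv(alpha, n):
    """cv(alpha, n) = sqrt(12.5 * log(1 / alpha) / n), from the Type I bound (18)."""
    return np.sqrt(12.5 * np.log(1.0 / alpha) / n)

def permutation_cv(X, y, alpha=0.05, n_perm=5000, sigma2=10.0, seed=0):
    """Critical value as the (1 - alpha) percentile of gCov_kappa^n on permuted labels."""
    rng = np.random.default_rng(seed)
    null_stats = np.empty(n_perm)
    for b in range(n_perm):
        y_perm = rng.permutation(y)              # shuffling y breaks any X-Y dependence
        null_stats[b] = gini_distance_stats(X, y_perm, sigma2)[0]
    return np.quantile(null_stats, 1.0 - alpha)

# Reject H0 (independence) when the observed statistic exceeds the critical value:
# gcov, _ = gini_distance_stats(X, y, sigma2=10.0)
# reject = gcov > permutation_cv(X, y, alpha=0.05)
```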
The data were generated from three distribution σ, H1 should be accepted based on the 2000 samples of families: normal, exponential, and Gamma distributions (X2,Y2) with both Type I and Type II errors no greater than under both H (X and Y are independent) and H (X 0.05. Note that we could not really accept H1 at the level 0 1 and Y are dependent). Given a distribution family, we first α = 0.01 because the estimated maximum gCovκ is around 0.28, which is smaller than 0.3393 (two times the critical randomly choose a distribution F0 and generate n iid sam- value at α = 0.01). ples of X. Samples of the categorical Y are then produced The above test, although simple, has two limitations: independent of X. Repeating the process, we create a total of m independent datasets under H0. The test statistics of these Choosing an optimal σ is still an open problem. • m independent datasets are used in calculating the critical Numerical search is computationally expensive even values for different Type I errors. In the dependent case, X is if it is in one dimension; K produced by F = k=1 pkFk, a mixture of K distributions The simplicity of the distribution free critical value • where K is the number of different values that Y takes, cv(α, n) comes with a price: it might not be tight P pk is the probability that Y = yk, and Fk is a distribution enough for many distributions, especially when n is from the same family that yields the data under H0. The small. dependence between X and Y is established by the data Therefore, we apply permutation test [22], a commonly generating process: a sample of X is created by Fk if Y = yk. used statistical tool for constructing distributions, The mixture model is randomly generated, i.e., pk and Fk are both randomly chosen. For each Y = y (k = 1, ,K), to handle scenarios that the test based on cv(α, n) is not k ··· feasible. We randomly shuffle the samples of X and keep nk = n pk iid samples of X are produced following Fk. · K the samples of Y untouched. We expect the generalized This results in one data set of size n = k=1 nk under H1. Following the same procedure, we obtain m independent Gini distance statistics of the shuffled data should have P values close to 0 because the random permutation breaks data sets under H1 each corresponding to a randomly the dependence between samples of X and samples of Y . selected K-component mixture model F . The statistic, defined as 1 Type II error, is then estimated 4. The kernel parameter σ also affects the Type I and Type II error − n n bounds for gCorκ in (20) and (21). The Type I error bound for gCovκ is 5. Although the decision to reject or accept H0 is not affected by σ, n significantly tighter than that for gCorκ. the p-value of the test does vary with respect to σ. IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. , NO. , 2019 10


Fig. 1: Estimates of the generalized Gini distance covariance and generalized Gini distance correlation for different kernel parameters using 2000 iid samples: (a) independent case; (b) dependent case. Three critical values are shown in (b). They are calculated for significance levels 0.01, 0.05, and 0.15, respectively. In terms of the uniform convergence bounds, the optimal value of the kernel parameter σ is defined by the minimizer (or maximizer) of the test statistics under H0 (or H1).
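Curves like those in Figure 1 can be traced by simply evaluating the statistics over a grid of kernel parameters; a small sketch, again reusing the hypothetical gini_distance_stats helper from the Section 3.1 sketch, is given below.

```python
import numpy as np

def statistics_vs_sigma(X, y, sigma2_grid):
    """Evaluate gCov_kappa^n and gCor_kappa^n over a grid of kernel parameters sigma^2."""
    values = np.array([gini_distance_stats(X, y, sigma2=s2) for s2 in sigma2_grid])
    return values[:, 0], values[:, 1]            # gCov curve, gCor curve

# e.g. gcov_curve, gcor_curve = statistics_vs_sigma(X, y, np.linspace(1.0, 100.0, 50))
```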


Fig. 2: Permutation tests of the generalized Gini distance covariance and generalized Gini distance correlation for different kernel parameters using 200 iid samples and 5000 random permutations: (a) independent case; (b) dependent case. The 95 percentile curves define the critical values for α = 0.05. They are calculated from the permuted data. Test statistics higher (lower) than the critical value suggest accepting H1 (H0).
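For reference, the sketch below generates one dataset under H0 and one under H1 for the normal family of the simulation study (Table 1): a randomly drawn model, random mixing proportions p_k, and class-conditional sampling under H1. It is only our sketch of the protocol as we read it; in particular it uses the usual N(µ, σ) parametrization, and the paper additionally discards draws that leave a class with fewer than two points.

```python
import numpy as np

def simulate_normal_family(n=100, K=3, seed=0):
    """One (x, y) dataset under H0 (x independent of y) and one under H1 (x | y=k ~ F_k)."""
    rng = np.random.default_rng(seed)
    u = rng.uniform(0.0, 1.0, K)
    p = u / u.sum()                        # random class proportions p_k = u_k / sum(u)
    y = rng.choice(K, size=n, p=p)
    # H0: a single randomly chosen F_0, drawn independently of y
    mu0, sigma0 = rng.normal(0.0, 5.0), rng.uniform(0.0, 5.0)
    x_h0 = rng.normal(mu0, sigma0, size=n)
    # H1: class-conditional distributions F_1, ..., F_K from the same family
    mus, sigmas = rng.normal(0.0, 5.0, K), rng.uniform(0.0, 5.0, K)
    x_h1 = rng.normal(mus[y], sigmas[y])
    return (x_h0, y), (x_h1, y)
```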

TABLE 1: Models of Different Distribution Families using these m independent data sets under H1 and the critical values computed from the m data sets under H0. p(x θ) θ In our experiments, n = 100 and m = 10, 000. Table 1 | (x µ)2 1 − summarizes the model parameters of the three distribution Normal e− σ2 µ (0, 5), σ (0, 5) √2πσ2 λx ∼ N ∼ U families. I( ) is the indicator function. (a, b) denotes the Exponential λe− I(x 0) λ (0, 5) · N α α 1 βx ≥ ∼ U with mean a and Gamma β x − e− I(x 0) α (0, 10),β (0, 10) Γ(α) ≥ ∼ U ∼ U b. (a, b) denotes the uniform distribution over interval uk Proportions pk = K , uk (0, 1) U k=1 uk ∼ U [a, b]. A distribution (F0 or Fk) is randomly chosen via its n P parameter(s). The unbiasedness of gCovκ requires that there are at least two data points for each value of Y . Therefore, random proportions that do not meet this requirement are tics under different values of K for the three distribution removed. families with a fixed kernel parameter (σ2 = 10). Two Table 2 illustrates the performance of the four test statis- measures are computed from ROC: Power(α = 0.05) and IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. , NO. , 2019 11 Area Under Curve (AUC), where Power(α = 0.05) is calcu- 2) When the data has only 2 classes, i.e., K = 2, lated from ROC at Type I error is 0.05. Both measures have dCovκ(X,Y ) = 2p1p2gCovκ(X,Y ). values between 0 and 1 with a value closer to 1 indicating n n better performance. The highest power and AUC among Hence, when n is sufficiently large, dCovκ and gCovκ will the four test statistics are shown in bold and the second have the same ranking for the features. n The difference between Gini and distance statistics is highest are underlined. In this experiment, gCovκ appears to be the most competitive test statistics in terms of ROC more observable in the value range, as shown in Fig. 3(b). n Both dCorn and gCorn are bounded between 0 and 1, related measures at all values of K. In addition, both gCovκ κ κ n n n n n and gCor outperform dCov and dCor in most of the but clearly gCorκ takes a much wider range than dCorκ. κ κ κ n cases. We also tested the influence of σ2 and observed stable Therefore, gCorκ is a more sensitive measure of dependence n n n results (figures are provided in supplementary materials). than dCorκ. gCovκ is also more sensitive than dCovκ as shown both empirically in Fig. 3(b) and theoretically by (11). 5.2 The MNIST Dataset We first tested feature selection methods using different test 5.3 The Breast Cancer Dataset statistics on the MNIST data. The advantage of using an im- We then compared the 5 feature selection methods on a age dataset like MNIST is that we can visualize the selected gene selection task. The dataset used in this experiment was pixels. We expect useful/dependent pixels to appear in the the TCGA breast cancer microarray dataset from the UCSC center part of the image. Some descriptions of the MNIST Xena database [94]. This data contains expression levels of data are listed in TABLE 3. 17278 genes from 506 patients and each patient has a breast The 5 test statistics under comparisons are: Pearson R2, cancer subtype label (luminal A, luminal B, HER2-enriched, dCovn, dCorn, gCovn, and gCorn. For each method, the κ κ κ κ or basal-like). PAM50 is a gene signature consisting of 50 top k features were selected by ranking the test statistics in genes derived from microarray data and is considered as descending order. 
Then the selected feature set was used to the gold-standard for breast cancer subtype prognosis and train the same classifier and the test accuracies were com- prediction [95]. In this experiment, we randomly hold-out pared. The classifier used was a random forest consisting 20% as test data, used each method to select top k genes, of 100 trees. We used the training and test set provided by evaluate the classification performance and compare the [93] for training and testing. However, due to the size of selected genes with the PAM50 gene signature. Because the training set, using all training samples to calculate Gini this dataset has a relatively small sample size, all training and distance statistics is too time-consuming. Therefore, we examples were used to calculate test statistics and train the randomly selected 5000 samples from the training set to classifier, and we repeated each classification test 30 times. calculate test statistics for all methods. For Gini and distance Other experiment setups were kept the same as previous. statistics, each feature was standardized by subtracting the The results are shown in Fig. 4. Fig. 4(a) shows the mean and dividing by the standard deviation. The kernel classification performance using selected top k genes using parameter σ2 was set to be 10. Due to the different test statistics as well as using all PAM50 genes involved in training random forest, each experiment was (shown as a green dotted line). Note that the accuracy of repeated 10 times and the average test accuracy was used PAM50 is one averaged value from 30 runs. We plot it for comparison. as a line across the entire k range for easier comparison The results of the MNIST data are summarized in Fig. 3. with other methods. Among the 5 selection methods under Fig. 3(a) shows the test accuracies using the top k features comparison, gCovn has the best overall performance, even selected by different methods. Fig. 3(b) shows the test statis- κ outperforms PAM50 with k = 30. This suggests that gCovn tics in descending order. Fig. 3(c) shows the visualization κ is able to select a smaller number of genes and the prediction of the selected pixels as white. From Fig. 3(a) we can see is better than the . We also observe that gCorn the clear increasing trend in accuracy as more features are κ outperforms PAM50 with k = 40 and k = 50. Pearson R2, selected, as expected. Among all methods, Pearson R2 per- dCovn, and gCorn are not able to exceed PAM50 within forms the poorest. This is because it is unable to select some κ κ 50 genes. Fig. 4(b) shows the number of PAM50 genes of the pixels in the center part of the image as dependent appear in the top k selected genes for each selection method. features even with k = 600, as shown in Fig. 3(c). The It is obvious to see that Pearson R2 selects the smallest discrepancy in accuracy between Pearson R2 and the other number of PAM50. Among the top 2000 genes selected by four test statistics comes from the ability of characterizing Pearson R2, only half of the PAM50 gene are included. non-linear dependence. The other four methods behave Gini statistics are able to select more PAM50 genes than similar except that gCorn has significant higher accuracy κ distance statistics as k increases. The small ratio of PAM50 with k = 10 and lower accuracy with k = 50 than the other included in the selected genes by any method is because of three (dCovn, dCorn, and gCovn). Specifically, we expect κ κ κ the high correlation between genes. 
PAM50 was derived by dCovn and gCovn to behave very similarly because MNIST κ κ not only selecting most subtype dependent genes, but also is a balanced dataset. Under the following two scenarios less mutually dependent genes to obtain a smaller set of dCov (X,Y ) and gCov (X,Y ) will give the same ranking κ κ genes for the same prediction accuracy. Even though any of of the the features because their ratio is a constant (Remark the selection methods under comparison does not take the 2.8 & 2.9 of [19]): n feature-feature dependence into consideration, both gCovκ n 1) When the data is balanced, i.e., p1 = p2 = ... = and gCorκ are able to select a better gene set than PAM50 1 1 pK = K , dCovκ(X,Y ) = K gCovκ(X,Y ); for classification. IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. , NO. , 2019 12 TABLE 2: Power (α = 0.05) and AUC.

TABLE 2: Power (α = 0.05) and AUC.

                          Power                                       AUC
                 dCov^n_κ  dCor^n_κ  gCov^n_κ  gCor^n_κ     dCov^n_κ  dCor^n_κ  gCov^n_κ  gCor^n_κ
K = 3  Normal       0.977     0.976     0.984     0.979        0.994     0.993     0.995     0.994
       Exponential  0.709     0.705     0.730     0.715        0.887     0.890     0.894     0.895
       Gamma        0.971     0.972     0.979     0.976        0.991     0.992     0.993     0.993
K = 4  Normal       0.993     0.992     0.995     0.993        0.998     0.998     0.999     0.998
       Exponential  0.779     0.769     0.799     0.781        0.922     0.922     0.928     0.927
       Gamma        0.992     0.991     0.994     0.993        0.998     0.998     0.998     0.998
K = 5  Normal       0.997     0.996     0.999     0.998        0.999     0.999     1.000     0.999
       Exponential  0.821     0.809     0.839     0.818        0.940     0.939     0.947     0.944
       Gamma        0.997     0.997     0.998     0.998        0.999     0.999     1.000     0.999
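TABLE 2 reports empirical power at the 0.05 level and AUC. As an illustration only (this is a generic recipe with our own names, not necessarily the authors' calibration procedure; in particular the quantile-based rejection threshold is an assumption), both quantities can be estimated from statistic values simulated under independence (H0) and under dependence (H1):

import numpy as np
from sklearn.metrics import roc_auc_score

def empirical_power_and_auc(stats_h0, stats_h1, alpha=0.05):
    # Empirical power at level alpha and AUC from simulated statistic values.
    stats_h0 = np.asarray(stats_h0, dtype=float)
    stats_h1 = np.asarray(stats_h1, dtype=float)
    threshold = np.quantile(stats_h0, 1.0 - alpha)            # empirical critical value under H0
    power = float(np.mean(stats_h1 > threshold))              # fraction of H1 runs rejected
    labels = np.r_[np.zeros(len(stats_h0)), np.ones(len(stats_h1))]
    auc = float(roc_auc_score(labels, np.r_[stats_h0, stats_h1]))  # separation of H0 vs H1 values
    return power, auc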

5.4 UCI Datasets

We further tested the 5 feature selection methods on a total of 15 UCI datasets of classification tasks using numerical features. The 15 datasets cover a wide range of both sample size and feature set size. Specifically, we avoided binary-class and balanced datasets because dCov^n_κ and gCov^n_κ give the same ranking on these datasets when sufficient training samples are given. For datasets without training and test sets provided, we randomly held out 20% as the test set. The descriptions of the datasets used are summarized in TABLE 3. For the MNIST and UJIndoorLoc datasets, we randomly selected 5000 samples from the training set to calculate test statistics. For the remaining datasets, all training samples were used. For each dataset, each method was used to select the top k features for training, with three different values of k. Each classification test was repeated 10 times. Other experimental setups were kept the same as previously described.

TABLE 3: Data Sets Summary

Data Set            Train Size   Test Size   Features   Classes
MNIST                    60000       10000        784        10
Breast Cancer              413         103      17278         5
Gene Expression            641         160      20531         5
Gastro Lesions              61          15       1396         3
Satellite                 4435        2000         36         6
Ecoli                      269          67          7         8
Glass                      171          43          9         6
Urban Land                 168         507        147         9
Wine                       142          36         13         3
Anuran Calls              5756        1439         22         4
Breast Tissue               85          21          9         4
Cardiotocography          1701         425         21        10
Leaf                       272          68         14        30
Mice Protein Exp           864         216         77         8
HAR                       4252        1492        561         6
UJIndoorLoc              19937        1111        520        13
Forest Types               198         325         27         4

As we do not have the ground truth of the dependent features for most of these datasets, only classification accuracy was used for evaluation. The average test accuracies (from 10 runs) of the 5 methods under comparison with different values of k on the 15 UCI datasets are summarized in TABLE 4. Of the 5 methods, the highest accuracy is shown in bold and the second highest one is underlined. The top-1 statistic counts how many times a method is shown in bold, and the top-2 statistic counts the total number of times a method is shown in bold or underlined. Among all methods, gCor^n_κ appears 18 times as top 1 and 28 times in top 2, outperforming all other methods. The second best method is dCov^n_κ, followed by gCov^n_κ and then dCor^n_κ. Pearson R² is still the worst, with only 8 times as top 1 and 14 times in top 2. Comparing TABLE 4 and Fig. 3(b), we see no correlation between the sensitivity and the ranking performance of the test statistics. The ranking performance is data dependent, and gCor^n_κ performs better on average.

6 CONCLUSIONS

We have proposed a feature selection framework based on a new dependence measure between a numerical feature X and a categorical label Y using generalized Gini distance statistics: the Gini distance covariance gCov(X,Y) and the Gini distance correlation gCor(X,Y). We presented estimators of gCov(X,Y) and gCor(X,Y) using n i.i.d. samples, i.e., gCov^n_κ and gCor^n_κ, and derived uniform convergence bounds. We showed that gCov^n_κ converges faster than its distance statistic counterpart dCov^n_κ, and that the probability of gCov^n_κ under-performing dCov^n_κ in Type II error decreases to 0 exponentially as the sample size increases. gCov^n_κ and gCor^n_κ are also simpler to calculate than dCov^n_κ and dCor^n_κ. Extensive experiments were performed to compare gCov^n_κ and gCor^n_κ with other dependence measures in feature selection tasks using artificial and real datasets, including MNIST, breast cancer, and 15 UCI datasets. For simulated datasets, gCov^n_κ and gCor^n_κ perform better in terms of power and AUC. For real datasets, on average, gCov^n_κ and gCor^n_κ are able to select more meaningful features and achieve better classification performance.

The proposed feature selection method using generalized Gini distance statistics has several limitations:
• Choosing an optimal σ is still an open problem. In our experiments we used σ² = 10 after data standardization.
• The computation cost of gCov^n_κ and gCor^n_κ is O(n²), which is the same as dCov^n_κ and dCor^n_κ, but more expensive than Pearson R², which takes linear time. For large datasets, sampling of the data is desired; a minimal sketch of one such O(n²) estimator follows this list.
• Features selected by Gini distance statistics, as well as by other dependence measure based methods, can be redundant; hence a subsequent feature elimination step may be needed for the sake of feature subset selection.
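To make the O(n²) cost and the role of σ² concrete, the following Python sketch implements one plausible plug-in estimator of gCov_κ and gCor_κ for a single numerical feature, based on a between/within decomposition of the form gCov = Δ − Σ_k p_k Δ_k, where Δ is the Gini mean difference of the kernel-induced distances and Δ_k its within-class counterpart. The decomposition, the Gaussian-kernel parameterization, and the pair-averaging details are assumptions drawn from [19] and this paper's definitions, and may differ from the authors' exact estimator; the function names are ours.

import numpy as np
from scipy.spatial.distance import pdist, squareform

def rkhs_distance_matrix(X, sigma2=10.0):
    # Distance induced by a Gaussian kernel k(x, x') = exp(-||x - x'||^2 / (2 * sigma2));
    # the paper's exact kernel parameterization may differ, sigma2 = 10 follows its experiments.
    sq = squareform(pdist(X, metric="sqeuclidean"))
    return np.sqrt(np.maximum(2.0 - 2.0 * np.exp(-sq / (2.0 * sigma2)), 0.0))

def gini_distance_stats(x, y, sigma2=10.0):
    # Plug-in estimates of gCov_kappa(X, Y) and gCor_kappa(X, Y) for one feature x and label y.
    x = np.asarray(x, dtype=float).reshape(-1, 1)
    y = np.asarray(y)
    D = rkhs_distance_matrix(x, sigma2)              # O(n^2) pairwise RKHS distances
    n = len(y)
    delta = D.sum() / (n * (n - 1))                  # overall Gini mean difference
    within = 0.0
    for label in np.unique(y):
        idx = np.where(y == label)[0]
        nk = len(idx)
        if nk > 1:
            Dk = D[np.ix_(idx, idx)]
            within += (nk / n) * Dk.sum() / (nk * (nk - 1))   # p_k-weighted within-class term
    gcov = delta - within
    gcor = gcov / delta if delta > 0 else 0.0
    return gcov, gcor

The quadratic cost comes entirely from the n × n distance matrix, which is why subsampling (e.g., the 5000 MNIST samples) is attractive for large n.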



Fig. 3: The MNIST dataset. (a) Test accuracy using the top k selected features. (b) Test statistics of features in descending order. (c) Visualization of the top k pixels selected. White: selected. Black: not selected. Test accuracy using the selected pixels is labeled on the top of each image.
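Given a feature ranking, the pixel maps in Fig. 3(c) amount to reshaping a top-k selection mask onto the 28 × 28 MNIST grid. A minimal sketch follows; plot_selected_pixels and its arguments are our own, hypothetical names, with the ranking assumed to come from one of the test statistics above.

import numpy as np
import matplotlib.pyplot as plt

def plot_selected_pixels(feature_order, k, side=28):
    # White = selected pixel, black = not selected, as in Fig. 3(c).
    mask = np.zeros(side * side, dtype=float)
    mask[feature_order[:k]] = 1.0                  # mark the top-k ranked pixels
    plt.imshow(mask.reshape(side, side), cmap="gray", vmin=0, vmax=1)
    plt.title(f"top {k} pixels")
    plt.axis("off")
    plt.show()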

ACKNOWLEDGMENTS

This work was supported by the Big Data Constellation of the University of Mississippi.

REFERENCES

[1] N. Armanfard, J. P. Reilly, and M. Komeili, “Local Feature Selection for Data Classification,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 6, pp. 1217–1227, 2016.
[2] N. Aronszajn, “Theory of Reproducing Kernels,” Transactions of the American Mathematical Society, vol. 68, no. 3, pp. 337-404, 1950.
[3] A. Barbu, Y. She, L. Ding, and G. Gramajo, “Feature Selection with Annealing for Computer Vision and Big Data Learning,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 2, pp. 272–286, 2017.
[4] L. Baringhaus and C. Franz, “On a New Multivariate Two-sample Test,” Journal of Multivariate Analysis, vol. 88, no. 1, pp. 190–206, 2004.
[5] R. Battiti, “Using Mutual Information for Selecting Features in Supervised Neural Net Learning,” IEEE Transactions on Neural Networks, vol. 5, no. 4, pp. 537–550, 1994.
[6] M. Belkin and P. Niyogi, “Laplacian Eigenmaps and Spectral Techniques for Embedding and Clustering,” In Advances in Neural Information Processing Systems, vol. 14, pp. 585-591, 2002.
[7] K. Benabdeslem and M. Hindawi, “Efficient Semi-Supervised Feature Selection: Constraint, Relevance, and Redundancy,” IEEE Transactions on Knowledge and Data Engineering, vol. 26, no. 5, pp. 1131–1143, 2014.
[8] J. R. Berrendero, A. Cuevas, and J. L. Torrecilla, “Variable Selection in Functional Data Classification: A Maxima-Hunting Proposal,” Statistica Sinica, vol. 26, no. 2, pp. 619–638, 2016.


Fig. 4: The breast cancer dataset. (a) Test accuracy using the top k selected genes. (b) Number of PAM50 genes in the selected top k genes.
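The counts in Fig. 4(b) reduce to the overlap between each method's top-k genes and the PAM50 signature; a short sketch, where pam50_overlap is a hypothetical helper and the gene identifiers are assumed to be given as matching name lists:

def pam50_overlap(ranked_genes, pam50_genes, k):
    # Number of PAM50 genes among the top-k genes selected by a method.
    return len(set(ranked_genes[:k]) & set(pam50_genes))

# e.g., counts = {method: pam50_overlap(ranking[method], pam50, k) for method in ranking}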

TABLE 4: Classification Accuracies Using Top k Dependent Features.

                 Gene Expression           Gastrointestinal Lesions   Satellite                  Ecoli
k                5       7       10        10      100     200        5       10      15         2       3       4
Pearson R²       0.909   0.917   0.934     0.747   0.700   0.660      0.598   0.830   0.869      0.624   0.772   0.776
dCov^n_κ         0.889   0.907   0.963     0.700   0.607   0.673      0.627   0.860   0.896      0.712   0.766   0.788
dCor^n_κ         0.851   0.928   0.937     0.787   0.633   0.680      0.808   0.887   0.903      0.713   0.764   0.787
gCov^n_κ         0.923   0.933   0.948     0.560   0.647   0.693      0.834   0.870   0.881      0.718   0.766   0.781
gCor^n_κ         0.741   0.980   0.972     0.607   0.700   0.653      0.838   0.873   0.898      0.634   0.758   0.803

                 Glass                     Urban Land Cover           Wine                       Anuran Calls
k                3       5       7         30      60      90         2       4       6          5       10      15
Pearson R²       0.702   0.730   0.707     0.764   0.778   0.798      0.753   0.969   0.972      0.927   0.957   0.975
dCov^n_κ         0.702   0.691   0.744     0.786   0.795   0.809      0.897   0.997   1.000      0.938   0.957   0.979
dCor^n_κ         0.707   0.672   0.751     0.781   0.817   0.809      0.881   0.992   0.997      0.937   0.956   0.978
gCov^n_κ         0.705   0.674   0.663     0.784   0.790   0.804      0.881   0.997   0.997      0.938   0.956   0.979
gCor^n_κ         0.693   0.665   0.670     0.787   0.804   0.805      0.900   0.997   1.000      0.937   0.958   0.980

                 Breast Tissue             Cardiotocography           Leaf                       Mice Protein Expression
k                3       5       7         5       10      15         4       7       10         10      20      30
Pearson R²       0.810   0.857   0.833     0.831   0.890   0.894      0.437   0.637   0.647      0.978   0.942   0.980
dCov^n_κ         0.819   0.857   0.852     0.773   0.874   0.891      0.471   0.690   0.663      0.893   0.950   0.985
dCor^n_κ         0.810   0.857   0.838     0.816   0.849   0.894      0.506   0.606   0.704      0.890   0.952   0.977
gCov^n_κ         0.810   0.857   0.852     0.817   0.878   0.895      0.468   0.688   0.654      0.888   0.945   0.983
gCor^n_κ         0.810   0.857   0.857     0.766   0.880   0.887      0.503   0.594   0.706      0.888   0.943   0.982

                 HAR                       UJIndoorLoc                Forest Types               Top 1     Top 2
k                100     200     300       100     200     300        5       10      15         (times)   (times)
Pearson R²       0.756   0.780   0.856     0.695   0.811   0.847      0.750   0.758   0.795      8         14
dCov^n_κ         0.765   0.779   0.859     0.768   0.866   0.873      0.653   0.810   0.804      11        25
dCor^n_κ         0.767   0.780   0.853     0.793   0.863   0.871      0.715   0.756   0.804      10        23
gCov^n_κ         0.766   0.853   0.858     0.763   0.864   0.872      0.651   0.817   0.805      10        25
gCor^n_κ         0.821   0.853   0.853     0.797   0.864   0.872      0.695   0.735   0.805      18        28
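The top-1 and top-2 tallies in the last two columns of TABLE 4 can be reproduced mechanically from the accuracies. A minimal sketch follows; the nested-dictionary layout and function name are our own assumptions, and ties are not handled specially.

def top1_top2_counts(acc):
    # acc[setting][method] -> accuracy; returns {method: (times_best, times_best_or_second)}.
    methods = sorted({m for setting in acc.values() for m in setting})
    counts = {m: [0, 0] for m in methods}
    for setting in acc.values():
        ranked = sorted(setting, key=setting.get, reverse=True)   # methods by accuracy, best first
        counts[ranked[0]][0] += 1                                 # top 1
        for m in ranked[:2]:                                      # top 2 (best or second best)
            counts[m][1] += 1
    return {m: tuple(c) for m, c in counts.items()}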

[9] A. L. Blum and P. Langley, “Selection of Relevant Features and Examples in Machine Learning,” Artificial Intelligence, vol. 97, nos. 1-2, pp. 245–271, 1997.
[10] L. Breiman, “Random Forests,” Machine Learning, vol. 45, no. 1, pp. 5–32, 2001.
[11] M. Bressan and J. Vitrià, “On the Selection and Classification of Independent Features,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 25, no. 10, pp. 1312–1317, 2003.
[12] S. Canu, “Functional Learning Through Kernels,” Advances in Learning Theory: Methods, Models and Applications, J. Suykens, G. Horvath, S. Basu, C. Micchelli, and J. Vandewalle (Eds.), pp. 89–110, 2003.
[13] R. Chakraborty and N. R. Pal, “Feature Selection Using a Neural Framework With Controlled Redundancy,” IEEE Transactions on Neural Networks and Learning Systems, vol. 26, no. 1, pp. 35–50, 2015.
[14] Q. Cheng, H. Zhou, and J. Cheng, “The Fisher-Markov Selector: Fast Selecting Maximally Separable Feature Subset for Multiclass Classification with Applications to High-Dimensional Data,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 6, pp. 1217–1233, 2011.
[15] T. W. S. Chow and D. Huang, “Estimating Optimal Feature Subsets Using Efficient Estimation of High-Dimensional Mutual Information,” IEEE Transactions on Neural Networks, vol. 16, no. 1, pp. 213–224, 2005.
[16] P. Comon, “Independent Component Analysis, A New Concept?” Signal Processing, vol. 36, no. 3, pp. 287–314, 1994.
[17] C. Constantinopoulos, M. K. Titsias, and A. Likas, “Bayesian Feature and Model Selection for Gaussian Mixture Models,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 6, pp. 1013–1018, 2006.
[18] B. B. Damodaran, N. Courty, and S. Lefèvre, “Sparse Hilbert Schmidt Independence Criterion and Surrogate-Kernel-Based Feature Selection for Hyperspectral Image Classification,” IEEE Transactions on Geoscience and Remote Sensing, vol. 55, no. 4, pp. 2385–2398, 2017.
[19] X. Dang, D. Nguyen, Y. Chen, and J. Zhang, “A New Gini Correlation between Quantitative and Qualitative Variables,” arXiv:1809.09793, 2018.
[20] P. Demartines and J. Hérault, “Curvilinear Component Analysis: A Self-organizing Neural Network for Nonlinear Mapping of Data Sets,” IEEE Transactions on Neural Networks, vol. 8, no. 1, pp. 148-154, 1997.
[21] A. A. Ding, J. G. Dy, Y. Li, and Y. Chang, “A Robust-Equitable Measure for Feature Ranking and Selection,” Journal of Machine Learning Research, vol. 18, pp. 1-46, 2017.
[22] E. Edgington and P. Onghena, Randomization Tests, 4th edition, Chapman and Hall/CRC, 2007.
[23] J. Fan and J. Lv, “Sure Independence Screening for Ultrahigh Dimensional Feature Space,” Journal of the Royal Statistical Society, B, vol. 70, no. 5, pp. 849–911, 2008.
[24] J. Fan, R. Samworth, and Y. Wu, “Ultrahigh Dimensional Feature Selection: Beyond the Linear Model,” Journal of Machine Learning Research, vol. 10, pp. 2013–2038, 2009.
[25] C. Gini, Variabilità e Mutabilità: Contributo Allo Studio Delle Distribuzioni e Relazioni Statistiche, Vol. III (part II), Bologna: Cuppini, 1912.
[26] C. Gini, “Sulla misura della concentrazione e della variabilità dei caratteri,” Atti del Reale Istituto Veneto di Scienze, Lettere ed Arti, 62, 1203-1248, 1914. English translation: “On the measurement of concentration and variability of characters,” Metron, LXIII(1), 3-38, 2005.
[27] J. Gui, Z. Sun, S. Ji, D. Tao, and T. Tan, “Feature Selection Based on Structured Sparsity: A Comprehensive Study,” IEEE Transactions on Neural Networks and Learning Systems, vol. 28, no. 7, pp. 1490–1507, 2017.
[28] I. Guyon and A. Elisseeff, “An Introduction to Variable and Feature Selection,” Journal of Machine Learning Research, vol. 3, pp. 1157–1182, 2003.
[29] X. He, M. Ji, C. Zhang, and H. Bao, “A Variance Minimization Criterion to Feature Selection Using Laplacian Regularization,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 10, pp. 2013–2025, 2011.
[30] G. E. Hinton and S. T. Roweis, “Stochastic Neighbor Embedding,” In Advances in Neural Information Processing Systems, vol. 15, pp. 833-840, 2002.
[31] B. Hjørland, “The Foundation of the Concept of Relevance,” Journal of the American Society for Information Science and Technology, vol. 61, no. 2, pp. 217–237, 2010.
[32] H. Hotelling, “Analysis of a Complex of Statistical Variables into Principal Components,” Journal of Educational Psychology, vol. 24, no. 6, pp. 417–441, 1933.
[33] W. Hoeffding, “A Class of Statistics With Asymptotically Normal Distribution,” Annals of Mathematical Statistics, vol. 19, pp. 293–325, 1948.
[34] X. Huo and G. Székely, “Fast Computing for Distance Covariance,” Technometrics, vol. 58, no. 4, pp. 435–447, 2016.
[35] F. J. Iannarilli Jr. and P. A. Rubin, “Feature Selection for Multiclass Discrimination via Mixed-Integer Linear Programming,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 25, no. 6, pp. 779–783, 2003.
[36] K. Javed, H. A. Babri, and M. Saeed, “Feature Selection Based on Class-Dependent Densities for High-Dimensional Binary Data,” IEEE Transactions on Knowledge and Data Engineering, vol. 24, no. 3, pp. 465–477, 2012.
[37] G. H. John, R. Kohavi, and K. Pfleger, “Irrelevant Features and the Subset Selection Problem,” In Proceedings of the Eleventh International Conference on Machine Learning, pp. 121-129, 1994.
[38] R. Kohavi and G. H. John, “Wrappers for Feature Subset Selection,” Artificial Intelligence, vol. 97, nos. 1-2, pp. 273-324, 1997.
[39] G. Koshevoy and K. Mosler, “Multivariate Gini Indices,” Journal of Multivariate Analysis, vol. 60, no. 2, pp. 252–276, 1997.
[40] B. Krishnapuram, A. J. Hartemink, L. Carin, and M. A. T. Figueiredo, “A Bayesian Approach to Joint Feature Selection and Classifier Design,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 26, no. 9, pp. 1105–1111, 2004.
[41] N. Kwak and C. Choi, “Input Feature Selection for Classification Problems,” IEEE Transactions on Neural Networks, vol. 13, no. 1, pp. 143–159, 2002.
[42] N. Kwak and C. Choi, “Input Feature Selection by Mutual Information Based on Parzen Window,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 12, pp. 1667–1671, 2002.
[43] D. D. Lee and H. S. Seung, “Learning the Parts of Objects by Non-negative Matrix Factorization,” Nature, vol. 401, pp. 788–791, 1999.
[44] L. Lefakis and F. Fleuret, “Jointly Informative Feature Selection Made Tractable by Gaussian Modeling,” Journal of Machine Learning Research, vol. 17, pp. 1-39, 2016.
[45] R. Li, W. Zhong, and L. Zhu, “Feature Screening via Distance Correlation Learning,” Journal of the American Statistical Association, vol. 107, no. 499, pp. 1129–1139, 2012.
[46] H. Liu and L. Yu, “Toward Integrating Feature Selection Algorithms for Classification and Clustering,” IEEE Transactions on Knowledge and Data Engineering, vol. 17, no. 4, pp. 491–502, 2005.
[47] R. Lyons, “Distance Covariance in Metric Spaces,” The Annals of Probability, vol. 41, no. 5, pp. 3284–3305, 2013.
[48] P. Maji and S. K. Pal, “Feature Selection Using f-Information Measures in Fuzzy Approximation Spaces,” IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 6, pp. 854–867, 2010.
[49] Q. Mao and I. W. Tsang, “A Feature Selection Method for Multivariate Performance Measures,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 9, pp. 2051–2063, 2013.
[50] J. Mercer, “Functions of Positive and Negative Type, and Their Connection with the Theory of Integral Equations,” Philosophical Transactions of The Royal Society A, vol. 209, pp. 415-446, 1909.
[51] P. Mitra, C. A. Murthy, and K. Pal, “Unsupervised Feature Selection Using Feature Similarity,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 3, pp. 301–312, 2002.
[52] T. Naghibi, S. Hoffmann, and B. Pfister, “A Semidefinite Programming Based Search Strategy for Feature Selection with Mutual Information Measure,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 37, no. 8, pp. 1529–1541, 2015.
[53] R. Nilsson, J. M. Peña, J. Björkegren, and J. Tegnér, “Consistent Feature Selection for Pattern Recognition in Polynomial Time,” Journal of Machine Learning Research, vol. 8, pp. 589–612, 2007.
[54] J. Novovičová, P. Pudil, and J. Kittler, “Divergence Based Feature Selection for Multimodal Class Densities,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 18, no. 2, pp. 218–223, 1996.
[55] K. Pearson, “Notes on Regression and Inheritance in the Case of Two Parents,” Proceedings of the Royal Society of London, vol. 58, pp. 240-242, 1895.
[56] H. Peng, F. Long, and C. Ding, “Feature Selection Based on Mutual Information: Criteria of Max-Dependency, Max-Relevance, and Min-Redundancy,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 8, pp. 1226–1238, 2005.
[57] S. Ren, S. Huang, J. Ye, and X. Qian, “Safe Feature Screening for Generalized LASSO,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.
[58] S. T. Roweis and L. L. Saul, “Locally Linear Embedding,” Science, vol. 290, pp. 2323-2326, 2000.
[59] D. Sejdinovic, B. Sriperumbudur, A. Gretton, and K. Fukumizu, “Equivalence of Distance-based and RKHS-based Statistics in Hypothesis Testing,” The Annals of Statistics, vol. 41, no. 5, pp. 2263–2291, 2013.
[60] R. Serfling, Approximation Theorems of Mathematical Statistics, John Wiley & Sons, New York, 1980.
[61] M. Shah, M. Marchand, and J. Corbeil, “Feature Selection with Conjunctions of Decision Stumps and Learning from Microarray Data,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, no. 1, pp. 174–186, 2012.
[62] V. Sindhwani, S. Rakshit, D. Deodhare, D. Erdogmus, J. C. Principe, and P. Niyogi, “Feature Selection in MLPs and SVMs Based on Maximum Output Information,” IEEE Transactions on Neural Networks, vol. 15, no. 4, pp. 937–948, 2004.
[63] E. Slutsky, “Über Stochastische Asymptoten und Grenzwerte,” Metron (in German), vol. 5, no. 3, pp. 3–89, 1925.
[64] A. J. Smola, A. Gretton, L. Song, and B. Schölkopf, “A Hilbert Space Embedding for Distributions,” In Proceedings of the Conference on Algorithmic Learning Theory (ALT), vol. 4754, pp. 13-31, 2007.
[65] P. Somol, P. Pudil, and J. Kittler, “Fast Branch & Bound Algorithms for Optimal Feature Selection,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 26, no. 7, pp. 900–912, 2004.
[66] L. Song, A. Smola, A. Gretton, J. Bedo, and K. Borgwardt, “Feature Selection via Dependence Maximization,” Journal of Machine Learning Research, vol. 13, pp. 1393–1434, 2012.

[67] H. Stoppiglia, G. Dreyfus, R. Dubois, and Y. Oussar, “Ranking a Random Feature for Variable and Feature Selection,” Journal of Machine Learning Research, vol. 3, pp. 1399–1414, 2003.
[68] Y. Sun, S. Todorovic, and S. Goodison, “Local-Learning-Based Feature Selection for High-Dimensional Data Analysis,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 9, pp. 1610–1626, 2010.
[69] G. J. Székely and M. L. Rizzo, “Testing for Equal Distributions in High Dimension,” InterStat, vol. 5, 2004.
[70] G. J. Székely and M. L. Rizzo, “A New Test for Multivariate Normality,” Journal of Multivariate Analysis, vol. 93, no. 1, pp. 58–80, 2005.
[71] G. J. Székely, M. L. Rizzo, and N. K. Bakirov, “Measuring and Testing Dependence by Correlation of Distances,” The Annals of Statistics, vol. 35, no. 6, pp. 2769–2794, 2007.
[72] G. J. Székely and M. L. Rizzo, “Brownian Distance Covariance,” Annals of Applied Statistics, vol. 3, no. 4, pp. 1233–1303, 2009.
[73] G. J. Székely and M. L. Rizzo, “Energy Statistics: A Class of Statistics Based on Distances,” Journal of Statistical Planning and Inference, vol. 143, no. 8, pp. 1249–1272, 2013.
[74] G. J. Székely and M. L. Rizzo, “Partial Distance Correlation with Methods for Dissimilarities,” The Annals of Statistics, vol. 42, no. 6, pp. 2382–2412, 2014.
[75] J. B. Tenenbaum, V. de Silva, and J. C. Langford, “A Global Geometric Framework for Nonlinear Dimensionality Reduction,” Science, vol. 290, pp. 2319-2323, 2000.
[76] W. S. Torgerson, “Multidimensional Scaling I: Theory and Method,” Psychometrika, vol. 17, no. 4, pp. 401-419, 1952.
[77] E. Tuv, A. Borisov, G. Runger, and K. Torkkola, “Feature Selection with Ensembles, Artificial Variables, and Redundancy Elimination,” Journal of Machine Learning Research, vol. 10, pp. 1341–1366, 2009.
[78] H. Wang, D. Bell, and F. Murtagh, “Axiomatic Approach to Feature Subset Selection Based on Relevance,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 21, no. 3, pp. 271–277, 1999.
[79] J. Wang, P. Zhao, S. C. H. Hoi, and R. Jin, “Online Feature Selection and Its Applications,” IEEE Transactions on Knowledge and Data Engineering, vol. 26, no. 3, pp. 698–710, 2014.
[80] K. Wang, R. He, L. Wang, W. Wang, and T. Tan, “Joint Feature Selection and Subspace Learning for Cross-Modal Retrieval,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 10, pp. 2010–2023, 2016.
[81] L. Wang, “Feature Selection with Kernel Class Separability,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 30, no. 9, pp. 1534–1546, 2008.
[82] L. Wang, N. Zhou, and F. Chu, “A General Wrapper Approach to Selection of Class-Dependent Features,” IEEE Transactions on Neural Networks, vol. 19, no. 7, pp. 1267–1278, 2008.
[83] H. Wei and S. A. Billings, “Feature Subset Selection and Ranking for Data Dimensionality Reduction,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, no. 1, pp. 162–166, 2007.
[84] X. Wu, K. Yu, W. Ding, H. Wang, and X. Zhu, “Online Feature Selection with Streaming Features,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 5, pp. 1178–1192, 2013.
[85] Z. J. Xiang, Y. Wang, and P. J. Ramadge, “Screening Tests for Lasso Problems,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 5, pp. 1008–1027, 2017.
[86] S. Yang and B. Hu, “Discriminative Feature Selection by Nonparametric Bayes Error Minimization,” IEEE Transactions on Knowledge and Data Engineering, vol. 24, no. 8, pp. 1422–1434, 2012.
[87] C. Yao, Y. Liu, B. Jang, J. Han, and J. Han, “LLE Score: A New Filter-Based Unsupervised Feature Selection Method Based on Nonlinear Manifold Embedding and Its Application to Image Recognition,” IEEE Transactions on Image Processing, vol. 26, no. 11, pp. 5257–5269, 2017.
[88] L. Yu and H. Liu, “Efficient Feature Selection via Analysis of Relevance and Redundancy,” Journal of Machine Learning Research, vol. 5, pp. 1205–1224, 2004.
[89] H. Zeng and Y. Cheung, “Feature Selection and Kernel Learning for Local Learning-Based Clustering,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 8, pp. 1532–1547, 2011.
[90] Z. Zhao, L. Wang, H. Liu, and J. Ye, “On Similarity Preserving Feature Selection,” IEEE Transactions on Knowledge and Data Engineering, vol. 25, no. 3, pp. 619–632, 2013.
[91] Y. Zhai, Y. Ong, and I. Tsang, “Making Trillion Correlations Feasible in Feature Grouping and Selection,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 12, pp. 2472–2486, 2016.
[92] L. Zhou, L. Wang, and C. Shen, “Feature Selection With Redundancy-Constrained Class Separability,” IEEE Transactions on Neural Networks, vol. 21, no. 5, pp. 853–858, 2010.
[93] Y. LeCun and C. Cortes, “MNIST Handwritten Digit Database,” 2010.
[94] M. Goldman, B. Craft, A. N. Brooks, J. Zhu, and D. Haussler, “The UCSC Xena Platform for Cancer Genomics Data Visualization and Interpretation,” bioRxiv, 2018.
[95] J. S. Parker, M. Mullins, M. C. U. Cheang, S. Leung, D. Voduc, T. Vickery, S. Davies, C. Fauron, X. He, Z. Hu, et al., “Supervised Risk Predictor of Breast Cancer Based on Intrinsic Subtypes,” Journal of Clinical Oncology, vol. 27, no. 8, p. 1160, 2009.

Silu Zhang received the BS degree in bioengineering from Zhejiang University, China, and the MS degree in chemical engineering from North Carolina State University, both in 2011. She is currently a PhD candidate in computer science at the University of Mississippi, working on machine learning research in feature selection and improving random forests, as well as applications of machine learning and statistical models to gene expression data.

Xin Dang received the BS degree in Applied Mathematics (1991) from Chongqing University, China, and the Master (2003) and PhD (2005) degrees in Statistics from the University of Texas at Dallas. Currently she is a professor in the Department of Mathematics at the University of Mississippi. Her research interests include robust statistics, statistical and numerical computing, and multivariate data analysis. In particular, she has focused on data depth and its applications, machine learning, and robust procedure computation. Dr. Dang is a member of the IMS, ASA, ICSA, and IEEE.

Dao Nguyen received the Bachelor of Computer Science degree (1997) from the University of Wollongong, Australia, and the Ph.D. in Electrical Engineering (2010) from the University of Science and Technology, Korea. In 2016, he received the Ph.D. degree in Statistics from The University of Michigan, Ann Arbor. He was a postdoctoral scholar at the University of California, Berkeley for more than a year before becoming an Assistant Professor of Mathematics at the University of Mississippi in 2017. His research interests include machine learning, dynamics modeling, stochastic optimization, and Bayesian analysis. Dr. Nguyen is a member of the ASA and ISBA.

Dawn Wilkins received the B.A. and M.A. degrees in Mathematical Systems from Sangamon State University (now University of Illinois–Springfield) in 1981 and 1983, respectively. After earning a Ph.D. in Computer Science from Vanderbilt University in 1995, she joined the faculty of the University of Mississippi, where she is now Professor of Computer and Information Science. Dr. Wilkins’ research interests are primarily in machine learning, computational biology, bioinformatics, and database systems.

Yixin Chen received the B.S. degree (1995) from the Department of Automation, Beijing Polytechnic University, the M.S. degree (1998) in control theory and application from Tsinghua University, and the M.S. (1999) and Ph.D. (2001) degrees in electrical engineering from the University of Wyoming. In 2003, he received the Ph.D. degree in computer science from The Pennsylvania State University. He had been an Assistant Professor of computer science at the University of New Orleans. He is now a Professor of Computer and Information Science at the University of Mississippi. His research interests include machine learning, data mining, computer vision, bioinformatics, and robotics and control. Dr. Chen is a member of the ACM, the IEEE, the IEEE Computer Society, and the IEEE Neural Networks Society.



SUPPLEMENTARY MATERIAL

This document presents the effect of the kernel parameter σ on the performance of the four test statistics under different values of K for three distribution families, as shown in Fig. S1. The performance of each method becomes less sensitive to σ as σ increases, and a larger σ is more favorable. The relative performance of the four methods is barely affected by the choice of σ.
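The sweep behind Fig. S1 can be organized as a loop over σ² values, re-estimating power and AUC at each value. The sketch below reuses the hypothetical gini_distance_stats and empirical_power_and_auc helpers introduced earlier and illustrates the protocol only, not the authors' simulation code; simulate_h0 and simulate_h1 are assumed to generate (x, y) samples under independence and under dependence, respectively.

def sigma_sweep(simulate_h0, simulate_h1, sigmas2, n_rep=500, alpha=0.05):
    # For each sigma^2, estimate power (at level alpha) and AUC of gCov_kappa.
    results = {}
    for s2 in sigmas2:
        stats_h0 = [gini_distance_stats(*simulate_h0(), sigma2=s2)[0] for _ in range(n_rep)]
        stats_h1 = [gini_distance_stats(*simulate_h1(), sigma2=s2)[0] for _ in range(n_rep)]
        results[s2] = empirical_power_and_auc(stats_h0, stats_h1, alpha)
    return results

# e.g., sigma_sweep(sim_h0, sim_h1, sigmas2=[0.1, 1, 10, 100])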

(a) K = 3

(b) K = 4

(c) K = 5

Fig. S1: Power of test statistics at 0.05 Type I error and AUC. Left: Normal distribution family, Middle: Exponential distribution family, Right: Gamma distribution family.