Two-sample Testing Using Deep Learning

Matthias Kirchler1,2 Shahryar Khorasani1 Marius Kloft2,3 Christoph Lippert1,4 1Hasso Plattner Institute for Digital Engineering, University of Potsdam, Germany 2Technical University of Kaiserslautern, Germany 3University of Southern California, Los Angeles, United States 4Hasso Plattner Institute for Digital Health at Mount Sinai, New York, United States

Abstract Two-sample tests are a pillar of applied and a standard method for analyzing empirical data in the We propose a two-sample testing procedure sciences, e.g., medicine, biology, psychology, and so- based on learned deep neural network repre- cial sciences. In machine learning, two-sample tests sentations. To this end, we define two test have been used to evaluate generative adversarial net- statistics that perform an asymptotic loca- works (Bińkowski et al., 2018), to test for covariate tion test on data samples mapped onto a shift in data (Zhou et al., 2016), and to infer causal hidden layer. The tests are consistent and relationships (Lopez-Paz and Oquab, 2016). asymptotically control the type-1 error rate. There are two main types of two-sample tests: paramet- Their test statistics can be evaluated in lin- ric and non-parametric ones. Parametric two-sample ear time (in the sample size). Suitable data tests, such as the Student’s t-test, make strong assump- representations are obtained in a data-driven tions on the distribution of the data (e.g. Gaussian). way, by solving a supervised or unsupervised This allows us to compute p-values in closed form. How- transfer-learning task on an auxiliary (poten- ever, parametric tests may fail when their assumptions tially distinct) data set. If no auxiliary data on the data distribution are invalid. Non-parametric is available, we split the data into two chunks: tests, on the other hand, make no distributional as- one for learning representations and one for sumptions and thus could potentially be applied in a computing the test . In wider of application scenarios. Computing non- on audio samples, natural images and three- parametric test statistics, however, can be costly as it dimensional neuroimaging data our tests yield may require applying re- schemes or comput- significant decreases in type-2 error rate (up ing higher-order statistics. to 35 percentage points) compared to state- of-the-art two-sample tests such as kernel- A non-parametric test that gained a lot of attention methods and classifier two-sample tests.∗ in the machine-learning community is the kernel two- sample test and its test statistic: the maximum discrepancy (MMD). MMD computes the average dis- 1 INTRODUCTION tance of the two samples mapped into the reproducing kernel Hilbert space (RKHS) of a universal kernel (e.g., For almost a century, statistical hypothesis testing Gaussian kernel). MMD critically relies on the choice has been one of the main methodologies in statistical of the feature representation (i.e., the kernel function) inference (Neyman and Pearson, 1933). A classic prob- and thus might fail for complex, structured data such lem is to validate whether two sets of observations are as sequences or images, and other data where deep drawn from the same distribution (null hypothesis) or learning excels. not (alternative hypothesis). This procedure is called Another non-parametric two-sample test is the classifier two-sample test. two-sample test (C2ST). C2ST splits the data into two ∗We provide code at https://github.com/mkirchler/ chunks, training a classifier on one part and evaluating deep-2-sample-test it on the remaining data. If the classifier predicts significantly better than chance, the test rejects the Proceedings of the 23rdInternational Conference on Artificial null hypothesis. Since a part of the data set needs to Intelligence and Statistics (AISTATS) 2020, Palermo, Italy. be put aside for training, not the full data set is used PMLR: Volume 108. Copyright 2020 by the author(s). for computing the test statistic, which limits the power Two-sample Testing Using Deep Learning

of the method. Furthermore, the performance of the a two-sample test on data drawn from p and q, i.e. method depends on the selection of the train-test split. we test the null hypothesis H0 : p = q against the alternative H : p 6= q. p0 and q0 are assumed to be in In this work, we propose a two-sample testing proce- 1 some sense similar to p and q, respectively, and act as dure that uses deep learning to obtain a suitable data auxiliary task for tuning the test (the case of p = p0 and representation. It first maps the data onto a hidden- q = q0 is perfectly valid, in which case this is equivalent layer of a deep neural network that was trained (in an to a data splitting technique). unsupervised or supervised fashion) on an independent, 0 auxiliary data set, and then it performs a location test. We have access to four (independent) sets Xn, Yn, Xn0 , 0 0 0 Thus we are able to work on any kind of data that neu- and Yn0 of observations drawn from p, q, p , and q , d ral networks can work on, such as audio, images, videos, respectively. Here Xn = {X1,...,Xn} ⊂ R and Xi ∼ 0 time-series, graphs, and natural language. We propose p for all i (analogue definitions hold for Yn, Xn0 , and 0 two test statistics that can be evaluated in linear time Yn0 ). Empirical averages with respect to a function f (in the number of observations), based on MMD and 1 Pn are denoted by f(Xn) := n i=1 f(Xi). Fisher discriminant analysis, respectively. We derive asymptotic distributions of both test statistics. Our We investigate function classes of deep ReLU networks theoretical analysis proves that the two-sample test with a final tanh activation function:  d H procedure asymptotically controls the type-1 error rate, TF N := tanh ◦WD−1 ◦ σ ◦ ... ◦ σ ◦ W1 : R → R has asymptotically vanishing type-2 error rate and is W ∈ RH×d,W ∈ RH×H for j = 2,...,D − 1, robust both with respect to transfer learning and ap- 1 j  proximate training. D−1 Y  ||W || ≤ β ,D ≤ D We empirically evaluate the proposed methodology in j F ro N N j=1  a variety of applications from the domains of computa- tional musicology, computer vision, and neuroimaging. Here, the activation functions tanh and σ(z) := In these experiments, the proposed deep two-sample ReLU(z) = max(0, z) are applied elementwise, || · ||F ro tests consistently outperform the closest competing is the Frobenius norm, H = d + 1 is the width and method (including deep kernel methods and C2STs) by DN and βN are depth and weight restrictions onto the up to 35 percentage points in terms of the type-2 error networks. This can be understood as the mapping onto rate, while properly controlling the type-1 error rate. the last hidden layer of a neural network concatenated with a tanh activation. 2 PROBLEM STATEMENT & NOTATION 3 DEEP TWO-SAMPLE TESTING In this section, we propose two-sample testing based on We consider non-parametric two-sample statistical test- two novel test statistics, the Deep Maximum Mean ing, that is, to answer the question whether two samples Discrepancy (DMMD) and the Deep Fisher Dis- are drawn from the same (unknown) distribution or not. criminant Analysis (DFDA). The test asymptoti- We distinguish between the case that the two samples cally controls the type-1 error rate, and it is consistent are drawn from the same distribution (the null hypoth- (i.e., the type-2 error rate converges to 0). Further- esis, denoted by H ) and the case that the samples 0 more, we will show that consistency is preserved under are drawn from different distributions (the alternative both transfer learning on a related task, as well as only hypothesis H ). 1 approximately solving the training step. We differentiate between type-1 errors (i.e,rejecting the null hypothesis although it holds) and type-2 errors (i.e., 3.1 Proposed Two-sample Test not rejecting H0 although it does not hold). We strive for both the type-1 error rate to be upper bounded by Our proposed test consists of the following two steps. some significance level α, and the type-2 error rate to 1. We train a neural network over an auxiliary training converge to 0 for unlimited data. The latter property is data set. 2. We then evaluate the maximum mean called consistency and that with sufficient data, discrepancy test statistic (Gretton et al., 2012a) (or the test can reliably distinguish between any pair of a variant of it) using as kernel the mapping from the probability distributions. input domain onto the network’s last hidden layer. 0 0 d Let p, q, p and q be probability distributions on R 3.1.1 Training Step with common dominating Borel measure µ. We abuse 0 0 0 notation somewhat and denote the densities with re- Let the training data be Xn0 and Ym0 . Denote N = n + spect to µ also by p, q, p0 and q0. We want to perform m0. We run a (potentially inexact) training algorithm Kirchler, Khorasani, Kloft, Lippert

to find φN ∈ T F N with: Interpretation as Empirical Risk Minimization 0 0 0 0  0 0  If we identify X with (Z , 1) and Y with (Z 0 , −1) n m i i i n +i 1 X 0 X 0 in a regression setting, this is equivalent to an (inexact)  φN (Xi) − φN (Yi ) + η N empirical risk minimization with L(t, tˆ) = i=1 i=1 1 − ttˆ:  0 0  n m 1 X 0 X 0 N N ≥ max  φ(Xi) − φ(Yi ) . 1 X 0 0 1 X 0 > 0 φ∈T F N N max tiφ(Zi) = max max tiw φ(Zi), i=1 i=1 φ N φ ||w||≤1 N i=1 i=1 Here, η ≥ 0 is a fixed leniency parameter (independent which is equivalent to of N); finding true global optima in neural networks is a hard problem, and an η > 0 allows us to settle N 0 > 1 X 0 > 0 with good-enough, local solutions. This procedure is min min RN (w φ) := L(ti, w φ(Zi)), (1) φ ||w||≤1 N also related to the early-stopping regularization tech- i=1 nique, which is commonly used in training deep neural 0 networks (Prechelt, 1998). where we denote by RN the empirical risk; the cor- responding expected risk is R0(f) = E[1 − t0f(Z0)]. 0 0 1 Assuming that Pr(t = 1) = Pr(t = −1) = 2 , we have 3.1.2 Test Statistic 0∗ 0 0 for the Bayes risk R = inff:Rd→[−1,1] R (f) = 1 −  We define the mean distance of the two test populations with 0 > 0 if and only if p0 6= q0. As long as p0 and q0 Xn, Ym measured on the hidden layer of a network φ are selected close enough to p and q, respectively, the as corresponding test will be able to distinguish between the two distributions. Dn,m(φ) := φ(Xn) − φ(Ym). Since we discard w after optimization and use the Using φN from the training step, we define the Deep norm of the hidden layer on the test set again, this Maximum Mean Discrepancy (DMMD) test statistic implies some fine-tuning on the test data, without as compromising the test statistic (see Theorem 3.1 below). nm 2 Sn,m(φN , Xn, Ym) := ||Dn,m(φN )|| . This property is especially helpful in neural networks, n + m since for practical transfer learning, only fine-tuning We can normalize this test statistic by the (inverse) the last layer can be extremely efficient, even if the empirical covariance matrix: transfer and actual task are relatively different (Lu nm et al., 2015). T (φ , X , Y ) := D (φ )>Σˆ −1 D (φ ). n,m N n m n + m n,m N n,m n,m N Relation to kernel-based tests The test statistic This leads to a test statistic (which we call Deep Fisher S is a special case of the standard squared Maximum Discriminant Analysis—DFDA) with an asymptotic n,m Mean Discrepancy (Gretton et al., 2012b) with the ker- distribution that is easier to evaluate. Note that the nel k(z , z ) := hφ(z ), φ(z )i (analogously for T empirical covariance matrix is defined as: 1 2 1 2 n,m and the Kernel FDA Test (Harchaoui et al., 2008)). Σˆ n,m := Σˆ n,m(φN ) := For a fixed feature map φ this kernel is not charac- m+n teristic, and hence the resulting test not necessarily 1 X (φ (Z ) − φ (Z))(φ (Z ) − φ (Z))> consistent for arbitrary distributions p, q. However, by n + m − 1 N i N N i N i=1 first choosing φ in a data-dependent way, we can still achieve consistency. + ρn,mI, where ρn,m > 0 is a factor guaranteeing numerical 3.2 Control of Type-1 Error stability and invertibility of the covariance matrix, and

Z = {Z1,...,Zm+n} = {X1,...,Xn,Y1,...,Ym}. Due to our choice of φN , there need not be a unique, well-defined limiting distribution for the test statistics 3.1.3 Discussion when n, m → ∞. Instead, we will show that for each fixed φ, the test statistic S has a well-defined lim- Intuitively, we map the data onto the last hidden layer n,m iting distribution that can be well evaluated. If in of the neural network and perform a multivariate loca- addition the covariance matrix is invertible, then the tion test on whether both map to the same location. same holds for Tn,m. If the distance Dn,m between the two means is too large, we reject the hypothesis that both samples are In particular, the following theorem will show that drawn from the same distribution. Consistency of this Dn,m(φ) converges towards a multivariate normal dis- procedure is guaranteed by the training step. tribution for n, m → ∞. Sn,m then is asymptotically Two-sample Testing Using Deep Learning

distributed like a weighted sum of χ2 variables, and estimating the null distribution via Monte-Carlo per- 2 Tn,m like a χH (again, if well-defined). mutation sampling (Ernst et al., 2004) is feasible. Note Theorem 3.1. Let p = q, φ ∈ T F and Σ := also that it suffices to evaluate the feature map φ on n each data point only once and then permute the class Cov(φ(X1)) and assume that n+m → r ∈ (0, 1) as n, m → ∞. labels, saving more time. In practice we found that the -based test (i) As n, m → ∞, it holds that performed considerably faster. Hence, in the remainder of this work, we will evaluate the null hypothesis of the r mn D (φ) →d N (0, Σ). DMMD via the resampling method. m + n n,m

Testing with Tn,m Since in many practical situa- (ii) As n, m → ∞, tions one wants to use standard neural network archi- tectures (such as ResNets), the number of neurons in H d X S (φ, X , Y ) → λ ξ2, the last hidden layer H may be rather large, compared n,m n m i i to n, m. Therefore, using the full, high-dimensional i=1 hidden layer representation might lead to suboptimal iid normal approximations. Instead, we propose to use where ξi ∼ N (0, 1) and λi are the eigenvalues of a principal component analysis on the feature repre- Σ. n+m sentation (φ(Zi))i=1 to reduce the dimensionality to (iii) If additionally Σ is invertible, and ρn,m ↓ 0 then Hˆ  m + n. In fact, this does not break the asymp- as n, m → ∞ totic theory derived in Theorem 3.1, even though the PCA is both trained and evaluated on the test data; d 2 Tn,m(φ, Xn, Ym) → χH . details can be found in Appendix C. Unfortunately,  1/4  the O H√ rate of convergence is not valid anymore, Sketch of proof (full proof in Appendix A.1). (i) As n due to the observations not being independent. We under H φ(X ) and φ(Y ) are identically distributed, 0 i j still need to grow Hˆ towards H with n, m in order D (φ) is centered and one can show the result using n,m for the consistency results in the next section to hold, a . q  however. Empirically we found Hˆ = min n+m ,H (ii) and (iii) then follow from the continuous map- 2 to perform well. ping theorem and properties of the multivariate normal 2 distribution. The cumulative distribution function of the χH distri- bution can be evaluated very efficiently. Although for Under some additional assumptions we can also use the DFDA it is also possible to estimate the null hy- a Berry-Esseen type of result to quantify the quality pothesis via a Monte Carlo permutation scheme, doing of the normal approximation of Dn,m(φN ) conditioned so is more costly than for the DMMD, since it involves on the training. In particular, if we assume that n = either a matrix inversion once or solving a linear system 0 0 for each permutation draw. Hence, in this work we m and Σ = Covp,q(φN (X1))|Xn, Yn invertible, then Bentkus (2005) shows that the normal approximation focus on using the asymptotic distribution.  1/4  H√ on convex sets is O n . Computing p-values for 3.3 Consistency both Sn,n and Tn,n only requires computation over convex sets, so the result is directly applicable. In this section we show that if (a), the restrictions βN ,DN on weights and depth of networks in TF N are 3.2.1 Computational Aspects carefully chosen, (b), the transfer task is not too far from the original task, and (c), the leniency parameter Testing with Sn,m As shown in Theorem 3.1, the η in the training step is small enough, then our pro- null distribution of Sn,m can be approximated as the weighted sum of independent χ2-variables. There are posed test is consistent, meaning the type-2 error rate several approaches to computing the cumulative distri- converges to 0. bution function of this distribution, see Bausch (2013) 0 0 n Theorem 3.2. Let p 6= q, n = n , m = m with m → 1, for an overview and Zhou and Guan (2018) for an im- N = n+m, R0∗ = 1−0 the Bayes error for the transfer plementation. However, computing p-values with this task with 0 > 0, and assume that the following holds: method can be rather costly. 2 βN DN Alternatively, note that the test statistic Sn,m is linear (i) N → 0, βN → ∞ and DN → ∞ for N → ∞ in the number of observations and dimensions. Hence, for the parameters of the function classes TF N , Kirchler, Khorasani, Kloft, Lippert

0 0 (ii) ||p − p ||L1(µ) + ||q − q ||L1(µ) ≤ 2δ, 4 RELATED WORK

(iii) 0 ≤ δ + η < 0, where η ≥ 0 is the leniency In this section, we give an overview over the state- parameter in training the network, and of-the-art in non-parametric two-sample testing for high-dimensional data. (iv) p0 and q0 have bounded support on Rd.† Kernel Methods The methods most related to our method are the kernelized maximum mean discrepancy Then, as N → ∞ both test test statistics (MMD) (Gretton et al., 2012a) and the kernel Fisher S (φ , X , Y ) and T (φ , X , Y ) diverge in n,m N n m n,m N n m discriminant analysis (KFDA) (Harchaoui et al., 2008). probability towards infinity, i.e. for any r > 0 Both methods effectively metricize the space of prob- ability distributions by mapping distribution features Pr (S(φN , Xn, Ym) > r) → 1 and onto mean embeddings in universal reproducing kernel Pr (T (φN , Xn, Ym) > r) → 1. Hilbert spaces (RKHS, (Steinwart and Christmann, 2008)). Test statistics derived from these mean embed- dings can be efficiently evaluated using the kernel trick Sketch of proof (full proof in Appendix A.2). The test (in quadratic time in the number of observations, al- statistics√ Sn,m is lower-bounded by a rescaled version though there are lower-powered linear-time variations). > of N(1 − Rn,m(ψN )), where ψN = wN φN with wN Mean Embeddings (ME) and Smoothed Characteristic selected as in (1). Then, if 1 − Rn,m(ψN ) ≥ c > 0, the Functions (SCF) (Chwialkowski et al., 2015; Jitkrittum test statistic diverges. et al., 2016) are kernel-based linear-time test statistics The finite-sample error R (ψ ) approaches its popu- that are (almost surely) proper metrics on the space n,m N of probability distributions. All four methods rely on lation version R(ψN ) for large n, m, and the difference 0 characteristic kernels to yield consistent tests and are between R(ψN ) and R (ψN ) can be controlled over δ. The rest of the proof is akin to standard consistency closely related. proofs in regression and classification. Namely, we can 0 0∗ Deep Kernel Methods In the context of train- split RN (ψN ) − R into approximation and estimation error and control these via a Universal Approximation ing and evaluating Generative Adversarial Networks Theorem (Hanin, 2017), and Rademacher complexity (GANs), several authors have investigated the use of bounds on the neural network function class (Golowich the MMD with kernels parametrized by deep neural et al., 2017), respectively. networks. In Bińkowski et al. (2018); Li et al. (2017); Arbel et al. (2018), the authors feed features extracted from deep neural networks into characteristic kernels. The main caveat of Theorem 3.2 is that it gives no Jitkrittum et al. (2018) use deep kernels in the con- explicit directions to choose the transfer task p0 and text of relative goodness-of-fit testing without directly 0 q . Whether the respective µ-densities are L1-close to considering consistency aspects of this approach. Ex- the testing densities in general cannot be answered, tensions from the GAN literature to two-sample testing and similarly the Bayes error rate 1 − 0 is not known is not straightforward since statistical consistency guar- beforehand. If abundant data for the testing task is at antees strongly depend on careful selection of the re- hand, then splitting the data is the safe way to go; if spective function classes. To the best of our knowledge, data is scarce, Theorem 3.2 gives justification that a all previous works made simplifying assumptions on reasonably close transfer task will have good power as injectivity or even invertibility of the involved networks. well. In this work we show that a linear kernel on top of The bounded support requirement (iv) on p0 and q0 can transfer-learned neural network feature maps (as has be circumvented as well – by choosing the support large also been done by Xu et al. (2018) for GAN evaluation) 0 0 is not only sufficient for consistency of the test, but also enough one can always just truncate (Xi) and (Yi ) and will still satisfy requirements (ii) and (iii), especially performs considerably better empirically in all settings also in the case of p0 = p and q0 = q with unbounded we analyzed. In addition to that, our test statistics can support. This procedure, however, requires knowledge be directly evaluated in linear instead of quadratic time of where to truncate the transfer distributions. Instead (in the sample size) and the corresponding asymptotic one can also grow the support of p0 and q0 with N; for null distributions can be exactly computed (in contrast more details, see Appendix B. to the MMD & KFDA).

†A similar Theorem holds also for the case of unbounded Classifier Two-Sample Tests (C2ST) First pro- support, see Appendix B posed by Friedman (2003) and then further analyzed by Two-sample Testing Using Deep Learning

Kim et al. (2016) and Lopez-Paz and Oquab (2016), the distances of all data points (MMD-med). The second idea of the C2ST is to utilize a generic classifier, such variant, reported by Gretton et al. (2012b), splits the as a neural network or a k-nearest neighbor approach data in two disjoint sets and selects the bandwidth for the two-sample testing problem. In particular, they that maximizes power on the first set and evaluates split the available data into training and test set, train the MMD on the second set (MMD-opt). We use the a classifier on the training set and evaluate whether the implementation provided by Jitkrittum et al. (2016), performance on the test set exceeds random variation. which estimates the null hypothesis via a Monte Carlo The main drawback of this approach is that the data permutation scheme (we again use M = 1000 permuta- has to be split in two chunks, creating a trade-off: if tions). the training set is too small, the classifier is unlikely For the Smoothed Characteristic Functions (SCF) to find a statistically relevant signal in the data; if the and Mean Embeddings (ME), we select the number training set is large and thus the test set small, the of test locations based on the task and sample size. The C2ST test loses power. locations are selected either randomly (as presented by Our method circumvents the need to split the data in Chwialkowski et al. (2015)) or optimized on half of the training and test set – Theorem 3.2 shows that training data via the procedure described by Jitkrittum et al. on a reasonably close transfer data set is sufficient. (2016). The kernel was either selected using the me- Even more, as shown in Section 3.1.3, our method dian heuristic, or via a grid search as by Chwialkowski can be interpreted as empirical risk minimization with et al. (2015); Jitkrittum et al. (2016). In each case we additional fine-tuning of the last layer on the testing report the kernel and location selection method that data, guaranteed to be as least as good as an equivalent performed best on the given task, with details given method with fixed last layer. in the corresponding paragraphs. Note that for very small sample sizes, both SCF and ME oftentimes do not control the type-1 error rate properly, since they 5 EXPERIMENTS were designed for larger sample sizes. This results in highly variable type-2 error rate for small m in the ex- In this section, we compare our proposed deep learning periments. Again, we use the implementation provided two-sample tests with other state-of-the-art approaches. by Jitkrittum et al. (2016). In addition to these published methods, we also com- 5.1 Experimental setup pare our method against a deep kernel MMD test For the DFDA and DMMD tests we train a deep neu- (k-DMMD), i.e. the MMD test where the output of a ral network on a related task; details will be deferred pretrained neural network gets fed into a Gaussian ker- to the corresponding sections. We report both the per- nel (instead of a linear kernel as in our case). Jitkrittum et al. (2018) used this method for relative goodness- formance of the deep MMD Sn,m where we estimate the null hypothesis via a Monte Carlo permutation of-fit testing instead of two-sample testing. For image sample (Ernst et al., 2004) (we fix M = 1000 resam- data, we select the bandwidth parameter for the Gaus- pling permutations except otherwise noted), and the sian kernel via the heuristic, and for audio data via the power maximization technique (in each case deep FDA statistic Tn,m, for which we use the asymp- 2 the other variant performs considerably worse); the totic χH distribution. As explained in Section 3.2.1, for the DFDA we project the last hidden layer onto pretrained networks are the same as for our tests and Hˆ < H dimensions using a PCA. We found the heuris- the C2ST. q ˆ m+n All experiments were run over 1000 runs. Type-1 error tic H := 2 to perform well across a number of tasks (disjoint from the ones presented in this section). rates are estimated by drawing both samples (without For the DMMD we do not need any dimensionality replacement) from the same class and computing the reduction. We calibrated parameters of both tests on rate of rejections. Similarly, type-2 error rates are esti- data disjoint from the ones that we report results on mated as the rate of not rejecting the null hypothesis in the subsequent sections. when sampling from two distinct classes. All figures of type-1 and type-2 error rates show the 95% confidence For the C2ST, we train a standard logistic regression interval based on a Wilson Score interval (and a “rule-of- on top of the pretrained features extracted from the three“ approximation in the case of 0-values (Eypasch same neural network as for our methods. et al., 1995)). In all settings we fixed the significance For the kernel MMD we report two kernel band- level at α = 0.05. In addition to that we show in Ap- width selection strategies for the Gaussian kernel. The pendix D.3 empirically that also for smaller significance first variant is the “median distance“ heuristic (Gretton levels high power can be preserved. Preprocessing for et al., 2012a) which selects the median of the euclidean image data is explained in Appendix D.2. Kirchler, Khorasani, Kloft, Lippert

0.10 1.0 DFDA-sup (ours) DMMD-sup (ours) 0.8 C2ST-sup k-DMMD-sup 0.6 MMD-med 0.05 MMD-opt 0.4 SCF ME

Type-1 Error Rate Type-2 Error Rate 0.2 DFDA-unsup (ours) DMMD-unsup (ours)

0.00 0.0 C2ST-unsup 0 100 200 300 400 500 0 100 200 300 400 500 m (per population) m (per population) k-DMMD-unsup (a) Type-1 error rate on AM audio data. (b) Type-2 error rate on AM audio data.

1.0 1.0 1.0

0.8 0.8 0.8

0.6 0.6 0.6

0.4 0.4 0.4

Type-2 Error Rate 0.2 Type-2 Error Rate 0.2 Type-2 Error Rate 0.2

0.0 0.0 0.0 50 100 150 200 50 100 150 200 50 100 150 200 m (per population) m (per population) m (per population) (c) Type-2 error rate on aircraft data. (d) Type-2 error rate on KDEF data. (e) Type-2 error rate on dogs data.

Figure 1: Results on AM audio (top row) and natural image (bottom row) data sets. Suffixes “-sup“ indicate supervised pretraining, “-unsup“ indicates unsupervised pretraining.

5.2 Control of Type-1 Error Rate Appendix D.6 for details. Figure 1b reports the results with varying number of Since the presented test procedures are not exact tests 2 it is important to verify that the type-1 error rate is observations under constant noise level σ = 1. Our controlled at the proper level. Figure 1a shows that method shows high power, even at low sample sizes, the empirical type-1 error rate is well controlled for the whereas kernel methods need large amounts of data amplitude modulated audio data introduced in the next to deal with the task. Note that these results are section. For the other data sets, results are provided consistent with the original results in Gretton et al. in Appendix D.4. (2012b), where the authors fixed the sample size at m = 10, 000 and consequently only used the (significantly less powerful) linear-time MMD test. 5.3 Power Analysis

Amplitude Modulated Audio Data Here we an- Aircraft We investigate the Fine-Grained Visual alyze the proposed test on the amplitude modulated Classification of Aircraft data set (Maji et al., 2013). audio example from (Gretton et al., 2012b). The task We select two visually similar aircraft families, namely in this setting is to distinguish snippets from two dif- Boeing 737 and Boeing 747 as populations p and q, ferent songs after they have been amplitude modulated respectively. The neural network embeddings are ex- (AM) and mixed with noise. We use the same pre- tracted from a ResNet-152 (He et al., 2016) trained on processing and amplitude modulation as Gretton et al. ILSVRC (Russakovsky et al., 2015). Figure 1c shows (2012b). We use the freely available music from Gra- that all neural network architectures perform consid- matik (2014); distribution p is sampled from track four, erably better than the kernel methods. Furthermore, distribution q from track five and the remaining tracks our proposed tests can also outperform both the C2ST on the album were used for training the network in a and the deep kernel MMD. multi-class classification setting. As our neural network architecture we use a simple convolutional network, a Facial Expressions The Karolinska Directed Emo- variant from Dai et al. (2017), called M5 therein; see tional Faces (KDEF) data set (Lundqvist et al., 1998) Two-sample Testing Using Deep Learning

Table 1: Results on neuroimaging data, comparing subjects who are cognitive normal (CN), have mild cog- nitive impairment (MCI) or have Alzheimer’s disease (AD). APOE has neutral variant ε3 and risk-factor variant ε4. Numbers in parentheses denote sample size.

X (# obs) Y (# obs) p-value CN (490) AD (314) 9.49 · 10−5 CN (490) MCI (287) 2.44 · 10−4 MCI (287) AD (314) 1.45 · 10−3 APOE ε3 (811) APOE ε4 (152) 1.40 · 10−2

Figure 2: Slices of 3D-MRI scans of an Alzheimer’s has been previously used by Jitkrittum et al. (2016); disease patient (A) and a cognitively normal individ- Lopez-Paz and Oquab (2016). The task is to distin- ual (B). Note the enlargement of the lateral ventricles guish between faces showing positive (happy, neutral, (indicated by red arrows) in the Alzheimer’s disease surprised) and negative (afraid, angry, disgusted) emo- patient. tions. The feature embeddings are again obtained from a ResNet-152 trained on ILSVRC. Results can be found in Figure 1d. Even though the images in ImageNet and KDEF are very different, the neural network tests test data without the need to perform a data split. again outperform the kernel methods. Also note that the apparent advantage of the mean embedding test Three-dimensional Neuroimaging Data In this for low sample sizes is due to an unreasonably high section, we apply the DFDA test procedure to 3D type-1 error rate (> 0.11 and > 0.085 at m = 10, 15, Magnetic Resonance Imaging (MRI) scans and genetic respectively). information from the Alzheimer’s Disease Neuroimag- ing Initiative (ADNI) (Mueller et al., 2005). To this Stanford Dogs Lastly, we evaluate our tests on the end, we transfer a 3D convolutional autoencoder that Stanford Dogs data set (Khosla et al., 2011), consisting has been trained on MRI scans from the Brain Ge- of 120 classes of different dog breeds. As test classes nomics Superstruct Project (Holmes et al., 2015) to we select the dog breeds ‘Irish wolfhound‘ and ‘Scot- perform statistical testing on the ADNI data. Details tish deerhound‘, two breeds that are visually extremely on preprocessing and network architecture are provided similar. Since the data set is a subset of the ILSVRC in Appendix D.10. data, we cannot train the networks on the whole Im- ageNet data again. Instead, we train a small 6-layer The ADNI dataset consists of individuals diagnosed convolutional neural network on the remaining 118 with Alzheimer’s Disease (AD), with Mild Cognitive classes in a multi-class classification setting and use Impairment (MCI), or as cognitively normal (CN); the embedding from the last hidden layer. To show Figure 2 shows exemplaric images of an AD and a that our tests can also work with unsupervised transfer- CN subject. Table 1 shows that our test can detect learning, we also train a convolutional autoencoder on statistically significant differences between MRI scans this data; the encoder part is identical to the super- of individuals with a different diagnosis. Additionally, vised CNN, see Appendix D.7 for details. Note that we evaluate whether our test can detect differences for this setting, the theoretical consistency guarantees between individuals who have a known genetic risk from Theorem 3.2 do not hold, although the type-1 factor for neurodegenerative diseases and individuals error rate is still asymptotically controlled. Figure 1e without that risk factor. In particular, we compare the reports the results, with *-sup denoting the supervised, two variants ε3 (the “normal” variant) and ε4 (the risk- and *-unsup the unsupervised transfer-learning task. factor variant) in the Apolipoprotein E (APOE) gene, As expected, tests based on the supervised embedding which is related to AD and other diseases (Corder et al., approach outperform other tests by a large margin. 1993). By grouping subjects according to which variant However, the unsupervised DMMD and DFDA still they exhibit we test for statistical dependence between outperform kernel-based tests. Interestingly, both the a (binary) genetic mutation and (continuous) variation C2ST and the k-DMMD method seem to suffer more in 3D MRI scans. Table 1 shows that individuals with severely from the mediocre feature embedding than our ε4 and ε3 APOE variants are significantly different, tests. One potential explanation for this phenomenon suggesting a statistical dependence between genetic is the ability of DMMD and DFDA to fine-tune on the variation and structural brain features. Kirchler, Khorasani, Kloft, Lippert

Acknowledgements Samples from the National Cell Repository for AD (NCRAD), which receives government support under a The authors thank Stefan Konigorski and Jesper Lund cooperative agreement grant (U24 AG21886) awarded for helpful discussions and comments. Marius Kloft by the National Institute on Aging (AIG), were used acknowledges support by the German Research Foun- in this study. Funding for the WGS was provided by dation (DFG) award KL 2698/2-1 and by the Federal the Alzheimer’s Association and the Brin Wojcicki Ministry of Science and Education (BMBF) awards Foundation. 031L0023A, 01IS18051A, and 031B0770E. Part of the work was done while Marius Kloft was a sabbatical References visitor of the DASH Center at the University of South- Michael Arbel, Dougal Sutherland, Mikołaj Bińkowski, ern California. This work has been funded by the and Arthur Gretton. On gradient regularizers for Federal Ministry of Education and Research (BMBF, mmd gans. In Advances in Neural Information Pro- Germany) in the project KI-LAB-ITSE (project num- cessing Systems, pages 6700–6710, 2018. ber 01|S19066). Johannes Bausch. On the efficient calculation of a Data used in the preparation of this article were linear combination of chi-square random variables obtained from the Alzheimer’s Disease Neuroimaging with an application in counting string vacua. Journal Initiative (ADNI) database adni.loni.usc.edu. of Physics A: Mathematical and Theoretical, 46(50): As such, the investigators within the ADNI con- 505202, 2013. tributed to the design and implementation of ADNI Vidmantas Bentkus. A lyapunov-type bound in rd. and/or provided data but did not participate in Theory of Probability & Its Applications, 49(2):311– analysis or writing of this report. A complete 323, 2005. listing of ADNI investigators can be found at: http://adni.loni.usc.edu/wp-content/uploads/ Mikołaj Bińkowski, Dougal J Sutherland, Michael Ar- how_to_apply/ADNI_Acknowledgement_List.pdf. bel, and Arthur Gretton. Demystifying mmd gans. and sharing of ADNI was funded by the arXiv preprint arXiv:1801.01401, 2018. Alzheimer’s Disease Neuroimaging Initiative (ADNI) Kacper P Chwialkowski, Aaditya Ramdas, Dino Sejdi- (National Institutes of Health Grant U01 AG024904) novic, and Arthur Gretton. Fast two-sample testing and DOD ADNI (Department of Defense award with analytic representations of probability measures. number W81XWH-12-2-0012). ADNI is funded by the In Advances in Neural Information Processing Sys- National Institute on Aging, the National Institute of tems, pages 1981–1989, 2015. Biomedical Imaging and Bioengineering, and through EH Corder, AM Saunders, WJ Strittmatter, generous contributions from the following: Alzheimer’s DE Schmechel, PC Gaskell, GW Small, AD Roses, Association; Alzheimer’s Drug Discovery Foundation; JL Haines, and MA Pericak-Vance. Gene dose BioClinica Inc; Biogen Idec Inc; Bristol-Myers Squibb of apolipoprotein e type 4 allele and the risk of Company; Eisai Inc; Elan Pharmaceuticals Inc; Eli alzheimer’s disease in late onset families. Science, Lilly and Company; F. Hoffmann-La Roche Ltd and 261(5):921–923, 1993. its affiliated company Genentech Inc; GE Healthcare; Innogenetics N.V.; IXICO Ltd; Janssen Alzheimer Wei Dai, Chia Dai, Shuhui Qu, Juncheng Li, and Samar- Immunotherapy Research & Development LLC; jit Das. Very deep convolutional neural networks for Johnson & Johnson Pharmaceutical Research & raw waveforms. In 2017 IEEE International Con- Development LLC; Medpace Inc; Merck & Co Inc; ference on Acoustics, Speech and Signal Processing Meso Scale Diagnostics LLC; NeuroRx Research; (ICASSP), pages 421–425. IEEE, 2017. Novartis Pharmaceuticals Corporation; Pfizer Inc; Luc Devroye, László Györfi, and Gábor Lugosi. A Piramal Imaging; Servier; Synarc Inc; and Takeda probabilistic theory of pattern recognition, volume 31. Pharmaceutical Company. The Canadian Institutes of Springer Science & Business Media, 2013. Health Research is providing funds to support ADNI Rick Durrett. Probability: theory and examples, vol- clinical sites in Canada. Private sector contributions ume 49. Cambridge university press, 2019. are facilitated by the Foundation for the National Institutes of Health (http://www.fnih.org). The Michael D Ernst et al. Permutation methods: a basis grantee organization is the Northern California for exact inference. Statistical Science, 19(4):676–685, Institute for Research and Education, and the study is 2004. coordinated by the Alzheimer’s Disease Cooperative Ernst Eypasch, Rolf Lefering, CK Kum, and Hans Study at the University of California, San Diego. Troidl. Probability of adverse events that have not ADNI data are disseminated by the Laboratory for yet occurred: a statistical reminder. Bmj, 311(7005): Neuro Imaging at the University of Southern California. 619–620, 1995. Two-sample Testing Using Deep Learning

Jerome Friedman. On multivariate goodness-of-fit and Aditya Khosla, Nityananda Jayadevaprakash, Bang- two-sample testing. In Statistical Problems in Parti- peng Yao, and Li Fei-Fei. Novel dataset for fine- cle Physics, Astrophysics, and Cosmology, page 311, grained image categorization. In First Workshop on 2003. Fine-Grained Visual Categorization, IEEE Confer- Noah Golowich, Alexander Rakhlin, and Ohad Shamir. ence on Computer Vision and Pattern Recognition, Size-independent sample complexity of neural net- Colorado Springs, CO, June 2011. works. arXiv preprint arXiv:1712.06541, 2017. Ilmun Kim, Aaditya Ramdas, Aarti Singh, and Larry Gramatik. The age of reason. http:// Wasserman. Classification accuracy as a proxy for dl.lowtempmusic.com/Gramatik-TAOR.zip, 2014. two sample testing. arXiv preprint arXiv:1602.02210, [Online; accessed May/23/2019]. 2016. Arthur Gretton, Karsten M Borgwardt, Malte J Rasch, Diederik P Kingma and Jimmy Ba. Adam: A Bernhard Schölkopf, and Alexander Smola. A ker- method for stochastic optimization. arXiv preprint nel two-sample test. Journal of Machine Learning arXiv:1412.6980, 2014. Research, 13(Mar):723–773, 2012a. Erich L Lehmann and Joseph P Romano. Testing statistical hypotheses. Springer Science & Business Arthur Gretton, Dino Sejdinovic, Heiko Strathmann, Media, 2006. Sivaraman Balakrishnan, Massimiliano Pontil, Kenji Fukumizu, and Bharath K Sriperumbudur. Optimal Chun-Liang Li, Wei-Cheng Chang, Yu Cheng, Yiming kernel choice for large-scale two-sample tests. In Yang, and Barnabás Póczos. Mmd gan: Towards Advances in neural information processing systems, deeper understanding of matching network. pages 1205–1213, 2012b. In Advances in Neural Information Processing Sys- tems, pages 2203–2213, 2017. Boris Hanin. Universal function approximation by deep neural nets with bounded width and relu activations. David Lopez-Paz and Maxime Oquab. Revisit- arXiv preprint arXiv:1708.02691, 2017. ing classifier two-sample tests. arXiv preprint arXiv:1610.06545, 2016. Zaïd Harchaoui, Francis R Bach, and Èric Moulines. Testing for with kernel fisher discrim- Jie Lu, Vahid Behbood, Peng Hao, Hua Zuo, Shan inant analysis. In Advances in Neural Information Xue, and Guangquan Zhang. Transfer learning using Processing Systems, pages 609–616, 2008. computational intelligence: a survey. Knowledge- Based Systems, 80:14–23, 2015. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. Daniel Lundqvist, Anders Flykt, and Arne Öhman. The In Proceedings of the IEEE conference on computer karolinska directed emotional faces (kdef). CD ROM vision and pattern recognition, pages 770–778, 2016. from Department of Clinical Neuroscience, Psychol- ogy section, Karolinska Institutet, 91:630, 1998. Avram J Holmes, Marisa O Hollinshead, Timothy M O’Keefe, Victor I Petrov, Gabriele R Fariello, S. Maji, J. Kannala, E. Rahtu, M. Blaschko, and Lawrence L Wald, Bruce Fischl, Bruce R Rosen, A. Vedaldi. Fine-grained visual classification of air- Ross W Mair, Joshua L Roffman, et al. Brain ge- craft. Technical report, 2013. nomics superstruct project initial data release with Bettina Mieth, Marius Kloft, Juan Antonio Rodríguez, structural, functional, and behavioral measures. Sci- Sören Sonnenburg, Robin Vobruba, Carlos Morcillo- entific data, 2:150031, 2015. Suárez, Xavier Farré, Urko M Marigorta, Ernst Fehr, Sergey Ioffe and Christian Szegedy. Batch normal- Thorsten Dickhaus, et al. Combining multiple hy- ization: Accelerating deep network training by pothesis testing with machine learning increases the reducing internal covariate shift. arXiv preprint statistical power of genome-wide association studies. arXiv:1502.03167, 2015. Scientific reports, 6:36671, 2016. Wittawat Jitkrittum, Zoltán Szabó, Kacper P Mehryar Mohri, Afshin Rostamizadeh, and Ameet Tal- Chwialkowski, and Arthur Gretton. Interpretable dis- walkar. Foundations of machine learning. MIT press, tribution features with maximum testing power. In 2018. Advances in Neural Information Processing Systems, Susanne G Mueller, Michael W Weiner, Leon J Thal, pages 181–189, 2016. Ronald C Petersen, Clifford Jack, William Jagust, Wittawat Jitkrittum, Heishiro Kanagawa, Patsorn John Q Trojanowski, Arthur W Toga, and Laurel Sangkloy, James Hays, Bernhard Schölkopf, and Beckett. The alzheimer’s disease neuroimaging ini- Arthur Gretton. Informative features for model com- tiative. Neuroimaging Clinics, 15(4):869–877, 2005. parison. In Advances in Neural Information Process- J Neyman and ES Pearson. On the problem of the most ing Systems, pages 808–819, 2018. efficient tests of statistical hypotheses. Philosophical Kirchler, Khorasani, Kloft, Lippert

Transactions of the Royal Society of London. Series A, Containing Papers of a Mathematical or Physical Character, 231:289–337, 1933. Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. In NIPS-W, 2017. Lutz Prechelt. Early stopping-but when? In Neural Networks: Tricks of the trade, pages 55–69. Springer, 1998. Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexan- der C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015. doi: 10.1007/s11263-015-0816-y. Ingo Steinwart and Andreas Christmann. Support vec- tor machines. Springer Science & Business Media, 2008. C. Wah, S. Branson, P. Welinder, P. Perona, and S. Be- longie. The Caltech-UCSD Birds-200-2011 Dataset. Technical Report CNS-TR-2011-001, California In- stitute of Technology, 2011. Qiantong Xu, Gao Huang, Yang Yuan, Chuan Guo, Yu Sun, Felix Wu, and Kilian Weinberger. An em- pirical study on evaluation metrics of generative ad- versarial networks. arXiv preprint arXiv:1806.07755, 2018. Hao Zhou, Vamsi K Ithapu, Sathya Narayanan Ravi, Vikas Singh, Grace Wahba, and Sterling C Johnson. Hypothesis testing in unsupervised domain adap- tation with applications in alzheimer’s disease. In Advances in neural information processing systems, pages 2496–2504, 2016. Quan Zhou and Yongtao Guan. On the null distribu- tion of bayes factors in . Journal of the American Statistical Association, 113(523): 1362–1371, 2018.