Two-Sample Testing Using Deep Learning
Matthias Kirchler¹,²  Shahryar Khorasani¹  Marius Kloft²,³  Christoph Lippert¹,⁴
¹Hasso Plattner Institute for Digital Engineering, University of Potsdam, Germany
²Technical University of Kaiserslautern, Germany
³University of Southern California, Los Angeles, United States
⁴Hasso Plattner Institute for Digital Health at Mount Sinai, New York, United States

Abstract

We propose a two-sample testing procedure based on learned deep neural network representations. To this end, we define two test statistics that perform an asymptotic location test on data samples mapped onto a hidden layer. The tests are consistent and asymptotically control the type-1 error rate. Their test statistics can be evaluated in linear time (in the sample size). Suitable data representations are obtained in a data-driven way, by solving a supervised or unsupervised transfer-learning task on an auxiliary (potentially distinct) data set. If no auxiliary data is available, we split the data into two chunks: one for learning representations and one for computing the test statistic. In experiments on audio samples, natural images, and three-dimensional neuroimaging data, our tests yield significant decreases in type-2 error rate (up to 35 percentage points) compared to state-of-the-art two-sample tests such as kernel methods and classifier two-sample tests.*

*We provide code at https://github.com/mkirchler/deep-2-sample-test

1 INTRODUCTION

For almost a century, statistical hypothesis testing has been one of the main methodologies in statistical inference (Neyman and Pearson, 1933). A classic problem is to validate whether two sets of observations are drawn from the same distribution (null hypothesis) or not (alternative hypothesis). This procedure is called a two-sample test.

Two-sample tests are a pillar of applied statistics and a standard method for analyzing empirical data in the sciences, e.g., medicine, biology, psychology, and social sciences. In machine learning, two-sample tests have been used to evaluate generative adversarial networks (Bińkowski et al., 2018), to test for covariate shift in data (Zhou et al., 2016), and to infer causal relationships (Lopez-Paz and Oquab, 2016).

There are two main types of two-sample tests: parametric and non-parametric ones. Parametric two-sample tests, such as Student's t-test, make strong assumptions on the distribution of the data (e.g., Gaussian). This allows us to compute p-values in closed form. However, parametric tests may fail when their assumptions on the data distribution are invalid. Non-parametric tests, on the other hand, make no distributional assumptions and thus can potentially be applied in a wider range of application scenarios. Computing non-parametric test statistics, however, can be costly, as it may require applying re-sampling schemes or computing higher-order statistics.

A non-parametric test that has gained a lot of attention in the machine-learning community is the kernel two-sample test and its test statistic: the maximum mean discrepancy (MMD). MMD computes the average distance of the two samples mapped into the reproducing kernel Hilbert space (RKHS) of a universal kernel (e.g., the Gaussian kernel). MMD critically relies on the choice of the feature representation (i.e., the kernel function) and thus might fail for complex, structured data such as sequences or images, and other data where deep learning excels.

Another non-parametric two-sample test is the classifier two-sample test (C2ST). C2ST splits the data into two chunks, training a classifier on one part and evaluating it on the remaining data.
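To make the kernel two-sample test described above concrete, the following is a minimal sketch of the Gaussian-kernel MMD with a permutation test for the p-value. This is illustrative background, not the paper's proposed method; the function names (`mmd2`, `permutation_test`) and the quadratic-time biased estimator are choices made here for brevity.

```python
import numpy as np

def gaussian_kernel(a, b, sigma=1.0):
    """Gaussian (RBF) kernel matrix between the rows of a and b."""
    sq = np.sum(a**2, 1)[:, None] + np.sum(b**2, 1)[None, :] - 2 * a @ b.T
    return np.exp(-sq / (2 * sigma**2))

def mmd2(x, y, sigma=1.0):
    """Biased (V-statistic) estimate of the squared MMD between samples x and y."""
    kxx = gaussian_kernel(x, x, sigma)
    kyy = gaussian_kernel(y, y, sigma)
    kxy = gaussian_kernel(x, y, sigma)
    return kxx.mean() + kyy.mean() - 2 * kxy.mean()

def permutation_test(x, y, sigma=1.0, n_perm=200, seed=0):
    """Approximate p-value for H0: p = q by permuting the pooled sample."""
    rng = np.random.default_rng(seed)
    observed = mmd2(x, y, sigma)
    pooled = np.vstack([x, y])
    count = 0
    for _ in range(n_perm):
        perm = rng.permutation(len(pooled))
        xs, ys = pooled[perm[:len(x)]], pooled[perm[len(x):]]
        if mmd2(xs, ys, sigma) >= observed:
            count += 1
    return (count + 1) / (n_perm + 1)
```

Note that this naive estimator costs O((n+m)²) kernel evaluations per permutation; the paper's point of departure is that the statistic's power also hinges on the kernel choice, which a fixed Gaussian kernel leaves to a bandwidth heuristic.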
If the classifier predicts significantly better than chance, the test rejects the null hypothesis. Since part of the data set needs to be put aside for training, not the full data set is used for computing the test statistic, which limits the power of the method. Furthermore, the performance of the method depends on the selection of the train-test split.

Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics (AISTATS) 2020, Palermo, Italy. PMLR: Volume 108. Copyright 2020 by the author(s).

In this work, we propose a two-sample testing procedure that uses deep learning to obtain a suitable data representation. It first maps the data onto a hidden layer of a deep neural network that was trained (in an unsupervised or supervised fashion) on an independent, auxiliary data set, and then it performs a location test. Thus we are able to work on any kind of data that neural networks can work on, such as audio, images, videos, time series, graphs, and natural language. We propose two test statistics that can be evaluated in linear time (in the number of observations), based on MMD and Fisher discriminant analysis, respectively. We derive asymptotic distributions of both test statistics. Our theoretical analysis proves that the two-sample test procedure asymptotically controls the type-1 error rate, has asymptotically vanishing type-2 error rate, and is robust both with respect to transfer learning and approximate training.

We empirically evaluate the proposed methodology in a variety of applications from the domains of computational musicology, computer vision, and neuroimaging. In these experiments, the proposed deep two-sample tests consistently outperform the closest competing method (including deep kernel methods and C2STs) by up to 35 percentage points in terms of the type-2 error rate, while properly controlling the type-1 error rate.

2 PROBLEM STATEMENT & NOTATION

We consider non-parametric two-sample statistical testing, that is, to answer the question whether two samples are drawn from the same (unknown) distribution or not. We distinguish between the case that the two samples are drawn from the same distribution (the null hypothesis, denoted by $H_0$) and the case that the samples are drawn from different distributions (the alternative hypothesis $H_1$). We differentiate between type-1 errors (i.e., rejecting the null hypothesis although it holds) and type-2 errors (i.e., not rejecting $H_0$ although it does not hold). We strive for both the type-1 error rate to be upper bounded by some significance level $\alpha$, and the type-2 error rate to converge to 0 for unlimited data. The latter property is called consistency and means that with sufficient data, the test can reliably distinguish between any pair of probability distributions.

Let $p, q, p'$, and $q'$ be probability distributions on $\mathbb{R}^d$ with common dominating Borel measure $\mu$. We abuse notation somewhat and denote the densities with respect to $\mu$ also by $p, q, p'$, and $q'$. We want to perform a two-sample test on data drawn from $p$ and $q$, i.e., we test the null hypothesis $H_0: p = q$ against the alternative $H_1: p \neq q$. $p'$ and $q'$ are assumed to be in some sense similar to $p$ and $q$, respectively, and act as an auxiliary task for tuning the test (the case of $p = p'$ and $q = q'$ is perfectly valid, in which case this is equivalent to a data-splitting technique).

We have access to four (independent) sets $X_n, Y_n, X'_{n'}$, and $Y'_{m'}$ of observations drawn from $p, q, p'$, and $q'$, respectively. Here $X_n = \{X_1, \ldots, X_n\} \subset \mathbb{R}^d$ and $X_i \sim p$ for all $i$ (analogous definitions hold for $Y_n, X'_{n'}$, and $Y'_{m'}$). Empirical averages with respect to a function $f$ are denoted by $\overline{f(X_n)} := \frac{1}{n} \sum_{i=1}^{n} f(X_i)$.

We investigate function classes of deep ReLU networks with a final tanh activation function:

$$\mathcal{TF}_N := \Big\{ \tanh \circ W_{D-1} \circ \sigma \circ \cdots \circ \sigma \circ W_1 : \mathbb{R}^d \to \mathbb{R}^H \;\Big|\; W_1 \in \mathbb{R}^{H \times d},\; W_j \in \mathbb{R}^{H \times H} \text{ for } j = 2, \ldots, D-1,\; \prod_{j=1}^{D-1} \|W_j\|_{Fro} \leq \beta_N,\; D \leq D_N \Big\}.$$

Here, the activation functions $\tanh$ and $\sigma(z) := \mathrm{ReLU}(z) = \max(0, z)$ are applied elementwise, $\|\cdot\|_{Fro}$ is the Frobenius norm, $H = d + 1$ is the width, and $D_N$ and $\beta_N$ are depth and weight restrictions on the networks. This can be understood as the mapping onto the last hidden layer of a neural network concatenated with a tanh activation.

3 DEEP TWO-SAMPLE TESTING

In this section, we propose two-sample testing based on two novel test statistics, the Deep Maximum Mean Discrepancy (DMMD) and the Deep Fisher Discriminant Analysis (DFDA). The test asymptotically controls the type-1 error rate, and it is consistent (i.e., the type-2 error rate converges to 0). Furthermore, we will show that consistency is preserved both under transfer learning on a related task and under only approximately solving the training step.

3.1 Proposed Two-sample Test

Our proposed test consists of the following two steps. 1. We train a neural network over an auxiliary training data set. 2. We then evaluate the maximum mean discrepancy test statistic (Gretton et al., 2012a) (or a variant of it) using as kernel the mapping from the input domain onto the network's last hidden layer.

3.1.1 Training Step

Let the training data be $X'_{n'}$ and $Y'_{m'}$. Denote $N = n' + m'$. We run a (potentially inexact) training algorithm to find $\varphi_N \in \mathcal{TF}_N$ with

$$\left\| \frac{1}{N} \left( \sum_{i=1}^{n'} \varphi_N(X'_i) - \sum_{i=1}^{m'} \varphi_N(Y'_i) \right) \right\| + \eta \;\geq\; \max_{\varphi \in \mathcal{TF}_N} \left\| \frac{1}{N} \left( \sum_{i=1}^{n'} \varphi(X'_i) - \sum_{i=1}^{m'} \varphi(Y'_i) \right) \right\|.$$

Here, $\eta \geq 0$ is a fixed leniency parameter (independent of $N$); finding true global optima in neural networks is a hard problem, and an $\eta > 0$ allows us to settle for good-enough, local solutions.

Interpretation as Empirical Risk Minimization. If we identify $X'_i$ with $(Z'_i, 1)$ and $Y'_i$ with $(Z'_{n'+i}, -1)$ in a regression setting, this is equivalent to an (inexact) empirical risk minimization with loss function $L(t, \hat{t}) = 1 - t\hat{t}$:

$$\max_{\varphi} \left\| \frac{1}{N} \sum_{i=1}^{N} t'_i \varphi(Z'_i) \right\| = \max_{\varphi} \max_{\|w\| \leq 1} \frac{1}{N} \sum_{i=1}^{N} t'_i w^\top \varphi(Z'_i),$$

which is equivalent to

$$\min_{\varphi} \min_{\|w\| \leq 1} \frac{1}{N} \sum_{i=1}^{N} L\!\left(t'_i, w^\top \varphi(Z'_i)\right).$$
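Once a feature map $\varphi_N$ is fixed, the mean-difference statistic underlying the objective above is computable in linear time: it is just the squared norm of the difference of the two empirical mean embeddings. The sketch below illustrates this with a random (untrained) ReLU network with a final tanh layer standing in for the trained $\varphi_N$; the helper names (`make_feature_map`, `dmmd_statistic`) are ours, and no training step, normalization, or null calibration from the paper is included.

```python
import numpy as np

def make_feature_map(d, width, depth, seed=0):
    """A random ReLU network with final tanh, standing in for a trained phi_N.
    (In the paper's procedure the weights would come from the training step.)"""
    rng = np.random.default_rng(seed)
    ws = [rng.normal(0, 1 / np.sqrt(d), (width, d))]
    ws += [rng.normal(0, 1 / np.sqrt(width), (width, width)) for _ in range(depth - 2)]

    def phi(x):
        h = x.T                       # shape (d, n)
        for w in ws[:-1]:
            h = np.maximum(0, w @ h)  # elementwise ReLU
        return np.tanh(ws[-1] @ h).T  # final tanh, shape (n, width)

    return phi

def dmmd_statistic(phi, x, y):
    """Linear-time mean-difference statistic: squared norm of the difference
    between the empirical mean embeddings of x and y under phi."""
    diff = phi(x).mean(0) - phi(y).mean(0)
    return float(diff @ diff)
```

Evaluating `dmmd_statistic` requires one forward pass per observation, i.e., O(n + m) network evaluations, in contrast to the O((n + m)²) kernel evaluations of the naive quadratic-time MMD estimator.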