A Test for the Two-Sample Problem Using Mutual Information to Fix Information Leak in E-Passports

Apratim Guha ∗
School of Mathematics, University of Birmingham, Birmingham, U.K.
and
Tom Chothia
School of Computer Science, University of Birmingham, Birmingham, U.K.

February 8, 2011

Abstract

For two independently drawn samples from continuous distributions, a statistical test based on the mutual information statistic is proposed for the null hypothesis that both the samples originate from the same population. It is demonstrated through simulation that this test is more powerful than the commonly used nonparametric tests in many situations. As an application we discuss the information leak of e-passports. Using the mutual information based tests, we show that it is possible to trace a passport by comparing the time taken to respond to a recorded message. We establish that the mutual information based tests detect differences where other tests fail. We also explore the effect of adding an artificial fixed-time delay in specific situations to stop the information leak, and observe that the mutual information based test detects a leak in a situation where the other non-parametric tests fail.

Keywords: Nonparametric Test, Information Theory, Test of Independence, Kernel Density Estimation, Bandwidth, Anonymity.

∗ We are grateful to Y. Xiao for his comments on the manuscript as well as his help with the coding. The main theorem in this paper derives from A. Guha's Ph.D. thesis supervised by D. Brillinger, whom he thanks for his encouragement, help and guidance.

1 Introduction

The motivation of this work arises from an anonymity problem in computer science. Chothia and Smirnov (2010) discuss a time-based traceability attack on an e-passport in which, by eavesdropping on its communication with a reader, it is possible to trace the passport later.
In that work, one session between the passport and a legitimate reader was recorded, and then, by comparing the response time to a previously recorded message, a leak was inferred on the basis of a visual inspection of the plot of the response times for the same and a different passport. However, no quantitative measure was used. To quantify the difference in response times between different passports, we introduce a mutual information (MI) based test statistic to compare two independently drawn samples from continuous distributions, testing the null hypothesis that both samples originate from the same population.

When the underlying distributions are continuous, several well-known non-parametric tests are available. Most of them are based on the empirical distribution function: the Kolmogorov-Smirnov (KS) test, the Wilcoxon test (Gibbons and Chakraborti 2010), the Anderson-Darling (AD) test (Pettitt 1976) and the Cramér-von Mises (CVM) test (Anderson 1962) are some of the most popular ones. A two-sample t-test can be used when a difference in location is suspected. A modification of the Wilcoxon test proposed by Baumgartner, Weiß and Schindler (1998), henceforth referred to as the BWS test, has superior power compared to the CVM and the KS tests in a wide variety of situations. However, it should be used with caution, as this test does not control the type I error rate in some cases; see Neuhäuser (2005).

Among the available parametric and semiparametric choices for two-sample tests, Zhang (2006) and Wylupek (2010) provide two recent examples: the former introduces a likelihood-ratio based parametric test, the latter discusses a "data-driven" semiparametric test. However, as may be observed from the density estimates of the response times of various passports in Chothia and Smirnov (2010), a parametric or semiparametric distribution-based approach would not work well for the passports, as some of them have bimodal response times and some do not.
Hence in this article we limit our discussion to non-parametric tests only.

For distributions with the same location parameter but different otherwise, the t-test and the Wilcoxon test do not work well. Such a situation arises during the analysis of the e-passport information leak in section 4. The AD test, the CVM test, the KS test and the BWS test do work reasonably well in such situations; however, we will see in sections 3 and 4 that the MI test works better than these tests in many situations.

Examples of applications of various two-sample tests in analysing computer science data can be found in the literature; for a recent example see Jeske, Lockhart, Stephens and Zhang (2008). The application of MI in computer science data analysis is also popular; some recent examples are Alvim, Andrés and Palamidessi (2010), Chatzikokolakis, Palamidessi and Panangaden (2008) and Chatzikokolakis, Chothia and Guha (2010). Applications also exist in other areas of science; for some examples see Paninski (2003), Biswas and Guha (2010) and the references within. MI has also been used in the time series context; for example, see Brillinger (2004) and Brillinger & Guha (2007). However, as far as we know, a two-sample test based on MI has never been used before.

To fix ideas, let us start with k random variables $X_1, X_2, \dots, X_k$ with joint density $p_{X_1 X_2 \cdots X_k}(\cdot)$ with respect to some measure $\mu$. Shannon (1948) introduced the concept of mutual information (he called it 'relative entropy'), defined as

\[
I_{X_1, X_2, \dots, X_k} = \int_{p_{X_1 X_2 \cdots X_k}(x) > 0} \log_2 \frac{p_{X_1 X_2 \cdots X_k}(x)}{p_{X_1}(x_1)\, p_{X_2}(x_2) \cdots p_{X_k}(x_k)} \; p_{X_1 X_2 \cdots X_k}(x) \, d\mu(x) \tag{1}
\]

where $x = (x_1, x_2, \dots, x_k)$ and $p_{X_j}(\cdot)$, $j = 1, \dots, k$, are the marginal densities of $X_1, X_2, \dots, X_k$ respectively. Notice that $I_{X_1, X_2, \dots, X_k} = 0$ if the variables are independent. It may be noted here that it is customary in the area of information theory and computer science to consider the logarithm in entropic measures to the base 2.
We are following that convention in this paper; henceforth the base in a logarithmic expression, unless mentioned otherwise, should be understood to be 2.

When the joint distribution is continuous, the Lebesgue measure can be used as the dominating measure, and in a discrete setup one may use the counting measure. In a hybrid setup, i.e. when some random variables are discrete and the rest are continuous, $\mu$ is an appropriate product of the two measures.

The mutual information (MI) is non-negative; it is zero when the random variables are mutually independent and attains its maximum when the concerned random variables have a perfect functional relationship (Cover and Thomas (1991), Biswas and Guha (2009)). Hence it is a very useful extension of correlation techniques, which are useful only for studying linear dependence.

As a special case of (1), the MI statistic for two random variables $X$ and $Y$ with joint density function $p_{XY}(x, y)$ with respect to some dominating measure $\mu$ may be defined as

\[
I_{XY} = \iint_{p_{XY}(x, y) > 0} \log \frac{p_{XY}(x, y)}{p_X(x)\, p_Y(y)} \; p_{XY}(x, y) \, d\mu(x, y), \tag{2}
\]

where $p_X(x)$ and $p_Y(y)$ are the respective marginals. Now, consider a hybrid pair $(X, Y)$ where $X$ is a binary random variable and $Y$ is a continuous random variable. We may write

\[
I_{XY} = \int_{y:\, p(0, y) > 0} \log \frac{p(0, y)}{p_0\, p(y)} \; p(0, y) \, dy + \int_{y:\, p(1, y) > 0} \log \frac{p(1, y)}{p_1\, p(y)} \; p(1, y) \, dy, \tag{3}
\]

where the joint density parameters are defined as

\[
P[X = 0,\; y < Y < y + dy] = p(0, y)\, dy, \qquad P[X = 1,\; y < Y < y + dy] = p(1, y)\, dy \tag{4}
\]

and the order 1 parameters are given by

\[
P[y < Y < y + dy] = p(y)\, dy, \qquad P[X = 1] = P(1) = p_1, \qquad P[X = 0] = p_0 = 1 - p_1, \tag{5}
\]

so that $p(y) = p(0, y) + p(1, y)$.

In this paper, we utilize the form of the MI statistic as described in (3) to assess the independence of two samples. Towards that, let us denote the two samples by $Y_0 := \{Y_{01}, Y_{02}, \dots, Y_{0n}\}$ and $Y_1 := \{Y_{11}, Y_{12}, \dots, Y_{1m}\}$.
We assume here that the samples are from two continuous distributions: $Y_{01}, Y_{02}, \dots, Y_{0n}$ are independently sampled from the distribution $F_0$, and $Y_{11}, Y_{12}, \dots, Y_{1m}$ are independently sampled from the distribution $F_1$. Let us set the null hypothesis $H_0: F_0 = F_1$ and the alternative $H_1: F_0 \neq F_1$.

The idea of this test comes from the fact that the MI between two random variables is zero only when they are independent. To utilise this idea, we combine $Y_0$ and $Y_1$ in one single vector $Y$, and create a 0-1 valued vector $X$ whose j-th element, say $X_j$, is 0 or 1 according to whether the j-th element of $Y$, say $Y_j$, is from $Y_0$ or $Y_1$. In other words, $X$ is a vector of length $N := n + m$ with $n$ zeroes followed by $m$ ones. Under $H_0$, $Y$ would be independent of $X$ in the sense that whether $X_j$ is 0 or 1 will have no bearing on the value of $Y_j$. Hence the estimated MI between $Y$ and $X$ would differ only due to random error from the estimated MI between $Y'$, a typical sample of length $N$ from $F_0$, and $X'$, a typical sample of length $N$ from the Bernoulli distribution with $P(1) = p_1 = m/N$. We may note here that as $F_0$ and the above-mentioned Bernoulli distribution are independent, the true MI between these two distributions is 0. Now under $H_1$, $X$ is completely fixed by our choice of sample, and hence it is clearly related to $Y$. Hence the MI is higher, and so, typically, should be the estimated MI.
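
The construction of the pooled vector Y and the 0-1 label vector X can be illustrated with a short Python sketch. It computes a plug-in estimate of the hybrid MI in (3) by replacing the two class-conditional densities with Gaussian kernel density estimates and averaging the log-ratio over the pooled sample. The Gaussian kernel, the normal-reference bandwidth rule, the resubstitution form of the estimate, the permutation calibration and all function names are assumptions made for this sketch; they should not be read as the estimator or the calibration adopted in this paper.

```python
import math
import random
import statistics


def bandwidth(sample):
    """Normal-reference rule of thumb (an assumption): 1.06 * sd * n^(-1/5)."""
    return 1.06 * statistics.stdev(sample) * len(sample) ** (-0.2)


def kde(sample):
    """Return a Gaussian kernel density estimate built from `sample`."""
    h = bandwidth(sample)
    scale = 1.0 / (len(sample) * h * math.sqrt(2.0 * math.pi))

    def density(y):
        return scale * sum(math.exp(-0.5 * ((y - s) / h) ** 2) for s in sample)

    return density


def pool(y0, y1):
    """Combine two samples into Y and build X: n zeroes followed by m ones."""
    y = list(y0) + list(y1)
    x = [0] * len(y0) + [1] * len(y1)
    return y, x


def mi_estimate(y, x):
    """Plug-in estimate, in bits, of the MI in (3) between labels x and values y."""
    big_n = len(y)
    groups = {k: [yj for yj, xj in zip(y, x) if xj == k] for k in (0, 1)}
    weight = {k: len(groups[k]) / big_n for k in (0, 1)}  # estimates of p_0, p_1
    dens = {k: kde(groups[k]) for k in (0, 1)}  # estimates of p(k, y) / p_k
    total = 0.0
    for yj, xj in zip(y, x):
        pooled = weight[0] * dens[0](yj) + weight[1] * dens[1](yj)  # p(y)
        total += math.log2(dens[xj](yj) / pooled)
    return total / big_n


def permutation_p_value(y0, y1, n_perm=199, seed=0):
    """Illustrative calibration: compare the estimate with label-shuffled copies."""
    rng = random.Random(seed)
    y, x = pool(y0, y1)
    observed = mi_estimate(y, x)
    exceed = 0
    for _ in range(n_perm):
        shuffled = x[:]
        rng.shuffle(shuffled)  # breaks any link between labels and values
        if mi_estimate(y, shuffled) >= observed:
            exceed += 1
    return (1 + exceed) / (1 + n_perm)


if __name__ == "__main__":
    rng = random.Random(1)
    y0 = [rng.gauss(0.0, 1.0) for _ in range(40)]
    y1 = [rng.gauss(0.0, 2.0) for _ in range(40)]  # same location, larger spread
    y, x = pool(y0, y1)
    print("estimated MI (bits):", mi_estimate(y, x))
    print("permutation p-value:", permutation_p_value(y0, y1))
```

Shuffling the labels plays the role of the independent pair (Y', X') described above: it keeps the pooled values and the proportion of ones fixed while removing any dependence between label and value. In this sketch the estimate cannot exceed the entropy of the labels, which is 1 bit when n = m, and it reaches that bound only when the two kernel estimates do not overlap at any sample point.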
