
A Test for the Two-Sample Problem using Mutual Information to Fix the Information Leak in e-Passports

Apratim Guha ∗ School of Mathematics, University of Birmingham, Birmingham, U.K. and Tom Chothia School of Computer Science, University of Birmingham, Birmingham, U.K. February 8, 2011

Abstract For two independently drawn samples from continuous distributions, a statistical test based on the mutual information statistic is proposed for the null hypothesis that both the samples originate from the same population. It is demonstrated through simulation that this test is more powerful than the commonly used nonparametric tests in many situations. As an application we discuss the information leak of e-passports. Using the mutual information based test, we show that it is possible to trace a passport by comparing the time taken to respond to a recorded message. We establish that the mutual information based test detects differences where other tests fail. We also explore the effect of adding an artificial fixed-time delay in specific situations to stop the information leak, and observe that the mutual information based test detects a leak in a situation where the other non-parametric tests fail.

Keywords: Nonparametric Test, Test of Independence, Kernel Density Estimation, Bandwidth, Anonymity.

∗We are grateful to Y. Xiao for his comments on the manuscript as well as his help with the coding. The main theorem in this paper derives from A. Guha's Ph.D. thesis, supervised by D. Brillinger, whom he thanks for his encouragement, help and guidance.

1 Introduction

The motivation for this work arises from an anonymity problem in computer science. Chothia and Smirnov (2010) discuss a time-based traceability attack on an e-passport: by eavesdropping on its communication with a reader, it is possible to trace the passport later. In that work, one session between the passport and a legitimate reader was recorded, and then, by comparing the response times to a previously recorded message, a leak was inferred on the basis of a visual inspection of the plot of the response times for the same and a different passport. However, no quantitative measure was used. To quantify the difference in the response times for different passports, we introduce a mutual information (MI) based test statistic to compare two independently drawn samples from continuous distributions, testing the null hypothesis that both samples originate from the same population.

When the underlying distributions are continuous, several well-known non-parametric tests are available. Most of them are based on the empirical distribution function: the Kolmogorov-Smirnov (KS) test, the Wilcoxon test (Gibbons and Chakraborti 2010), the Anderson-Darling (AD) test (Pettitt 1976) and the Cramér-von Mises (CVM) test (Anderson 1962) are some of the most popular ones. A two-sample t-test can be used when a difference in location is suspected. A modification of the Wilcoxon test proposed by Baumgartner, Weiß and Schindler (1998), henceforth referred to as the BWS test, has superior power compared to the CVM and the KS tests in a wide variety of situations. However, it should be used with caution, as this test does not control the type I error rate in some cases; see Neuhäuser (2005). Among the available parametric and semiparametric choices for two-sample tests, Zhang (2006) and Wylupek (2010) provide two of the newest examples: the former introduces a likelihood-ratio based parametric test, the latter discusses a "data-driven" semi-parametric test. However, as we may observe from the density estimates of the response times of various passports in Chothia and Smirnov (2010), a parametric or semiparametric distribution-based approach would not work well for the passports, as some of them have bimodal response times and some do not. Hence in this article we limit our discussion to non-parametric tests only.

For distributions that share the same location parameter but differ otherwise, the t-test or the Wilcoxon test do not work well. Such a situation arises during the analysis of the e-passport information leak in Section 4. The AD test, the CVM test, the KS test and the BWS test do work reasonably

well in such situations; however, we will see in Sections 3 and 4 that the MI test works better than these tests in many situations.

Examples of applications of various two-sample tests in analysing computer science data can be found in the literature; for a recent example see Jeske, Lockhart, Stephens and Zhang (2008). The application of MI in computer science data analysis is also popular; some recent examples are Alvim, Andrés and Palamidessi (2010), Chatzikokolakis, Palamidessi and Panangaden (2008) and Chatzikokolakis, Chothia and Guha (2010). Applications also exist in other areas of science; for some examples see Paninski (2003), Biswas and Guha (2010) and the references within. MI has also been used in other contexts, for example see Brillinger (2004) and Brillinger and Guha (2007). However, as far as we know, a two-sample test based on MI has never been used before.

To fix ideas, let us start with k random variables X1, X2, ···, Xk with joint density pX1X2···Xk(·) with respect to some measure µ. Shannon (1948) introduced the concept of mutual information (he called it 'relative entropy'), defined as
$$I_{X_1,X_2,\cdots,X_k} = \int_{p_{X_1X_2\cdots X_k}(x)>0} \log_2\left(\frac{p_{X_1X_2\cdots X_k}(x)}{p_{X_1}(x_1)\,p_{X_2}(x_2)\cdots p_{X_k}(x_k)}\right) p_{X_1X_2\cdots X_k}(x)\, d\mu(x) \qquad (1)$$

where x = (x1, x2, ···, xk) and pXj(·), j = 1, ···, k, are the marginal densities of X1, X2, ···, Xk respectively. Notice that IX1,X2,···,Xk = 0 if the variables are independent. It may be noted here that it is customary in information theory and computer science to take the logarithm in entropic measures to the base 2. We follow that convention in this paper; henceforth the base of a logarithm, unless mentioned otherwise, should be understood to be 2. When the joint distribution is continuous, the Lebesgue measure can be used as the dominating measure, and in a discrete setup one may use the counting measure. In a hybrid setup, i.e. when some random variables are discrete and the rest are continuous, µ is an appropriate product of the two measures. The mutual information (MI) is non-negative; it is zero when the random variables are mutually independent and attains its maximum when the random variables concerned have a perfect functional relationship (Cover and Thomas (1991), Biswas and Guha (2009)). Hence it is a very useful extension of correlation techniques, which are only useful for studying linear dependence. As a special case of (1), the MI statistic for two random variables X and Y with joint density function pXY(x, y) with respect to some dominating measure µ may be defined as

$$I_{XY} = \iint_{p_{XY}(x,y)>0} \log\left(\frac{p_{XY}(x,y)}{p_X(x)\,p_Y(y)}\right) p_{XY}(x,y)\, d\mu(x,y); \qquad (2)$$
where pX(x) and pY(y) are the respective marginals. Now, consider a hybrid pair (X, Y) where X is a binary and Y is a continuous random variable. We may write
$$I_{XY} = \int_{y:\,p(0,y)>0} \log\left(\frac{p(0,y)}{p_0\,p(y)}\right) p(0,y)\, dy + \int_{y:\,p(1,y)>0} \log\left(\frac{p(1,y)}{p_1\,p(y)}\right) p(1,y)\, dy, \qquad (3)$$
where the joint density parameters are defined as

P[X = 0, y < Y < y + dy] = p(0, y) dy,
P[X = 1, y < Y < y + dy] = p(1, y) dy,     (4)

and the order 1 parameters are given by

P[y < Y < y + dy] = p(y) dy;  P[X = 1] = P(1) = p1;  P[X = 0] = p0 = 1 − p1;     (5)

so that p(y) = p(0, y) + p(1, y).

In this paper, we utilize the form of the MI statistic described in (3) to assess the independence of two samples. Towards that, let us denote the two samples by Y0 := {Y01, Y02, ···, Y0n} and Y1 := {Y11, Y12, ···, Y1m}. We assume here that the samples are from two continuous distributions: Y01, Y02, ···, Y0n are independently sampled from the distribution F0, and further

Y11, Y12, ···, Y1m are independently sampled from the distribution F1. Let us set the null hypothesis H0: F0 = F1 and the alternative H1: F0 ≠ F1. The idea of this test comes from the fact that the MI between two random variables is zero only when they are independent. To utilise this idea, we combine Y0 and Y1 in one single vector Y, and create a 0-1 valued vector X, whose j-th element, say Xj, is 0 or 1 according to whether the jth element of Y, say Yj, is from Y0 or Y1. In other words, X is a vector of length N := (n + m) with n zeroes followed by m 1s. Under H0, Y would be independent of X in the sense that whether

Xj is 0 or 1 will have no bearing on the value of Yj, and hence the estimated MI between Y and X would differ from the estimated MI between Y′, a typical sample of length N from F0, and X′, a typical sample of length N from the Bernoulli distribution with P(1) = p1 = m/N, only due to

random error. We may note here that as Y′ and X′ are independent, the true MI between them is 0.

Now under H1, X is completely fixed by our choice of sample, and hence it is clearly related to Y. Hence the MI is higher, and so, typically, is the estimated MI. Therefore we reject

H0 if the estimated MI statistic is 'large'. A more precise rejection criterion is discussed in Section 2. It can easily be shown that the MI between a continuous random variable and a binary random variable is bounded by 1 (indeed, IXY = H(X) − H(X|Y) ≤ H(X) ≤ 1 bit for a binary X), and hence values of MI close to 1 suggest a high degree of dependence between X and Y, which in turn may be considered as strong evidence against H0.

The rest of the paper is organised as follows. We develop an estimate of the mutual information based test statistic using kernel density estimates and study its properties in Section 2. We also provide the asymptotic distribution of the mutual information statistic under certain regularity conditions, and describe the test procedure in greater detail. In Section 3 we compare the MI test with the previously mentioned existing tests through simulation. The passport data and our experiments to explore the information leaks are discussed in Section 4. Section 5 concludes.

2 Mutual Information and the Test Statistic

In this section we introduce an estimate of the mutual information statistic between one discrete and one continuous random variable, describe its asymptotic distribution under some regularity conditions when the two variables are independent, and finally discuss the construction of the critical region of the MI-based test introduced in Section 1.

2.1 Estimation of Mutual Information

An estimate of the MI between X and Y, IXY, can be used as a check for dependence by testing whether IXY is significantly different from zero. Many competing estimates exist; for some examples see Moddemeijer (1989), Paninski (2003) and Antos and Kontoyiannis (2001). A popular choice is the "plug-in" estimate (Antos and Kontoyiannis 2001). It is obtained by substituting suitable density estimates into (2):
$$\hat{I}_{XY} = \iint_{(x,y):\,\hat{p}_{XY}(x,y)>0} \log\left(\frac{\hat{p}_{XY}(x,y)}{\hat{p}_X(x)\,\hat{p}_Y(y)}\right) \hat{p}_{XY}(x,y)\, d\mu(x,y). \qquad (6)$$

In the hybrid situation, (6) reduces to
$$\hat{I}_{XY} = \int_{y:\,\hat{p}(0,y)>0} \log\left(\frac{\hat{p}(0,y)}{\hat{p}_0\,\hat{p}(y)}\right) \hat{p}(0,y)\, dy + \int_{y:\,\hat{p}(1,y)>0} \log\left(\frac{\hat{p}(1,y)}{\hat{p}_1\,\hat{p}(y)}\right) \hat{p}(1,y)\, dy. \qquad (7)$$

The obvious choice for p̂1 is the sample proportion of 1's, i.e.
$$\hat{p}_1 = \frac{1}{N}\sum_{i=1}^{N} \chi_{\{X_i=1\}},$$
where χB is the indicator function of a set B. The continuous density estimates may be obtained using a suitable kernel K as follows:
$$\hat{p}(y) = \frac{1}{Nh_N}\sum_{i=1}^{N} K\left(\frac{Y_i - y}{h_N}\right); \qquad \hat{p}(1,y) = \frac{1}{Nh_N}\sum_{i=1}^{N} K\left(\frac{Y_i - y}{h_N}\right)\chi_{\{X_i=1\}}, \qquad \hat{p}(0,y) = \hat{p}(y) - \hat{p}(1,y), \qquad (8)$$
where hN is an appropriately chosen bandwidth. We discuss the choice of kernels and bandwidths in Section 2.2.
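As a concrete illustration, the following is a minimal Python sketch of the plug-in estimate (7) with the kernel estimates (8). The function names, the use of the Epanechnikov kernel, the default rule-of-thumb bandwidth (anticipating (11) in Section 2.3) and the grid-based trapezoidal integration are our own implementation choices and are not prescribed by the derivation above.

```python
import numpy as np

def epanechnikov(u):
    """Epanechnikov kernel, supported on [-1, 1]."""
    return np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u ** 2), 0.0)

def mi_hybrid(x, y, h=None, grid_size=2048):
    """Plug-in estimate (7) of the MI between binary labels x and real values y.

    The densities are estimated as in (8); if no bandwidth h is supplied, the
    rule-of-thumb choice 1.06 * SD(y) * N**(-1/5) is used.  The integrals in
    (7) are approximated on a regular grid by the trapezoidal rule; kernel,
    grid size and integration scheme are implementation choices only.
    """
    x = np.asarray(x)
    y = np.asarray(y)
    N = len(y)
    if h is None:
        h = 1.06 * y.std(ddof=1) * N ** (-1.0 / 5.0)
    p1 = x.mean()                       # \hat p_1: sample proportion of 1s
    p0 = 1.0 - p1
    grid = np.linspace(y.min() - h, y.max() + h, grid_size)
    K = epanechnikov((y[None, :] - grid[:, None]) / h) / (N * h)  # grid x obs
    p_y = K.sum(axis=1)                 # \hat p(y)
    p_1y = (K * (x == 1)).sum(axis=1)   # \hat p(1, y)
    p_0y = p_y - p_1y                   # \hat p(0, y)
    mi = 0.0
    for p_jy, p_j in ((p_0y, p0), (p_1y, p1)):
        mask = p_jy > 0
        integrand = p_jy[mask] * np.log2(p_jy[mask] / (p_j * p_y[mask]))
        mi += np.trapz(integrand, grid[mask])
    return mi
```

For samples of the sizes used later in the paper this brute-force gridded evaluation is fast enough, and any other consistent density estimator could be substituted without changing the rest of the procedure.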

2.2 A Distribution of the Mutual Information Statistic Under Independence

It is known that there is no universal rate at which the error in estimation of MI goes to zero, no matter what estimator we pick; see Antos and Kontoyiannis (2001) and Paninski (2003). However, a better and more positive result can be obtained under reasonable regularity conditions for a smaller class of distributions. We now derive such a result.

Let us assume that the pairs {Xi,Yi}, 1 ≤ i ≤ N are independent and identically distributed (IID) satisfying

A1. Yi’s are bounded continuous real-valued random variables with finite support;

A2. For u = 0, 1, p(u, y) has a continuous bounded second derivative in y;

A3. K has a finite support symmetric around zero, and integrates to 1.

A4. hN → 0, NhN² → ∞ and NhN⁴ → 0 as N → ∞.

The subscript N on hN will be omitted henceforth. Under the null hypothesis

H0 : X and Y are independent,     (9)

a large sample distribution for ÎXY may be given by the following theorem.

Theorem 1. Under H0 and assumptions A1-A4, $Nh^{1/2}\left(\hat{I}_{XY}/\log(e) - C_1/(Nh)\right)$ converges to a normal distribution with mean 0 and variance $C_2 = 0.5 \int \left(\int K(w)K(v + w)\, dw\right)^2 dv \int \chi_{p(y)>0}\, dy$ as N → ∞, where $C_1 = 0.5 \int K^2(v)\, dv \int \chi_{p(y)>0}\, dy$.

An outline of the proof is given in Appendix A.

One may utilize the large sample distribution of the MI statistic from Theorem 1 to test H0 against the alternative H1 : X and Y are not independent. (10)

Notice that H0 is equivalent to saying that IXY = 0, and similarly H1 is equivalent to saying that

IXY > 0. The idea of Theorem 1 is similar to Proposition 1 of Fernandes and Neri (2008), which discusses the large sample distribution of a generalised entropic measure, of which the MI is a special case, between two continuous processes in the time series setting. Similar to their result, the asymptotic distribution of the MI statistic in the hybrid situation also depends on X and Y only through the length of the support of Y. Whereas this may be considered an advantage, for Theorem 1 to apply Y needs to be bounded. However, in real-life situations this problem may not be too critical, as we rarely observe a distribution with infinite support in practice.

The presence of bias, which grows as h^{-1/2} in the hybrid setup, is one of the known problems of MI estimates. It renders the estimation of the bias essential. For simple kernels and for bounded distributions whose length of support is known, Theorem 1 can be used to estimate the bias. In more general situations, where Theorem 1 may not apply, the data-driven methods we discuss in Section 3 can be employed to estimate the bias.
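As a worked special case of the constants in Theorem 1 (this is only an evaluation of C1 for a particular kernel, assuming Y is supported on an interval of length L): for the Epanechnikov kernel K(v) = (3/4)(1 − v²) on [−1, 1], which is the kernel we use in the simulations of Section 3,
$$\int K^2(v)\, dv = \int_{-1}^{1} \tfrac{9}{16}\,(1 - v^2)^2\, dv = \tfrac{9}{16}\cdot\tfrac{16}{15} = \tfrac{3}{5}, \qquad \text{so that} \qquad C_1 = 0.5 \cdot \tfrac{3}{5} \cdot L = 0.3\,L,$$
and the bias term to be removed is approximately 0.3 L/(Nh).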

2.3 A Two Sample Test Using Mutual Information

We have already seen in Section 1 that the MI can be utilised to test whether two or more samples were obtained from the same or different distributions. We restrict our discussion in this work to

the two-sample case, but a generalization can easily be achieved. Let us now describe the test procedure in more detail. As set out in Section 1, we are going to discuss the method to construct a test statistic when the two samples are from continuous distributions, but a test when they are from discrete distributions can also be obtained; for motivation see Chatzikokolakis, Chothia and Guha (2010).

Suppose we have n independent observations Y01, Y02, ···, Y0n from a distribution F0 and further m independent observations Y11, Y12, ···, Y1m from a possibly different distribution F1. Let us write Y0 := {Y01, Y02, ···, Y0n} and Y1 := {Y11, Y12, ···, Y1m}. We want to test the null hypothesis H0: F0 = F1 against the alternative H1: F0 ≠ F1. We further combine Y = (Y0, Y1) and create a 0-1 valued vector X, whose j-th element, Xj, is 0 or 1 according to whether the jth element of Y, Yj, is from Y0 or Y1, i.e. X is a vector of length N := (n + m) with n zeroes followed by m 1s.

The test statistic to be used is ÎXY, as described in (7). Note that the choice of bandwidth is an issue. The optimum choice of bandwidth for a kernel density estimate is well studied, and a number of options are available to suit different situations; see Silverman (1986) and Sheather and Jones (1991). Following Silverman (1986), to expedite the computation process we choose a "rule-of-thumb" optimal bandwidth, which is best suited when the original distribution is Gaussian but is also known to work well for distributions which are not heavily skewed:
$$h_{OPT} = 1.06\, SD(Y)\, n^{-1/5}, \qquad (11)$$
where SD(Y) is the standard deviation of Y. Although this choice works fairly well in the situations we encounter during the simulations and the data analysis, an optimal choice of bandwidth for mutual information estimates remains a subject of future study.

The asymptotic normality of ÎXY under H0: F0 = F1 and assumptions A1-A4 could be utilized to construct a critical region for the MI test. However, the normal approximation often does not work very well for the MI estimates except for large samples, a phenomenon also observed by Fernandes and Neri (2008). Moreover, the assumption of finite support of Y is restrictive, and requires the estimation of the length of the support from the sample. A more robust procedure is to use bootstrap techniques to obtain the critical region. An advantage is that this technique also works when the samples are not necessarily from distributions with finite support, and hence is less restrictive.

A bootstrap-based critical region for the above-mentioned test at level of significance α, when comparing the two samples Y0 and Y1 of sizes n and m respectively, can be obtained through the following steps:

1. Combine Y0 and Y1 in one single sample which we denote by Y.

2. Simulate a random sample of size N := m + n, say X1, from the Bernoulli distribution with p1 = m/N, and compute its mutual information with Y. Let us denote the value of the estimated mutual information by I1. We repeat this step a large number of times, say K, and obtain Ij, j = 1, 2, ···, K.

3. Use the 100(1 − α)th percentile of the sampling distribution of I1, ··· ,IK as the cut-off for the test with level α.

Alternatively, an estimated p-value of the test statistic can be reported when a test sample is available, which may be computed as the proportion of I1, ··· ,IK exceeding the observed MI for the test sample, say I.
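To make the procedure concrete, the following is a minimal Python sketch of steps 1-3 and of the p-value computation just described, reusing the mi_hybrid estimator sketched in Section 2.1; the function name, the argument defaults and the random-number handling are our own choices and not part of the procedure itself.

```python
import numpy as np

def mi_two_sample_test(y0, y1, n_boot=10_000, alpha=0.05, rng=None):
    """Bootstrap-calibrated MI two-sample test (steps 1-3 and the p-value).

    y0, y1 : the two observed samples; relies on mi_hybrid() from Section 2.1.
    Returns the observed MI, the level-alpha cut-off and the estimated p-value.
    """
    rng = np.random.default_rng(rng)
    y = np.concatenate([y0, y1])                       # step 1: pooled sample
    n, m = len(y0), len(y1)
    N = n + m
    # Step 2: MI of the pooled sample against random Bernoulli(m/N) labels.
    null_mi = np.array([
        mi_hybrid(rng.binomial(1, m / N, size=N), y) for _ in range(n_boot)
    ])
    cutoff = np.quantile(null_mi, 1.0 - alpha)         # step 3: cut-off
    x_obs = np.concatenate([np.zeros(n), np.ones(m)])  # n zeroes then m ones
    mi_obs = mi_hybrid(x_obs, y)
    p_value = np.mean(null_mi >= mi_obs)               # proportion exceeding the observed MI
    return mi_obs, cutoff, p_value
```

With K = 10,000 bootstrap replications, as used later in the paper, the dominant cost is the repeated kernel density estimation.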

3 Some Simulations

We now compare the power of the MI test to the power of some conventional tests, namely the KS test, the CVM test, the AD test and the BWS test, at the 5% level of significance.
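As an aside for readers who wish to reproduce the comparison, off-the-shelf implementations of these benchmark tests exist; the sketch below assumes a recent version of SciPy (cramervonmises_2samp and bws_test are only available in newer releases, and the Anderson-Darling significance level returned by SciPy is capped, so it is only indicative). The calibration we actually use for the CVM and BWS cut-offs is described later in this section.

```python
from scipy import stats

def reference_test_pvalues(y0, y1):
    """Approximate p-values of the benchmark two-sample tests.

    Availability of these functions depends on the SciPy release; this is a
    convenience for reproduction, not the calibration used in this paper.
    """
    return {
        "KS": stats.ks_2samp(y0, y1).pvalue,
        "CVM": stats.cramervonmises_2samp(y0, y1).pvalue,
        # anderson_ksamp reports a capped significance level, not an exact p-value.
        "AD": stats.anderson_ksamp([y0, y1]).significance_level,
        "BWS": stats.bws_test(y0, y1).pvalue,
    }
```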

To obtain the power of the MI test when n samples from the distribution F0 and m samples from the distribution F1 are to be compared, we use an Epanechnikov-kernel-based estimate with the bandwidth chosen according to (11) to obtain the MI test statistic, and use the following algorithm to compute the power of the MI test:

1. Obtain samples Y0 of length n from F0 and Y1 of length m from F1. Combine them in one single vector, which we denote by Y.

2. We simulate a random sample of size N := m + n, say X1, from the Bernoulli distribution with P(1) = m/N and compute its mutual information with Y. Let us denote the value of the estimated mutual information by I1. We repeat this step 10,000 times, and obtain Ij, j = 1, 2, ···, 10,000.

3. We use the 100(1 − α)th percentile of the sampling distribution of I1, ···, I10,000 as the cut-off to be used for rejection for the test with level α.

4. We now again simulate samples Y0 of length n from F0 and Y1 of length m from F1, and again combine them in one single sample, which we denote by Y.

5. Define a 0-1 valued vector X, whose j-th element, Xj, is 0 or 1 according to whether the jth element of Y, Yj, is from Y0 or Y1; i.e. X is a vector of length N := (n + m) with n zeroes followed by m 1s. Compute the MI estimate between the Y obtained in step 4 and this X. H0 is rejected if the obtained mutual information estimate is greater than the cut-off obtained in step 3.

6. Repeat steps 4 and 5 1000 times to estimate the power of the test, given by the proportion of rejections of H0.
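A compact sketch of this power calculation, built on top of the mi_two_sample_test sketch from Section 2.3, is given below; the callable-based interface for drawing fresh samples from F0 and F1 is purely illustrative.

```python
import numpy as np

def mi_test_power(sample0, sample1, n_rep=1000, n_boot=10_000, alpha=0.05, rng=None):
    """Estimate the power of the MI test following steps 1-6 above.

    sample0, sample1 : zero-argument callables returning fresh draws from F0
    and F1, e.g. lambda: rng.normal(0.0, 1.0, 100).
    """
    rng = np.random.default_rng(rng)
    # Steps 1-3: calibrate the cut-off once from a pooled sample.
    y0, y1 = sample0(), sample1()
    _, cutoff, _ = mi_two_sample_test(y0, y1, n_boot=n_boot, alpha=alpha, rng=rng)
    # Steps 4-6: draw fresh pairs of samples and count rejections.
    rejections = 0
    for _ in range(n_rep):
        y0, y1 = sample0(), sample1()
        y = np.concatenate([y0, y1])
        x = np.concatenate([np.zeros(len(y0)), np.ones(len(y1))])
        if mi_hybrid(x, y) > cutoff:
            rejections += 1
    return rejections / n_rep
```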

Following the methods described in Xiao et al. (2007) and Baumgartner, Weiß and Schindler (1998), the p-values and the 5%-level cut-offs of the CVM and the BWS tests are estimated using the appropriate quantiles of the sampling distributions of the said test statistics for the relevant values of m and n, based on samples of sizes m and n from the standard normal distribution. The sampling distribution is computed based on 10000 samples. For the KS and AD tests we use the asymptotic properties of the test statistics, following Conover (1971) and Scholz and Stephens (1987) respectively. To estimate the power of the tests we compare the percentage of rejections of H0 by the tests based on 1000 random pairs of samples with m = n = 100 and 500. When the sample sizes grow larger, all the tests start performing well; we have also explored the power of the tests when the samples are of size 2500, and as all the tests perform very well for that size, we omit the larger samples from our discussion for the sake of brevity. It may be noted here that the proposed MI test statistic is based on density estimates, so it should not be used for very small sample sizes.

We already know about the existence of a number of efficient tests of location. When comparing two normal distributions with equal variances, the UMPU test is the two-sample t-test (Lehmann and Romano 2010). However, the main goal of our work, as will be further explained in the next section, is to obtain a test for when the two samples are matched for their location. Hence, during the simulation exercise we primarily concentrate on comparing the power of the tests in situations where the first moments of the distributions are matched.

Towards that, we compare samples from the standard normal distribution with samples from normal distributions with zero mean and variance increasing away from 1. The results are summarized in Figure 1: the MI test clearly appears to be the most powerful, followed closely by the AD and the BWS tests.

Figure 1: Plot of the estimated power of various tests based on 1000 replications comparing N(0, 1) samples with N(0, σ²) samples. Panels: (a) sample size m = n = 100; (b) sample size m = n = 500.

Figure 2: Plot of the estimated power of various tests based on 1000 replications comparing N(0, 1) samples with samples from t distributions with various degrees of freedom. Panels: (a) sample size m = n = 100; (b) sample size m = n = 500.

As examples of distributions where the support is finite, so that Theorem 1 holds and the MI-based test statistic works well, we compare the uniform distribution on (−1, 1) with other symmetric uniform distributions on intervals (−a, a), and also the uniform distribution on the unit interval with uniform distributions on intervals (0, a), for values of a gradually increasing from 1. The results are shown in Figures 3 and 4 respectively. The MI test once again appears the most powerful.

Figure 3: Plot of the estimated power of various tests based on 1000 replications comparing U(−1, 1) samples with samples from U(−a, a) distributions for different values of a. Panels: (a) sample size m = n = 100; (b) sample size m = n = 500.

Figure 4: Plot of the estimated power of various tests based on 1000 replications comparing U(0, 1) samples with samples from U(0, a) distributions for different values of a. Panels: (a) sample size m = n = 100; (b) sample size m = n = 500.

It is of interest to observe the discriminating power of the tests when a sequence of distributions is compared with its eventual limit. Towards that, we compare samples from the standard normal distribution with samples from t-distributions with gradually increasing degrees of freedom. The results are presented in Figure 2. The MI test exhibits superior discriminating power when comparing the standard normal distribution with t-distributions with different degrees of freedom. Whereas for the other tests the power decreases to around 0.05 very quickly, the power of the MI test decreases much more slowly; for example, when tested on the basis of 500 samples, the MI test is the only test with any discriminating power between the standard normal and the t distribution with 20 degrees of freedom. Finally, for a further comparison of two distributions with different shapes but equal means and variances, we compare the N(0, 1/12) distribution with the uniform distribution on the interval (−0.5, 0.5) for various sample sizes.

Figure 5: Plot of the estimated power of various tests based on 1000 replications comparing N(0, 1/12) samples with samples from the U(−0.5, 0.5) distribution for different sample sizes.

The results are presented in Figure 5. Similar to the previous cases, the MI test is again the most powerful, followed by the BWS test and the AD test, whereas the KS and the CVM tests do not perform very well.

4 Analysis of Passport Information Leakage

4.1 e-Passports

In this section we use the MI to analyze the security of the radio frequency identification (RFID) chip embedded in e-passports. These chips broadcast the information printed on the passport and a JPEG copy of the picture, with the aim of making immigration controls faster and more secure. These chips may also optionally include fingerprints, iris scans and additional personal information. e-Passports are specified by the International Civil Aviation Organisation (ICAO) and most countries implement their own version.

Read access to the data on the RFID chip is protected by a cryptographic key based on the date of birth of the passport holder, the date of expiry of the passport and the passport number. The idea behind this is to allow read access to anyone who has physical access to the passport, but to stop covert "skimming" of the data without the owner's knowledge. These passports also aim to be untraceable: if you do not know the passport's cryptographic key, it should be impossible to distinguish one passport from any other, and in particular, it should be impossible to recognize a passport that you have seen before from the radio messages it transmits.

A reader that knows the date of birth, date of expiry and passport number can use these to generate an encryption key and a message authentication code (MAC) key (which is used for error checking). Both these keys are unique to the individual passport. The reader and the passport then exchange a number of messages that let them prove to each other that they both know the cryptographic keys.

The reader powers up the passport, and the passport then sends a random number back to the reader. The reader then generates its own random number and encrypts both random numbers using the passport's encryption key. The MAC key is used to generate a short error checking code for the encrypted numbers, and then both the encrypted numbers and the error checking code are sent to the passport. The passport uses its unique MAC key to check that it has received the message containing the encrypted numbers correctly. If this check is passed, the passport then decrypts the message and checks that it contains the random number it sent to the reader. This proves to the passport that the reader really knows the passport's unique cryptographic key and is not, for instance, replaying an old message. The passport then encrypts the reader's random number and sends it back to the reader, with another MAC error checking code. Once the reader successfully receives its own number back, it can be sure that it is communicating with the passport (or, at least, a device that knows the passport's key). After this exchange of messages the passport will allow the reader access to the information stored on the chip.

4.1.1 Tracing e-Passports

To an outside observer, all of the messages exchanged by the passport and reader appear to be completely random. If an attacker tried to record a message and replay it to the passport during a later session, the message would be rejected, because the random numbers would not match. However, while investigating actual passports, we found that there was a way to identify a passport without knowing the passport's cryptographic key. To be able to trace a passport, attackers must first observe an exchange between the passport they wish to trace and a legitimate reader. While doing so, they must record the message from the reader that includes the passport's random number and the error checking code produced using its unique MAC key.

When the attacker comes across a passport later and wants to remotely check whether it is the same passport as before, it starts a new run of the protocol. The passport sends the attacker a random number, and the attacker then replays the message it previously recorded. There are now two possibilities. First, it may be a different passport than before; in this case the check of the MAC fails, because each passport uses a different MAC key, and the passport sends an error message. Second, it might be the same passport again. In this case the check of the MAC succeeds, and the message is then decrypted. However, the random number in this message would be from the old session, so it would not match the random number the passport expects, and the passport would stop the exchange and send an error message. In both cases the replayed message is rejected and the attacker is denied access to the data on the chip.

However, when we experimented with actual passports we observed that it took longer for a passport to reject the replayed message in the second case, i.e., when it was the passport we were trying to trace. This is because the message uses the passport's unique MAC key, so the MAC check succeeds and the passport has to go on to do the decryption. On the other hand, if it was a different passport, the MAC check would have failed and the message would have been rejected sooner. This difference in response times can be used to detect particular passports. While the range of the RFID chips is limited, it would certainly be possible to, for instance, build a device that would sit next to a doorway and remotely detect when certain particular people entered or left a building.

When the existence of this attack was announced at a computer science conference (Chothia and Smirnov 2010), it gained some media attention; see Goodin (2010). In this paper, we present a full analysis of the timing information and consider ways to fix this information leak.
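Purely to summarise the decision logic of this timing attack, here is an illustrative sketch; the reader object, its replay method and the fixed threshold are hypothetical, and in the rest of the paper the decision is instead made by the MI two-sample test applied to recorded response times.

```python
import time

def is_traced_passport(reader, recorded_message, threshold):
    """Illustrative sketch of the timing side channel described above.

    `reader.replay` and `threshold` are hypothetical stand-ins for the
    hardware interaction and for the statistical comparison used in the paper.
    """
    start = time.perf_counter()
    reader.replay(recorded_message)      # the passport replies with an error either way
    elapsed = time.perf_counter() - start
    # Same passport: the MAC check succeeds, decryption is attempted, and the
    # rejection therefore arrives more slowly than for a different passport.
    return elapsed > threshold
```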

4.2 Analysing e-Passports

We now apply the MI to analyze the extent to which passports from different countries can be traced, and to assess the effectiveness of some possible fixes. In this setting, we replay a message to a passport and look for any relationship between the time it takes the passport to respond and whether or not the message came from that particular passport. In terms of our setup in Section 1, X is 1 if the passport we replay the message to is the same one used in the session where the message was recorded, and X is 0 if the message did not come from this particular passport.

The continuous variable Y in this example is the time it takes to reject the message. The passport is considered to be secure if, and only if, no evidence of dependence between X and Y is present.
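In code, the analysis of a single passport then amounts to feeding the two sets of recorded rejection times into the bootstrap MI test sketched in Section 2.3; the file names below are hypothetical placeholders for the recordings described in the next subsection.

```python
import numpy as np

# Hypothetical file names holding the recorded rejection times (in seconds).
same_times = np.loadtxt("same_passport_times.txt")        # Y values with X = 1
other_times = np.loadtxt("different_passport_times.txt")  # Y values with X = 0

# Bootstrap MI test of Section 2.3: y0 = different passport, y1 = same passport.
mi_obs, cutoff, p_value = mi_two_sample_test(other_times, same_times)
print(f"MI = {mi_obs:.4f}, 5% cut-off = {cutoff:.4f}, estimated p-value = {p_value:.4f}")
```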

4.2.1 Data Collection

Each country implements its own version of the e-passport, and the time taken for a passport to communicate with a reader always has the same distribution for passports of the same nationality. We therefore tested one passport each from four different countries: Germany, Greece, Ireland and the UK. For each of these we first calculated the passport's cryptographic key from the date of birth, date of expiry and passport number. Then, using a basic RFID reader, we ran the access protocol and recorded the message we needed to replay. For the German, Greek and Irish passports we replayed the message to the passport 500 times, and then sent a random message 500 times (simulating a message from a different passport). For the British passport we replayed the message to the passport 1000 times, and then sent a random message 1000 times. We added a clock to our computer program to measure precisely the time between when the replayed message was sent and when the passport's error message was received by the reader.

4.2.2 Analyzing the Times

The response times are shown in Figure 6. Here the solid lines show the response times when the passport is the same, and the dotted lines show the times when the passport is different. In each of these cases the time difference is clear; however, there is some overlap between the times. The first two columns of Table 1 present the values of the MI test statistics computed following the methods described in Section 3 and the corresponding p-value estimates based on 10000 bootstrapped samples. The MI estimates are very near to 1 for all four passports considered, and hence it is obvious that the passports can be traced.

4.2.3 Testing a Fix for the Passport

To fix the leak in the passports, it is intuitive that if the passport goes into decryption after the MAC check regardless of the result of the check, then the information leak discussed in Section 4.2.1 may be blocked. However, looking at the plots in Figure 6, it may seem that a quick fix can be achieved by applying an artificial "time-padding" to the response time (i.e. applying a shift of location) when the passport does not go into the decryption stage.

Figure 6: Response times for replaying a message to passports. Panels: (a) UK passport on reader; (b) Greek passport on reader; (c) Irish passport on reader; (d) German passport on reader. In each panel the estimated densities of the rejection times are compared for the original passport (solid lines) and a different passport (dotted lines).

To examine this idea, we experimented with "time padding" by various constants, including the difference of means and the difference of medians of the response times for the same and the different passports. Adding the difference of medians seemed to work best in terms of reducing the MI estimates.
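A sketch of this median-based padding experiment, continuing the hypothetical variable names of the snippet in Section 4.2, is given below; the padding is added to the response times of the case in which the passport does not reach the decryption stage.

```python
import numpy as np

# Shift the faster rejections (different passport, no decryption) by the
# difference of medians, then re-run the bootstrap MI test on the padded data.
pad = np.median(same_times) - np.median(other_times)
mi_fix, cutoff_fix, p_fix = mi_two_sample_test(other_times + pad, same_times)
```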

Nationality    MI before padding    p-value    MI after padding    p-value
British        0.9542736            0          0.09446402          0
Irish          0.9999755            0          0.04872853          0
Greek          0.9795026            0          0.01775579          0.075
German         0.983794             0          0.03101871          0

Table 1: A comparison of the mutual information estimates obtained for different passports before and after applying the time padding based on difference of medians.

Nationality    MI test    KS test        CVM test    AD test    BWS test
British        0          3×10^{-12}     0           0          0
Irish          0          8×10^{-10}     0           0          0
Greek          0.075      0.718          0.544       0.3671     0.408
German         0          0.2574316      0.7425      0.3017     0.2705

Table 2: A comparison of the p-values of different test statistics obtained for different passports after applying the time padding based on the difference of medians.

The MI test statistics after padding are presented in the third column of Table 1. All the MI values show a significant reduction, and hence the fix may seem to be working. However, the estimated p-values presented in the last column of Table 1 tell a different story: only for the Greek passport does the p-value increase from 0, and hence the problem is not solved for any of the other three passports. For the Greek passport, though, at a 5% level of significance it may be concluded that it has been "fixed". Table 2 compares the p-values of the other nonparametric tests discussed in this paper for the fix applied to the various passports. They agree with the conclusion of the MI test for the British, Irish and Greek passports. However, every other test fails to detect a leak in the fixed German passport, which clearly demonstrates the superior sensitivity of the MI test and justifies its use in this situation.

The case of the Greek passport also needs further discussion. All the tests agree that the "fix" works for the Greek passport, at least at a 5% level of significance. However, the p-value of the MI test is only 0.075, which indicates that there might still be some difference. Hence, it seems that overall this fix may not be very efficient, although it does seem at least to work partially, and at least to reduce the chance of the detection of a leak significantly. However, to devise a completely leak-proof passport, a better fix is obviously required.

5 Discussion

In this work we proposed an MI-based two-sample test for samples from continuous distributions. We discussed a kernel-density-based estimate of the test statistic and provided an asymptotic distribution of the test statistic, under some restrictive conditions, when the two samples are from the same distribution. It was established through simulations that the test works well even when some of the conditions of the asymptotic result, most notably the condition that the support of the distributions be finite, are violated.

Figure 7: Comparing the response times after padding the reply times for the Greek and the German passports by the difference of the median response times. Panels: (a) Greek passport on reader; (b) German passport on reader.

We presented some simulation-based comparisons with other tests in various situations, and the MI test appeared superior in terms of power when comparing samples varying in scale, and also for samples from different distributions with identical location and scale parameters. Finally, we justified its use in the present analysis by demonstrating an example where it found a leak in a "fixed" e-passport where other tests failed. In the examples we mention above, the cases of the German and the Greek passports, for which all other tests failed to show a leak but a visual inspection made us suspicious otherwise (see Figure 7), were the main motivation to think of a more sensitive test. This led to our idea of the MI-based test, which supported our suspicion that there still was a substantial leak.

Whereas the use of MI is quite popular in various areas of the applied sciences, as far as we know our application of this statistic to a two-sample test is new. A point to be noted here is that we do not claim that our test is the best in all situations. For example, if the object of interest is a test of location, one might as well use the t-test, which is UMPU for normal distributions, is quite robust in other situations, and is by far the simplest to understand and apply. Our interest is mostly in detecting differences beyond a difference in location, and as we have established here through simulations and examples, the MI test has superior power in a variety of such situations. Similarly, dedicated tests of scale, e.g. the test proposed by Levene (1960), may work better than general tests in specific situations, but may not be of any use in some other situations, for example in the problem of improving a passport that we discuss in Section 4. That is why all our studies have been dedicated to comparing tests of a more general nature.

It should be noted that although Theorem 1 only applies to distributions with bounded support, the simulations in Section 3 establish that the MI test works fairly well in situations with unbounded supports as well. We hope to explore these more general situations in future work.

Perhaps the best feature of the MI test is that we can extend this procedure quite naturally to compare k (≥ 2) samples, checking whether all of them are from the same distribution by testing whether the MI between a combined sample and an index vector with a different indicator for each sample is 0 or not. The expression for the MI will be obtained by a simple extension of (3) to the case where X can take values 0, 1, ···, k − 1. We have taken this up in a parallel ongoing work. It should be noted here that the other two-sample tests discussed here also have extensions to multi-sample cases; see Zhang and Wu (2007) for a detailed discussion. However, those extensions depend on a choice of weight functions, and the optimum combination of weights is not obvious. Such issues do not arise with MI-based tests.

A notable difference of the MI test from the well-known non-parametric two-sample tests is that it is not a rank-based test. Whereas the rank-based tests have many advantages, their problems with ties are well documented and often require special treatment. The problems with ties are usually more severe for discrete data, although they may also arise with larger samples from continuous distributions when the data values are rounded off, as is often the case in real life. Some versions of the rank-based tests do exist for discrete data in special cases; for example, see Scholz and Stephens (1987) for the AD test. However, their application is not easy and often suffers from loss of power. The MI has a more obvious extension to the discrete situation, as it can be estimated based on the frequencies of the different values of the discrete variables, and hence ties do not impact the performance of the test statistic computed based on MI. See Chatzikokolakis, Chothia and Guha (2010) and Biswas and Guha (2009) for some examples of usage of the MI in the discrete situation.

Finally, setting $h = N^{-(1/4+\delta)}$ for a small δ, 0 < δ < 1/4, a rate of convergence of $N^{-(3/4-\delta)}$ for the MI estimate (to 0) can be achieved, which is superior to the best rate for the class of estimates discussed by Stone (1980). Hence, it may be concluded that despite being biased, the MI estimate is an efficient one due to its superior rate of convergence, and hence the performance of the MI test statistic may also be expected to improve quickly with larger sample sizes compared to other non-parametric tests.

References

Alvim, M., Andrés, M. & Palamidessi, C. (2010), "Entropy and Attack Models in Information Flow", Theoretical Computer Science, IFIP Advances in Information and Communication Technology, 323, 53–54.

Anderson, T. W. (1962), "On the Distribution of the Two-Sample Cramér-von Mises Criterion", Annals of Mathematical Statistics 33, 1148–1159.

Antos, A. & Kontoyiannis, Y. (2001), “Convergence properties of functional estimates for discrete distributions”, Random Structures & Algorithms 19, 163–193.

Biswas, A. & Guha, A. (2009), “Time series analysis of categorical data on infant sleep status using auto-mutual information”, Journal of Statistical Planning and Inference 139, 3076–3087.

Biswas, A. & Guha, A. (2010), “Time series analysis of hybrid neurophysiological data and appli- cation of mutual information”, Journal of Computational Neuroscience 29, 35–47.

Brillinger, D. R. (2004), "Some data analysis using mutual information", Brazilian Journal of Probability and Statistics 18, 163–183.

Brillinger, D. R. & Guha, A. (2007), “Mutual information in the frequency domain”, Journal of Statistical Planning and Inference 137, 1074–1086.

Baumgartner, W., Weiß, P. & Schindler, H. (1998), “A Nonparametric Test for the General Two- Sample Problem”, Biometrics 54, 1129–1135.

Bosq, D. (1996), Nonparametric Statistics for Stochastic Processes, Springer-Verlag, New York.

Chatzikokolakis, K., Palamidessi, C. & Panangaden, P. (2008), "Anonymity protocols as noisy channels", Information and Computation 206, 378–401.

Chatzikokolakis, K., Chothia, T. & Guha, A. (2010), "Statistical Measurement of Information Leakage", Proceedings of TACAS 2010, 390–404.

Chothia, T. & Smirnov V. (2010), “A traceability attack against e-passports”, Proceedings of the 14th International Conference on Financial Cryptography and Data Security, Springer, LNCS 6052.

Conover, W. J. (1971), Practical Nonparametric Statistics, John Wiley & Sons, New York.

Cover, T. M. & Thomas, J. A., (1991), Elements of Information Theory, Wiley, New York.

Fernandes, M. & Neri, B. (2008), "Nonparametric entropy-based tests of independence between stochastic processes", Econometric Reviews, 276–306.

Gibbons, J. D. & Chakraborti, S. (2010), Nonparametric Statistical Inference, 5th edition, Chap- man and Hall, London.

Goodin, D. (2010), "Defects in e-passports allow real-time tracking", The Register, www.theregister.co.uk/2010/01/26/epassport_rfid_weakness/

Guha, A. (2005), Analysis of Dependence Structures of Hybrid Stochastic Processes Using Mutual Information, Ph.D. Thesis, University of California, Berkeley.

Hall, P. & Heyde, C. C. (1980), Martingale Limit Theory and Its Application, Academic Press, San Francisco.

Jeske, D. R., Lockhart, R. A., Stephens, M. A. & Zhang, Q. (2008), "Cramér-von Mises tests for the compatibility of two software operating environments", Technometrics 50, 53–63.

Lehmann, E. L. & Romano, J. P. (2010), Testing Statistical Hypotheses, 3rd edition, Springer, New York.

Levene, H. (1960), “Robust tests for equality of variances.” In: Contributions to Probability and Statistics. Stanford University Press, Stanford. 278-292.

Moddemeijer, R. (1989), "On estimation of entropy and mutual information of continuous distributions", Signal Processing 16, 233–248.

Neuhäuser, M. (2005), "Exact tests based on the Baumgartner-Weiss-Schindler statistic - a survey", Statistical Papers, 46, 1–29.

Paninski, L. (2003), “Estimation of entropy and mutual information”, Neural Computation 15, 1191–1253.

Pettitt, A. N. (1976), “A two-sample Anderson-Darling rank statistic”, Biometrika 63, 161–168.

Scholz, F. W. & Stephens, M. A. (1987), “K-sample Anderson-Darling Tests”, Journal of the American Statistical Association 82, 918-924.

Shannon, C. E. (1948), “A mathematical theory of communication”, Bell System Technical Journal 27, 379–423 & 623–656.

Sheather, S. J. & Jones M. C. (1991), “A reliable data-based bandwidth selection method for kernel density estimation”, Journal of the Royal Statistical Society B, 53, 683–690.

Silverman, B. W. (1986), Density estimation for statistics and data analysis, Chapman and Hall, London.

Stone, C. J. (1980), “Optimal rates of convergence for nonparametric estimators”, Annals of Statistics 8, 1348–1360.

Wald, A. & Wolfowitz, J. (1940), "On a test whether two samples are from the same population", Annals of Mathematical Statistics 11, 147–162.

Wylupek, G. (2010), “Data-driven k-sample tests”. Technometrics 52, 107–123.

Xiao, Y., Gordon A. & Yakovlev, A. (2007), “A C++ program for the Cram´er-von Mises two sample test”, Journal of Statistical Software 17.

Zhang, J. (2006), “Powerful two sample tests based on the likelihood ratio”, Technometrics 48, 95– 103.

Zhang, J. & Wu, Y. (2007), "k-sample tests based on the likelihood ratio", Computational Statistics & Data Analysis 51, 4682–4691.

A Proof of Theorem 1

We now provide a brief outline of the proof of Theorem 1. The details are similar to Guha (2005) and are provided as supplementary materials. Firstly, note that
$$\hat{I}_{XY}/\log(e) = \int_{y:\,\hat{p}(0,y)>0} \ln\left(\frac{\hat{p}(0,y)}{\hat{p}_0\,\hat{p}(y)}\right) \hat{p}(0,y)\, dy + \int_{y:\,\hat{p}(1,y)>0} \ln\left(\frac{\hat{p}(1,y)}{\hat{p}_1\,\hat{p}(y)}\right) \hat{p}(1,y)\, dy,$$
where ÎXY is as in (6), and ln(x) = loge(x). For simplification of notation, denote
$$K_{hi}(y) = K\left(\frac{Y_i - y}{h}\right); \qquad \chi_{ij} = \chi_{\{X_i=j\}}; \qquad \chi'_{ij} = \chi_{ij} - p_j; \qquad j = 0, 1. \qquad (12)$$

Now using arguments similar to Fernandes and Neri (2008), it can be shown that when assumption A2 holds and H0 is true, then
$$\hat{I}_{XY} = \frac{1}{2}\int\left(\frac{f^2(0,y)}{p(0,y)} + \frac{f^2(1,y)}{p(1,y)} - \frac{f^2(y)}{p(y)}\right) dy + o_p\!\left(\frac{1}{Nh^{1/2}}\right), \qquad (13)$$
where
$$f(y) = \hat{p}(y) - p(y); \qquad f(j, y) = \hat{p}(j, y) - p(j, y); \qquad f_j = \hat{p}_j - p_j \quad \text{for } j = 0, 1. \qquad (14)$$

When H0 is true, the first expression on the right hand side of (13) can be broken into a sum of two parts as
$$\frac{1}{2(Nh)^2 p_1 p_0}\,(T_1 + T_2), \qquad (15)$$
where
$$T_1 = \sum_{i=1}^{N} \chi'^{\,2}_{i1} \int \frac{K^2_{hi}(y)}{p(y)}\, dy; \qquad T_2 = 2\sum_{i=1}^{N}\sum_{j<i} \chi'_{i1}\chi'_{j1} \int \frac{K_{hi}(y)K_{hj}(y)}{p(y)}\, dy. \qquad (16)$$

Under H0 it can be shown that
$$E(T_1) = E\left(\sum_{i=1}^{N} \chi'^{\,2}_{i1} \int \frac{K^2_{hi}(y)}{p(y)}\, dy\right) = Nh\, p_1 p_0 \int_{p(y)>0} dy \int K^2(u)\, du + o(Nh),$$
$$\mathrm{Var}(T_1) = \mathrm{Var}\left(\sum_{i=1}^{N} \chi'^{\,2}_{i1} \int \frac{K^2_{hi}(y)}{p(y)}\, dy\right) = O(Nh); \qquad (17)$$
$$E(T_2) = 0; \qquad \mathrm{Var}(T_2) = 2N^2h^3 \int \left(\int K(w)K(u+w)\, dw\right)^2 du\, (p_1p_0)^2 + o(N^2h^3).$$

From the above it follows that
$$E\left(\frac{1}{2p_1p_0}\int \frac{\left(\frac{1}{Nh}\sum_i K_{hi}(y)\chi'_{i1}\right)^2}{p(y)}\, dy\right) \approx \frac{C_1}{Nh}; \qquad \mathrm{Var}\left(\frac{1}{2p_1p_0}\int \frac{\left(\frac{1}{Nh}\sum_i K_{hi}(y)\chi'_{i1}\right)^2}{p(y)}\, dy\right) \approx \frac{C_2}{N^2h}.$$

Let us next define
$$\kappa_{h;ij} = \int \frac{1}{p(y)}\, K_{hi}(y)K_{hj}(y)\, dy, \qquad (18)$$
so that $T_2 = 2\sum_{i=1}^{N}\sum_{j<i} \chi'_{i1}\chi'_{j1}\kappa_{h;ij}$, which may be written as $2\sum_{i=1}^{N} Z_i$ for suitably normalized martingale differences $Z_i$.

To prove Theorem 1, it is now enough to show that
$$\sum_{i=1}^{N} Z_i \Rightarrow N\!\left(0;\, C_2\, p_0^2 p_1^2\right) \qquad (20)$$
if assumptions A1-A4 are true. This result follows by an application of Theorem 3.2 from Hall and Heyde (1980).
