IMPROVEMENT OF NIST STATISTICAL TESTS

EMIL SIMION1

In this paper we present a series of improvements to the decision process on randomness proposed by the US National Institute of Standardization in the guideline SP 800-22, by computing the probability of accepting a false hypothesis.

In this paper we propose an improvement of the decision regarding randomness proposed by the National Institute of Standards and Technology (NIST) in the guideline Statistical Test Suite (STS) Special Publication (SP) 800-22, by computing the second order error (the probability of accepting a false hypothesis).

Key words: cryptographic statistical testing, NIST SP 800-22.

1. Introduction

We need to develop statistical tools for testing the degree of randomness of binary sequences used for cryptographic applications. Such tools exist in the field of public cryptography:

i) Donald Knuth's book [1], The Art of Computer Programming, Seminumerical Algorithms, describes several empirical tests, including the frequency, serial, gap, poker, coupon collector's, permutation, run, maximum-of-t, collision, birthday spacings, and serial correlation tests;

ii) the Crypt-XS suite of statistical tests was developed by researchers at the Information Security Research Centre at Queensland University of Technology in Australia; Crypt-XS tests include the frequency, binary derivative, change point, runs, sequence complexity and linear complexity tests;

iii) the DIEHARD suite of statistical tests developed by George Marsaglia consists of fifteen tests, namely: birthday spacings, overlapping permutations, ranks of 31 x 31 and 32 x 32 matrices, ranks of 6 x 8 matrices, monkey tests on 20-bit words, monkey tests OPSO (Overlapping-Pairs-Sparse-Occupancy), OQSO (Overlapping-Quadruples-Sparse-Occupancy), DNA, count the 1's in a stream of bytes, count the 1's in specific bytes, parking lot, minimum distance, random spheres, squeeze, overlapping sums, runs, and craps; additional information may be found in G. Marsaglia [2];

iv) NIST 800-22 [3] is a publication of sixteen statistical tests, which can be found at the Internet page of the Computer Security Resource Center [4], along with an implementation of this tool.

We must remark that NIST 800-22 was one of the cryptographic tools involved in the evaluation of the candidates for the Advanced Encryption Standard (FIPS PUB 197). In this paper we propose the extension of the National Institute of Standards and Technology (NIST) Statistical Test Suite (STS) Special Publication (SP) 800-22 by computing the second order error (the probability of accepting a false hypothesis).

The paper is organized as follows. In section 2 we present the concept of statistical testing and the rules for deciding on the randomness of a binary sequence. Section 3 focuses on the NIST decision procedures about randomness and some extensions of test interpretation. These extensions concern the computation of the probability of accepting a false hypothesis, which is exemplified in section 4, in conjunction with the presentation, in section 5, of a comparative graphical interpretation of the probability of rejecting a true hypothesis. Finally, in section 6 we conclude.

1 Associate professor, University Politehnica of Bucharest, Faculty of Applied Sciences, Romania, e-mail: [email protected]

2. Statistical testing

A statistical test provides a mechanism for making decisions, using data, about a binary sequence x ∈ {0,1}^n, which usually represents the output of a source. The aim is to decide whether there is enough evidence to "reject" a conjecture or hypothesis about the sequence. The hypothesis to be tested represents an assumption that may or may not be true and is called a statistical hypothesis. A statistical test requires a pair of hypotheses regarding the sequence to be tested: i) the null hypothesis H0 - the sequence x is produced by a binary memoryless source with Pr(X=1) = p0 and thus Pr(X=0) = 1 - p0 (in this case we say that the sequence does not present any predictable component); ii) the alternative hypothesis H1 - the sequence x is produced by a binary memoryless source with Pr(X=1) = p1 and Pr(X=0) = 1 - p1, with p0 ≠ p1 (in this case we say that the sequence presents a predictable component regarding the probability p). Two types of errors can result from a statistical test: the first order error, the risk of rejecting the null hypothesis when it is in fact true, also called the level of significance:

α = Pr(reject H0 | H0 is true) = 1 - Pr(accept H0 | H0 is true),

and the second order error, the risk of failing to reject (accepting) the null hypothesis when it is in fact false:

β = Pr(accept H0 | H0 is false) = 1 - Pr(reject H0 | H0 is false).

The values 1-α and 1-β are called the specificity2 and the power (or sensitivity3) of the test, respectively. The two errors α and β cannot be minimized simultaneously, since the risk α increases as the risk β decreases and vice versa (Neyman-Pearson tests minimize the value of β for a given α). The following table gives the relationship between the truth of the null hypothesis and the outcomes of the test.

                Real situation
Conclusion      H0 is true                    H0 is false
Reject H0       α (false positive result)     1-β (true positive result)
Accept H0       1-α (true negative result)    β (false negative result)
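To make the two risks concrete, the following sketch (illustrative, not part of SP 800-22; the source probabilities p0 = 0.5, p1 = 0.52 and the frequency-type decision rule are assumptions for the example) estimates α and β by simulating the two memoryless sources:

import math
import random
from scipy.stats import norm

random.seed(0)
n, p0, p1, alpha, trials = 1000, 0.5, 0.52, 0.01, 2000

def accepts_h0(bits):
    # Two-sided frequency-type decision at level alpha (cf. section 4).
    f = (sum(bits) - n * p0) / math.sqrt(n * p0 * (1 - p0))
    return abs(f) <= norm.ppf(1 - alpha / 2)

# Estimated first order error: rejections of H0 on sequences from the H0 source.
alpha_hat = sum(not accepts_h0([random.random() < p0 for _ in range(n)])
                for _ in range(trials)) / trials
# Estimated second order error: acceptances of H0 on sequences from the H1 source.
beta_hat = sum(accepts_h0([random.random() < p1 for _ in range(n)])
               for _ in range(trials)) / trials
print(alpha_hat, beta_hat)  # roughly 0.01, and about 0.9 for this close alternative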

There is another error that the evaluator can make during the testing process: the type III error4, which is to ask the wrong question and use the wrong null hypothesis. The analysis plan of the statistical test includes decision rules for rejecting the null hypothesis. These rules are described in two ways: i) decision based on confidence intervals: for a fixed value of α we find the confidence region for the test statistic and check whether the test statistic value lies in the confidence region; the confidence region is computed using the quantiles of order α/2 and 1 - α/2 (for example, the quantile u_α of order α is defined by Pr(X < u_α) = α); ii) decision based on the P-value: let f_test denote the value of the test function.

An equivalent method is to compare the P-value = Pr(X < f_test) with α and to decide randomness if P-value ≥ α.
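The equivalence of the two decision rules can be checked directly; the sketch below is illustrative only (the function names and the N(0,1) reference distribution under H0 are assumptions of the example), and uses the two-sided form of the P-value:

from scipy.stats import norm

alpha = 0.01  # illustrative significance level

def decide_by_confidence_interval(f_test):
    # Accept H0 iff the statistic lies in [u_{alpha/2}, u_{1-alpha/2}].
    return norm.ppf(alpha / 2) <= f_test <= norm.ppf(1 - alpha / 2)

def decide_by_p_value(f_test):
    # Accept H0 iff the two-sided P-value is at least alpha.
    return 2 * (1 - norm.cdf(abs(f_test))) >= alpha

# The two rules agree for any value of the statistic:
for f in (-3.1, -1.2, 0.4, 2.0, 2.9):
    assert decide_by_confidence_interval(f) == decide_by_p_value(f)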

3. State of the art of NIST testing decision functions

We need to develop dynamic statistical tools for testing the degree of randomness of binary sequences. Such tools exist in the field of public cryptography. NIST 800-22 [3] is a publication of sixteen statistical tests, which can be found at the Internet page of the Computer Security Resource Center [4], along with an implementation of this tool. We must remark that NIST 800-22 was one of the cryptographic tools involved in the evaluation of the candidates for the Advanced Encryption Standard (FIPS PUB 197).

2 Specificity measures the proportion of negatives which are correctly identified.
3 Sensitivity measures the proportion of actual positives which are correctly identified.
4 This error was first defined in system theory.

NIST SP 800-22 is a tutorial comprising 16 statistical randomness tests, used for evaluating pseudorandom generators as well as random number generators. The tests stipulated in NIST 800-22 are run at a first order risk of α = 0.01 (a 1% probability of rejecting a true hypothesis). These tests are not independent, which makes it difficult to compute a global rejection rate. Each of these statistical tests highlights a certain type of deviation from randomness. Moreover, none of these tests computes the second order risk (the probability of accepting a false hypothesis), so we considered it useful to modify them in order to compute this probability.

NIST 800-22 provides two methods for integrating the results of the 16 tests: i) the percentage of passed tests; ii) the uniformity of the P-values. The experiments revealed that these decision rules are insufficient, and therefore we considered their improvement useful. Therefore, in Oprina et al. [5], new integration methods for these tests were introduced (see the sketch below): i) majority decision; ii) maximum value decision, based on the maximum value of the test statistics, the distribution of the new statistic being normal; iii) sum of squares decision, based on the sum of squares of the test results, the distribution in this case being χ2, with the degrees of freedom given by the number of partial results being integrated. It is important to remark that the level of significance of the test suite is hard to estimate because the tests are not independent.

Thus, the NIST tests can be generalized in the following ways: i) generalize NIST STS 800-22 for an arbitrary α; ii) compute the second order risk for each statistical test proposed in NIST STS 800-22; iii) derive and compute the minimum sample size needed to achieve the desired error probabilities; iv) introduce and implement new decision rules.
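A minimal sketch of the three integration rules of [5] follows; it is illustrative only and assumes, for simplicity, k independent per-test statistics that are N(0,1) under H0 (for the maximum rule we use the exact distribution of the maximum of k such absolute values rather than a normal approximation):

from scipy.stats import chi2, norm

alpha = 0.01  # illustrative suite-level significance

def majority_decision(p_values):
    # Accept randomness if more than half of the individual tests accept.
    accepted = sum(p >= alpha for p in p_values)
    return accepted > len(p_values) / 2

def max_value_decision(z_stats):
    # Under independence, Pr(max_i |Z_i| <= t) = (2*Phi(t) - 1)**k.
    k, t = len(z_stats), max(abs(z) for z in z_stats)
    p_value = 1 - (2 * norm.cdf(t) - 1) ** k
    return p_value >= alpha

def sum_of_squares_decision(z_stats):
    # The sum of squares of k independent N(0,1) statistics is chi^2 with k df.
    q = sum(z * z for z in z_stats)
    p_value = chi2.sf(q, df=len(z_stats))
    return p_value >= alpha

z = [0.3, -1.1, 2.4, 0.7]
p = [2 * (1 - norm.cdf(abs(v))) for v in z]
print(majority_decision(p), max_value_decision(z), sum_of_squares_decision(z))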

4. Proof of concept: computing the second order risk in the statistical frequency test

Let us show an example on a well-known simple statistical test, called the frequency test. It is used to test the randomness of a sequence of zeroes and ones; in fact, it tests the closeness of the proportion of ones to 0.5.

Input: binary sequence s = s_1, ..., s_n. Denote by p_0 the probability of occurrence of the symbol 1, and by q_0 = 1 - p_0 that of the symbol 0.

Output: the decision to accept or reject the hypothesis of randomness, i.e., that the sequence s is the output of a symmetric binary source with Pr(S = 1) = p_0, the alternative hypothesis being Pr(S = 1) = p_1 ≠ p_0.

STEP 0. Read the sequence s and the rejection rate α.

STEP 1. Compute the test function5:

f = \frac{1}{\sqrt{n p_0 q_0}} \left( \sum_{i=1}^{n} s_i - n p_0 \right).

STEP 2. If f ∈ [u_{α/2}, u_{1-α/2}] we accept the hypothesis of randomness; otherwise we reject it.

STEP 3. Compute the second error probability:

\beta = \Phi\left(\sqrt{\frac{p_0 q_0}{p_1 q_1}}\left(u_{1-\frac{\alpha}{2}} - \frac{n(p_1 - p_0)}{\sqrt{n p_0 q_0}}\right)\right) - \Phi\left(\sqrt{\frac{p_0 q_0}{p_1 q_1}}\left(u_{\frac{\alpha}{2}} - \frac{n(p_1 - p_0)}{\sqrt{n p_0 q_0}}\right)\right),

where Φ is the N(0,1) distribution function.

5 It can be easily proven that the random variable f is N(0,1).
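The steps above can be summarized in a short sketch (the alternative p1, the level α and the random input are illustrative assumptions, not prescribed by the test):

import math
import random
from scipy.stats import norm

def frequency_test(bits, p0=0.5, p1=0.6, alpha=0.01):
    n = len(bits)
    q0, q1 = 1 - p0, 1 - p1
    # STEP 1: normalized deviation of the number of ones from its mean under H0.
    f = (sum(bits) - n * p0) / math.sqrt(n * p0 * q0)
    # STEP 2: accept randomness iff f lies between u_{alpha/2} and u_{1-alpha/2}.
    u_lo, u_hi = norm.ppf(alpha / 2), norm.ppf(1 - alpha / 2)
    accept = u_lo <= f <= u_hi
    # STEP 3: probability of accepting H0 when Pr(S=1) = p1 actually holds.
    scale = math.sqrt(p0 * q0 / (p1 * q1))
    shift = n * (p1 - p0) / math.sqrt(n * p0 * q0)
    beta = norm.cdf(scale * (u_hi - shift)) - norm.cdf(scale * (u_lo - shift))
    return f, accept, beta

random.seed(1)
bits = [random.randint(0, 1) for _ in range(1000)]
print(frequency_test(bits))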

5. Graphical interpretation

In Fig. 1 we show the graphical interpretation of a statistical test, in the case of testing the null hypothesis H0: m = m0 = 0 against the alternative H1: m ≠ m0. The reference distribution in this case is the normal one, given by:

f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x - m)^2}{2\sigma^2}}.

Fig. 1. Critical region of a statistical test at the 0.01 level of significance.

In real situations the value of the variance σ is unknown and needs to be estimated.


An unbiased estimation6 of the variance σ2 is s2, where

s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2,

and n is the size of the sample. The test outcome will be

f = \frac{\sqrt{n}\,(\bar{x} - m_0)}{s},

and the distribution of f is Student's t-distribution7 with n-1 degrees of freedom. The level of significance of the test was set at α = 0.01. For a sample of size n = 1000 the critical region is (-∞, -2.58) ∪ (2.58, +∞). Thus the probability of rejecting the null hypothesis when it is true is Pr(f ∈ (-∞, -2.58) ∪ (2.58, +∞) | m = 0) = 0.01, and the probability of accepting the null hypothesis when it is true is 0.99. Direct computations lead to the probability of acceptance:

\beta(m; n) = G_{n-1}\left(t_{n-1;\,1-\frac{\alpha}{2}} - \frac{\sqrt{n}\,(m - m_0)}{s}\right) - G_{n-1}\left(t_{n-1;\,\frac{\alpha}{2}} - \frac{\sqrt{n}\,(m - m_0)}{s}\right),

where G_{n-1}(x) is the distribution function of t(n-1). The value of the second error risk depends on the alternative: simulating an N(0,1) source we obtain a sample of n = 1000 values for which the computed value of β equals 0.0565. Values of the alternative m closer to m0, that is, |m - m0| → 0, imply, for the same type of test and the same sample size, larger values of β. Figure 2 reflects the above discussion, for the values σ = 1, m0 = 0 and x̄ = 1 (obtained from the sample values).
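This acceptance probability is easy to evaluate numerically; the following sketch (with illustrative values for m0, s, n and α) also illustrates that β grows as the alternative m approaches m0:

import math
from scipy.stats import t

def beta_t_test(m, m0=0.0, s=1.0, n=1000, alpha=0.01):
    # beta(m; n) = G_{n-1}(t_{n-1;1-a/2} - sqrt(n)(m-m0)/s)
    #            - G_{n-1}(t_{n-1;a/2}   - sqrt(n)(m-m0)/s)
    df = n - 1
    shift = math.sqrt(n) * (m - m0) / s
    return t.cdf(t.ppf(1 - alpha / 2, df) - shift, df) - t.cdf(t.ppf(alpha / 2, df) - shift, df)

# The closer the alternative m is to m0, the larger the second order risk beta.
for m in (0.2, 0.1, 0.05):
    print(m, round(beta_t_test(m), 4))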

6 That is, E(s2) = σ2.

7 Student's distribution, or the t-distribution, has the probability density function

f(t) = \frac{\Gamma\left(\frac{n+1}{2}\right)}{\sqrt{n\pi}\,\Gamma\left(\frac{n}{2}\right)}\left(1 + \frac{t^2}{n}\right)^{-\frac{n+1}{2}},

where n is the number of degrees of freedom and Γ is the Gamma function.

6. Conclusion

We have presented the principles of statistical techniques used in cryptographic evaluation. These statistical principles are based on NIST SP 800-22 and are improved by computing the second order error. In fact, this error is the probability of accepting a false hypothesis, that is, of accepting as random a sequence which in fact is not produced by a random source. We have also discussed some improvements in integrating the results of statistical tests applied on different samples. One open problem is to derive formulas for the minimum sample size that guarantees a good estimation of the confidence interval for each statistic. Another question is to derive estimates of the power of the NIST SP 800-22 test suite, which is difficult to compute because the statistical tests are not quite independent.

BIBLIOGRAPHY

[1] Donald Knuth, The Art of Computer Programming, Volume 2: Seminumerical Algorithms, 3rd edition, Addison-Wesley, Reading, Massachusetts, 1998.

[2] George Marsaglia, DIEHARD Statistical Tests: http://stat.fsu.edu/~geo/diehard.html.

[3] *** NIST Special Publication 800-22, A Statistical Test Suite for Random and Pseudorandom Number Generators for Cryptographic Applications, 2001.

[4] *** NIST standards: http://www.nist.gov/, http://www.csrc.nist.gov/.

[5] A. Oprina, A. Popescu, E. Simion and Gh. Simion, Walsh-Hadamard Randomness Test and New Methods of Test Results Integration, Bulletin of the Transilvania University of Braşov, vol. 2(51), Series III, 2009, pp. 93-106.
