
Statistical Testing of RNGs

For a sequence of numbers to be considered a sequence of randomly acquired numbers, it must have two basic statistical properties:

• Uniformly distributed values
• Independence from previous values drawn

A sequence fails a test for randomness when patterning is present in the sequence with some degree of statistical certainty. This may manifest itself by:

• Clustering or bias occurs at some point (ie., statistically, the sequence is not uniformly distributed)
  − This includes sequences of tuples drawn from the sequence (if truly random, there should be n-space uniformity)
• Standard statistical measures are too high or too low
  − Mean value
  − Standard deviation
• Statistically significant patterns occur (indicating dependencies)
  − Alternating patterns
  − Distribution patterns (correlation)
  − Other patterning

It is always possible that some underlying patterns in a sequence will go undetected.

NIST and RNGs

In recognition of the importance of random number generators, particularly for areas such as cryptography, NIST (http://csrc.nist.gov/rng/) has established guidelines for random number generation and testing with three primary goals:

1. to develop a battery of statistical tests to detect non-randomness in binary sequences constructed using random number generators and pseudo-random number generators utilized in cryptographic applications

2. to produce documentation and software implementation of these tests

3. to provide guidance in the use and application of these tests

Typical Tests for Randomness as Reported in 1979 (Bright, H. and R. Enison, "Quasi-Random Number Sequences from a Long-Period TLP Generator with Remarks on Application to Cryptography," Computing Surveys 11(4), 1979)

1. Equidistribution or Frequency Test. Count the number of times a member of the sequence falls into a given interval. The number should be approximately the same for intervals of the same length if the sequence is uniform.

2. Serial Test. This is a higher dimensional version of the equidistribution test. Count the number of times a k-tuple of members of the sequence falls into a given k-dimensional cell. If this test is passed, the sequence is said to be k-distributed. Other tests of k-distributivity are:

1. Poker test: Consider groups of k members of the sequence and count the number of distinct values represented in each group (e.g., a hand of 5 cards falls into one of a number of groups, such as one pair, two pairs, three of a kind, etc).

2. Maximum (minimum) of k: Plot the distribution of the function max (min) [of a k-tuple].

3. Sum of k: Plot the distribution of the sums of k consecutive members of the sequence.

3. Gap Test. Plot the distribution of gaps in the sequence of various lengths (typically, a gap is the distance between an item and its next recurrence)

4. Runs Test. Plot the distribution of the runs up (monotonic increasing) and runs down (monotonic decreasing), or the distribution of the runs above the mean and those below the mean, or the distribution of the run lengths.

5. Coupon Collector's Test. Choose a suitably small integer d and divide the universe into d intervals. Then each member of the sequence falls into one such interval. Plot the distribution of runs of various lengths required to have all d intervals represented. The idea is that you are trying to collect all of the “coupons”.

6. Permutation Test. Study the order relations between the members of the sequence in groups of k. Each of the k! possible orders should occur about equally often. If the universe is large, the probability of equality is small; otherwise, equal members may be disregarded.

7. Serial Correlation Test. Compute the correlation coefficient between consecutive members of the sequence. This gives the serial correlation for lag 1. Similarly, one may get the serial correlation for lag k by computing the correlation coefficient between xi and xi+k. This is to show that the members of the sequence are independent. Recall: the correlation coefficient references the method of least squares for fitting a line to a set of points. The closer the correlation coefficient is to 1, the more confidence in using the regression line as a predictor. Hence, an RNG should exhibit small serial correlation values.
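Two of these tests are easy to sketch in code. The Python sketch below (function names, the bin count, and the seeded sample are my own illustrative choices, not from the cited paper) implements the equidistribution test as a chi-square statistic over equal-width bins of [0,1), and the lag-k serial correlation:

```python
import random

def equidistribution_chi_square(seq, bins=10):
    """Frequency (equidistribution) test: compare the count of values
    landing in each of `bins` equal-width subintervals of [0,1) with
    the expected count len(seq)/bins, as a chi-square statistic."""
    counts = [0] * bins
    for x in seq:
        counts[min(int(x * bins), bins - 1)] += 1
    expected = len(seq) / bins
    return sum((c - expected) ** 2 / expected for c in counts)

def serial_correlation(seq, lag=1):
    """Sample correlation coefficient between x_i and x_{i+lag}."""
    n = len(seq) - lag
    xs, ys = seq[:n], seq[lag:]
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys)) / n
    vx = sum((a - mx) ** 2 for a in xs) / n
    vy = sum((b - my) ** 2 for b in ys) / n
    return cov / (vx * vy) ** 0.5

random.seed(42)
u = [random.random() for _ in range(10000)]
chi2_stat = equidistribution_chi_square(u)  # compare against chi-square, 9 df
sc = serial_correlation(u)                  # should be near 0 for a good RNG
print(chi2_stat, sc)
```

For a good generator the chi-square statistic should look like a draw from a chi-square distribution with bins−1 degrees of freedom, and the lag-1 correlation should be close to 0.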

Generally, the RNG is assumed to be producing uniform deviates; ie., numbers in the interval [0,1].

Knuth, Seminumerical Algorithms, (Addison-Wesley, 1997 – 3rd edition) is an excellent source for a discussion on testing random number generators.

An often cited battery of tests for random number generators is Marsaglia's 1995 Diehard collection (http://stat.fsu.edu/pub/diehard/). Both source and object files are provided, but at least some of the source is "an f2c conversion of patched and jumbled Fortran code," which may limit its usefulness.

Background for Hypothesis Testing

Given n observations, X1, X2, … , Xn (assumed to be from independent, identically distributed random variables), with (unknown) population mean µ and variance σ². The sample mean

X(n) = ( ∑i=1..n Xi ) / n

provides an (unbiased) estimator for µ (meaning that if we perform a large number of independent trials, each producing a value for X(n), the average of the X(n)'s will be a close approximation to µ).

Likewise the sample variance

S²(n) = ( ∑i=1..n [Xi − X(n)]² ) / (n − 1)

is an estimator for σ2 (the population variance is calculated by dividing by n; division by n-1 is used in this case because it can be proved that the sample variance is an unbiased estimator for σ2 when dealing with independent, identically distributed random variables)
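A small simulation makes the n−1 divisor plausible. The Python sketch below (the sample size, trial count, and N(0, 2) deviates are illustrative choices) averages both versions of the sample variance over many trials; the n−1 divisor centers on σ², while dividing by n underestimates it by the factor (n−1)/n:

```python
import random

def sample_variance(xs, unbiased=True):
    """S^2(n): divide by n-1 when unbiased, by n otherwise."""
    n = len(xs)
    m = sum(xs) / n
    ss = sum((x - m) ** 2 for x in xs)
    return ss / (n - 1) if unbiased else ss / n

random.seed(1)
n, trials = 5, 20000
sigma2 = 4.0            # true variance of the N(0, 2) deviates below
tot_unbiased = tot_biased = 0.0
for _ in range(trials):
    xs = [random.gauss(0.0, 2.0) for _ in range(n)]
    tot_unbiased += sample_variance(xs, True)
    tot_biased += sample_variance(xs, False)
avg_unbiased = tot_unbiased / trials   # should be close to sigma2 = 4.0
avg_biased = tot_biased / trials       # should be close to (n-1)/n * sigma2 = 3.2
print(avg_unbiased, avg_biased)
```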

Basically, two random variables are independent if there is no relationship between them; ie., the values assumed by one random variable tell us nothing about how the values of the other one distribute, and vice-versa. Of course, using X(n) to estimate µ has little validity without knowing more information to provide an assessment of how closely X(n) approximates µ.

For the mean, it is easy to show that
• E(cX) = cE(X)
• E(X1 + X2) = E(X1) + E(X2)

For the variance,
• Var(cX) = c²Var(X), but
• Var(X1 + X2) = Var(X1) + Var(X2) only if the Xi's are independent

Evidently, we must examine Var(X(n)) to see how good an approximation to µ we can expect. Consider that

Var(X(n)) = Var( (1/n) ∑i=1..n Xi ) = (1/n²) Var( ∑i=1..n Xi ) = (1/n²) ∑i=1..n Var(Xi)

if the Xi's are independent.

If so, then Var( X(n) ) = σ2/n

In other words, when the Xi’s are independent, it is quite clear that with larger values of n, the variance shrinks and averaging the sample provides a good estimate for µ.
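This shrinkage is easy to see empirically. The Python sketch below (sample sizes, trial count, and the standard normal deviates are arbitrary choices) estimates Var(X(n)) over repeated trials and compares it with σ²/n:

```python
import random

random.seed(7)
sigma2 = 1.0   # variance of the standard normal deviates used below

results = {}
for n in (10, 40, 160):
    means = []
    for _ in range(5000):
        xs = [random.gauss(0.0, 1.0) for _ in range(n)]
        means.append(sum(xs) / n)
    grand = sum(means) / len(means)
    # empirical variance of the sample mean across the 5000 trials
    results[n] = sum((m - grand) ** 2 for m in means) / (len(means) - 1)
    print(n, results[n], sigma2 / n)   # the two columns should roughly agree
```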

Does this matter? Law and Kelton report that simulation outputs are almost always correlated; ie., the set of means acquired from multiple simulation runs lacks independence. In this case, using the sample variance as an estimator may significantly underestimate the population variance, in turn leading to erroneous estimates of µ.

Covariance and Correlation as Measures of Independence Between Two Random Variables

For two random variables X1 and X2, the covariance is given by Cov(X1, X2) = E[(X1 - µ1)( X2 - µ2)] = E(X1X2) - µ1µ2

This follows since

E[(X1 − µ1)(X2 − µ2)] = E(X1X2) − µ1E(X2) − µ2E(X1) + µ1µ2 = E(X1X2) − µ1µ2

using E(X1) = µ1 and E(X2) = µ2. If X1 and X2 are independent, then E(X1X2) = E(X1)E(X2) (this can be determined from the fact that for independent events x and y, P(xy) = P(x)P(y)). Hence, when X1 and X2 are independent, Cov(X1, X2) = 0.

The relationship σ² = E[(X − µ)²] = E(X²) − µ² is not hard to show (we did it earlier), so in particular, Cov(X1, X1) = E(X1²) − µ1² = σ1² (a boundary case, since nothing could be less independent than a random variable compared with itself).

When Cov(X1, X2) = 0, the two random variables are said to be uncorrelated. If X1 and X2 are independent, we noted Cov(X1, X2) = 0, so zero covariance is necessary for independence; it turns out, however, not to be sufficient (so, for those interested in probability theory, criteria under which uncorrelated random variables must also be independent provide a natural object of study).

When Cov(X1, X2) > 0, the two random variables are said to be positively correlated (the counterpart being negatively correlated). Intuitively,

positive correlation typically occurs when both X1 > µ1 and X2 > µ2 or X1 < µ1 and X2 < µ2 (with the obvious counterpart for negative). In particular, when negative correlation is present, you expect that when one sample mean is on the high side, the other will be on the low side.

To normalize the covariance, the correlation ρ12 (also known as the correlation coefficient) is defined by

ρ12 = Cov(X1, X2) / (σ1σ2)

Here we now have

−1 ≤ ρ12 ≤ 1

This follows from the Schwarz inequality, [E(X1X2)]² ≤ E(X1²)E(X2²). The closer ρ12 is to 1 or −1, the greater the dependence relationship between X1 and X2.
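These cases can be illustrated with a short sketch. In the Python code below (the 0.5 mixing weight, seed, and sample size are arbitrary), a sample estimate of ρ is computed for a positively correlated pair, an independent pair, and the boundary case of a random variable with itself:

```python
import random

def correlation(xs, ys):
    """Sample estimate of rho = Cov(X1, X2) / (sigma1 * sigma2)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys)) / n
    sx = (sum((a - mx) ** 2 for a in xs) / n) ** 0.5
    sy = (sum((b - my) ** 2 for b in ys) / n) ** 0.5
    return cov / (sx * sy)

random.seed(3)
x = [random.gauss(0, 1) for _ in range(5000)]
y_pos = [a + 0.5 * random.gauss(0, 1) for a in x]   # positively correlated with x
y_ind = [random.gauss(0, 1) for _ in range(5000)]   # independent of x
print(correlation(x, y_pos))   # well above 0
print(correlation(x, y_ind))   # near 0
print(correlation(x, x))       # 1: the boundary case Cov(X1, X1) = sigma1^2
```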

Simulation models typically employ random variables for input, so the model outputs are also random. A stochastic process is a collection of random variables ordered over time, so in general, the collection of random variables representing one of the outputs of a simulation model over multiple runs, X1, X2, …, Xn, is a stochastic process.

Reaching conclusions about the model from the implicit stochastic process may require making assumptions about the nature of the stochastic process that may not be true, but are necessary for the application of statistical analysis. For example, an assumption that the theoretical values of µ and σ are the same across X1, X2, …, Xn. Similarly, the assumption that the Cov(Xi, Xi+t) are the same for all t tends to hold only after a "warmup" period for the model to stabilize (the term "covariance stationary" is used to describe this situation).

Confidence Intervals

When using data from multiple simulation runs, the objective is to make inferences based on the data provided by the runs. Point estimates (eg., a mean value) are one form of inference. Interval estimates (eg., a confidence interval) are another form of inference.

When collecting sample means Xi,

X(n) = ( ∑i=1..n Xi ) / n

we expect balance in the sense that E(X(n)) = µ; ie., the collected set of sample means balances around the population mean µ. This is the intuitive basis underlying the Central Limit Theorem, namely, that the sampled means distribute (approximately) normally around µ for large enough sample size n. Moreover, for large n, the mean and variance of the distribution of sample means are given (approximately) by µ and σ²/n.

Note that the Central Limit Theorem assumes that the random variables X1, X2, …, Xn corresponding to the sample means are independent and identically distributed. Since they derive from the same pdfs, they will in general be identically distributed, but will lack independence as noted by Law and Kelton; ie., we are likely to underestimate σ² when working with simulation outcomes with the objective of estimating µ. The practical importance lies in explaining why, for smaller n, there may be a great deal of variance among the simulation outcomes.
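The Central Limit Theorem is easy to watch in action. The Python sketch below (assuming exponential(1) observations, for which µ = σ = 1; the sample size and trial count are illustrative) standardizes sample means and checks what fraction falls inside the central 90% interval of the standard normal:

```python
import random
from statistics import NormalDist

random.seed(11)
mu = sigma = 1.0      # mean and standard deviation of exponential(1)
n, trials = 40, 4000

zs = []
for _ in range(trials):
    xs = [random.expovariate(1.0) for _ in range(n)]
    xbar = sum(xs) / n
    # standardized sample mean Z_n = (Xbar - mu) / sqrt(sigma^2 / n)
    zs.append((xbar - mu) / (sigma / n ** 0.5))

# fraction of Z_n values inside the central 90% normal interval
z95 = NormalDist().inv_cdf(0.95)   # roughly 1.645
inside = sum(-z95 <= z <= z95 for z in zs) / trials
print(inside)   # should be roughly 0.90 if the normal approximation holds
```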

σ² is not known, but if independence of the random variables can be assumed, it can be replaced by the sample variance S²(n) as discussed earlier.

If we normalize by switching to the sequence of random variables Zn given by

Zn = (X(n) − µ) / √(σ²/n)

then E(Zn) ≈ 0 and Var(Zn) = E(Zn²) ≈ 1; ie., the distribution approximates the standard normal distribution (keep in mind that we are assuming that the sample means X1, X2, …, Xn are random variables which distribute normally as supported by the central limit theorem).

Substituting the sample variance S²(n) in Zn for σ², we get the values

tn = (X(n) − µ) / √(S²(n)/n)

which can be shown to approximate a distribution referred to in the literature as Student's t distribution (so-called because it was published under the pen name Student by W. S. Gossett in 1908).

Assuming independence throughout, the t distribution is given by

W / √(V/n)

where W is given by the standard normal distribution and V is given by the chi-squared distribution with n degrees of freedom; ie., the gamma distribution with control values α = n/2 and β = 2. The calculations are usually given in a table based on "degrees of freedom" n ≥ 2 and "confidence" values γ.

We won't get into the details regarding how the t distribution was originally arrived at, but technically, the t distribution with n degrees of freedom is given by the pdf

f(x) = Γ((n+1)/2) / ( √(πn) · Γ(n/2) · (1 + x²/n)^((n+1)/2) )

and as might be expected closely approximates the normal distribution, particularly for large n. Its key virtue is that it provides an estimate for what constitutes "large enough n".

In our case, the tn values

tn = (X(n) − µ) / √(S²(n)/n)

approximate a t distribution with n − 1 degrees of freedom (reflecting the fact that the sample variance divides by n − 1).

For our sequence of random variables X1, X2, …, Xn we expect with probability approximately 1- α (shaded area below) that the values will fall in the interval [-z1- α /2, z1-α/2], where the notation z1- α /2 designates the symmetric points on the axis yielding area 1- α.

f(x) = (1/√(2π)) e^(−x²/2)

[Figure: standard normal density with symmetric cutoffs −z1−α/2 and z1−α/2 bounding the shaded area 1−α]

1 − α = ∫ from −z1−α/2 to z1−α/2 of (1/√(2π)) e^(−z²/2) dz

[-z1-α/2, z1-α/2] is called the confidence interval for 100(1−α) percent confidence. For a simulation outcome, either the confidence interval contains the mean or it does not. If the means are normally distributed, are independent, and one conducts a sufficiently large number of experiments, then the proportion of these whose values fall within the confidence interval should be approximately (1−α). Recall the normalized statistic Zn = (X(n) − µ)/√(σ²/n); substituting S²(n) for σ², with our data we could then look at the interval

X(n) ± z1−α/2 √(S²(n)/n)

The trouble is that this sheds no light on what "sufficiently large" n is.

Since the sequence t1, t2, … tn approximates a t distribution with n−1 degrees of freedom, we have a natural alternative for getting confidence intervals that does take into account the number of outcomes. Hence, for any finite n, we can use the interval

X(n) ± tn−1,1−α/2 √(S²(n)/n)

where tn−1,1−α/2 is the upper critical point with n − 1 degrees of freedom for a probability (area across the interval) of 1 − α.

This is a measure of how precisely we know µ for the given sample size.

[Figure: the standard normal density compared with Student t densities for 30, 4, and 2 degrees of freedom]

Remarks:
• the t distribution has longer tails than the normal distribution; ie., tn−1,1−α/2 > z1−α/2, so it is a more conservative testing mechanism.
• if we increase n by a factor of 4 to 4n, then the half-width of the confidence interval, tn−1,1−α/2 √(S²(n)/n), is decreased by a factor of approximately 2.

• as n→∞, the values tn−1,1−α/2 → z1−α/2
• the approximate point differences (absolute values) and the cumulative area differences (computed with ∆x = 0.125) between the Student t and standard normal distributions for various degrees of freedom range as follows:

Degrees of freedom    Point difference    Area difference
2                     6.12%               6.23%
4                     4.09%               1.88%
8                     2.27%               0.50%
16                    1.19%               0.14%
32                    0.68%               0.05%
64                    0.46%               0.02%
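The factor-of-4 remark is simple arithmetic. In the sketch below, the critical points 2.262 and 2.023 are taken from a standard t table (95% two-sided, 9 and 39 degrees of freedom), and the sample standard deviation s is an arbitrary placeholder assumed to stay roughly the same as n grows:

```python
s = 10.0   # assumed (placeholder) sample standard deviation

# upper critical points t_{n-1, 0.975} from a standard t table
t_crit = {10: 2.262, 40: 2.023}

# half-width t_{n-1,1-alpha/2} * sqrt(S^2(n)/n) for n = 10 and n = 40
hw10 = t_crit[10] * s / 10 ** 0.5
hw40 = t_crit[40] * s / 40 ** 0.5
print(hw10, hw40, hw10 / hw40)   # ratio a bit over 2, since the t point also shrinks
```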

Hypothesis Testing

Law and Kelton report an interesting experiment that motivates hypothesis testing:

• For each of the normal, exponential, chi-square, lognormal, and hyperexponential distributions
• For each of n = 5, 10, 20, and 40 as sample sizes
• Generate the sample and compute its average
• Do this 500 times and keep track of whether or not the average is in the confidence interval given by tn−1,1−α/2

In other words, for a distribution of known mean µ and for n = 10, 10 observations are generated and the average computed. It is either in the confidence interval given by tn−1,1−α/2 or it is not.

Tabulation of the results for a 90 percent confidence interval:

Distribution          Skewness   n=5     n=10    n=20    n=40
Normal                0.00       0.910   0.902   0.898   0.900
Exponential           2.00       0.854   0.878   0.870   0.890
Chi square (1 df)     2.83       0.810   0.830   0.848   0.890
Lognormal             6.18       0.758   0.768   0.842   0.852
Hyperexponential      6.43       0.584   0.586   0.682   0.774

As predicted by the Central Limit Theorem, for each distribution the coverage gets closer to 0.90 as n increases. Skewness evidently affects the size of the sample required. The skewness is a measure of symmetry, equal to 0 for a symmetric distribution. It is given by the ratio

E[(X − µ)³]/σ³ = ( ∫ from −∞ to ∞ of (x − µ)³ f(x) dx ) / σ³

(contrast this with the self correlation measure E[(X − µ)²]/σ² = 1)
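The Law and Kelton experiment is straightforward to replicate in miniature. In the Python sketch below, the choice of distributions, the seed, and the critical points 1.833 and 1.685 (read from a standard t table for 90% two-sided confidence at 9 and 39 degrees of freedom) are my own; coverage is estimated for two of the five distributions:

```python
import random

def coverage(draw, mu, n, t_crit, trials=500):
    """Fraction of trials whose t confidence interval contains the true mean mu."""
    hits = 0
    for _ in range(trials):
        xs = [draw() for _ in range(n)]
        xbar = sum(xs) / n
        s2 = sum((x - xbar) ** 2 for x in xs) / (n - 1)
        hw = t_crit * (s2 / n) ** 0.5
        hits += (xbar - hw <= mu <= xbar + hw)
    return hits / trials

random.seed(5)
t_crit = {10: 1.833, 40: 1.685}   # t_{n-1, 0.95} from a standard t table
results = {}
for n in (10, 40):
    results[(n, "normal")] = coverage(lambda: random.gauss(0, 1), 0.0, n, t_crit[n])
    results[(n, "exponential")] = coverage(lambda: random.expovariate(1.0), 1.0, n, t_crit[n])
    print(n, results[(n, "normal")], results[(n, "exponential")])
```

As in the tabulated experiment, the skewed exponential tends to under-cover at small n while the normal stays near 0.90.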

In our scenario, we have a sequence of random variables X1, X2, …, Xn corresponding to the sample means, and we assume the criteria of the Central Limit Theorem are met, so these are (approximately) normally distributed.

We wish to test the "null hypothesis" H0 that µ = µ0, where µ0 is a fixed hypothesized value for µ. Our expectation is that if

|X(n) − µ0|

is large, then H0 is not likely to be true.

If H0 is true, we now know that the tn statistic has a t distribution with n-1 degrees of freedom. Therefore, our hypothesis test for µ = µ0 is arguably best handled using the t statistic and is as follows: if |tn| > tn-1,1-α/2 then “reject” H0

This is the same as saying the test value falls outside of the confidence interval. The probability that the statistic falls into the rejection region when H0 is true is called the level of the test, typically chosen equal to 0.05 or 0.10. We expect to reject in no more than 5% or 10% of cases if H0 is true.

Two types of errors are possible when performing a hypothesis test:
1. Type I error: H0 is rejected even though true
2. Type II error: H0 is accepted even though false

The probability α is under the experimenter’s control and bounds the probability of a Type I error. In contrast, for a fixed level α, the probability of a Type II error depends on what is actually true and in general has no bound under experimenter control.

If n is increased, then the probability of a Type II error is reduced, but the bounding issue remains. For this reason, the terminology employed in practice is "reject" and "fail to reject", since when we fail to reject, we don't know with any definable certainty that H0 is true (ie., the test may not be powerful enough to make the distinction).

In general practice, the t statistic can be used if the distribution of the sample values is of the "mounded" variety.

Example:

Suppose the assumed mean for a system is 3000 and we prepare a simulation model for the system. For 10 runs, we have 9 degrees of freedom. Suppose that we obtain the means

µ1=3005, µ2=2996, µ3=2928, µ4=3003, µ5=2932, µ6=2937, µ7=2961, µ8=3015, µ9=2945, µ10=3001

The Null hypothesis H0 is that the mean for the simulation model is µ0 = 3000; ie., that X(n) − 3000 = 0 up to sampling variation. From the Student t table, if we want 95% confidence, we want interval end points of −t9,0.975 and t9,0.975 (2.5% on each end). The tabulated values are then in the t0.025 column at degrees of freedom ν = 9; ie., 2.26. For 99% confidence, the critical value is 3.25.

The computed values are X(10) = 2972.3 and S(10) = 34.84266. The computed t statistic

t = (X(10) − µ0) / √(S²(10)/10) = −2.51402

falls outside of the confidence interval. Hence, we can reject the Null hypothesis with 95% confidence (meaning in this case 95% confidence that the mean is < 3000), but we cannot reject it with 99% confidence.

More to the point, we can construct a 95% confidence interval around the mean as

X(10) ± 2.26 · √(S²(10)/10) = 2972.3 ± 2.26 · (34.84266/√10) = 2972.3 ± 24.9

For the 99% case, the interval widens to ±35.8, which includes 3000. Note that if n is increased, the confidence interval shortens.
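The arithmetic in this example is easy to verify. A short Python check (2.26 is the t9,0.975 critical point quoted above):

```python
# Checking the worked example: n = 10 runs, H0: mu0 = 3000
data = [3005, 2996, 2928, 3003, 2932, 2937, 2961, 3015, 2945, 3001]
n = len(data)
xbar = sum(data) / n                               # sample mean X(10)
s2 = sum((x - xbar) ** 2 for x in data) / (n - 1)  # sample variance S^2(10)
s = s2 ** 0.5
t = (xbar - 3000) / (s2 / n) ** 0.5                # t statistic
half_width_95 = 2.26 * (s2 / n) ** 0.5             # t_{9, 0.975} = 2.26
print(xbar, s, t, half_width_95)
# xbar = 2972.3, s ≈ 34.84, t ≈ -2.514, half-width ≈ 24.9
```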

The t statistic is employed where data has mound characteristics. The distribution utilized should be one that can arguably represent the sampled data. In testing a sequence of numbers for randomness, for example, uniformity implies the test statistic must compare the sampled values to the uniform distribution.

A Specific Test for Randomness: The Runs Test

A sequence of values can be uniformly distributed but still exhibit characteristics of dependence.

Consider the sequence of integers

6, 1, 6, 1, 4, 6, 7, 6, 10, 7, 7, 3, 3, 1, 8, 2, 5, 9, 8, 5, 10, 4, 4, 4, 6, 3, 9, 10, 8, 7

Flag each number with a + if it is followed by a larger number and with a – if it is followed by a smaller number:

- + - + + + - + - + - + - + - + + - - + - + + + - + + - -

Each string of +’s and –‘s forms a run, so our sequence of 30 numbers has 21 runs. There can be too few runs (e.g., the numbers simply ascend) or too many runs (e.g. alternating runs of length 1).

By using combinatorial counting methods (where every combination of +'s and –'s is equally probable, in our case 2^30 of them), it can be shown that the mean number of runs µr for sequences of length N is given by

µr = (2N − 1)/3    with variance    σr² = (16N − 29)/90

We know that since we are dealing with means, we can use the Student t distribution for the test statistic.

For N=30, µr = 19.67 and σr = 3.877. The Null hypothesis is that the sample mean is the same as µr; ie., with some degree of confidence, no pattern indicating dependence is present.

For our run count with N = 30 and a = 21, the t statistic is

t = (a − 19.67) / (3.877/√30) = 1.879

With 29 degrees of freedom, the t0.025 column value is 2.04, and for 99% confidence the critical value is 2.76. Hence, we fail to reject the Null hypothesis with 95% confidence, indicating that the test does not call independence into question by a significant amount (5%); ie., independence is not rejected based on this test. Note: because σr is known, the z statistic could have been used; namely, (a − µr)/σr = 0.34. The 99% value is 2.33. The 64% value is 0.36 (in which case the test does not call independence into question at least 36% of the time), demonstrating that the t test is more conservative.
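The runs-up-and-down count itself is easy to implement. The sketch below (assuming, to match the flags in the example above, that a tie is flagged +) reproduces the 21 runs and the expected mean for the sequence:

```python
def runs_up_down(seq):
    """Count runs of consecutive +/- flags, where element i is flagged +
    if seq[i+1] >= seq[i] and - otherwise (ties flagged + to match the
    worked example in these notes)."""
    flags = ['+' if b >= a else '-' for a, b in zip(seq, seq[1:])]
    runs = 1
    for prev, cur in zip(flags, flags[1:]):
        if cur != prev:
            runs += 1
    return runs

seq = [6, 1, 6, 1, 4, 6, 7, 6, 10, 7, 7, 3, 3, 1, 8, 2, 5, 9, 8, 5,
       10, 4, 4, 4, 6, 3, 9, 10, 8, 7]
N = len(seq)
a = runs_up_down(seq)      # 21 runs for this sequence
mu_r = (2 * N - 1) / 3     # expected number of runs, 59/3 ≈ 19.67
print(a, mu_r)
```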