
arXiv:cond-mat/0609584v2 [cond-mat.other] 3 Jul 2007

Random numbers for large scale distributed Monte Carlo simulations

Heiko Bauke∗
Institut für Theoretische Physik, Universität Magdeburg, Universitätsplatz 2, 39016 Magdeburg, Germany and
Rudolf Peierls Centre for Theoretical Physics, University of Oxford, 1 Keble Road, Oxford, OX1 3NP, United Kingdom

Stephan Mertens†
Institut für Theoretische Physik, Universität Magdeburg, Universitätsplatz 2, 39016 Magdeburg, Germany
(Dated: April 26, 2007)

∗E-mail: [email protected]
†E-mail: [email protected]

Monte Carlo simulations are one of the major tools in statistical physics, complex system science, and other fields, and an increasing number of these simulations is run on distributed systems like clusters or grids. This raises the issue of generating random numbers in a parallel, distributed environment. In this contribution we demonstrate that multiple linear recurrences in finite fields are an ideal method to produce high quality pseudorandom numbers in sequential and parallel algorithms. Their known weakness (failure of sampling points in high dimensions) can be overcome by an appropriate delinearization that preserves all desirable properties of the underlying linear sequence.

PACS numbers: 02.70.-c, 02.70.Rr, 05.10.Ln

1. Introduction

The Monte Carlo method is a major industry and random numbers are its key resource. In contrast to most commodities, quantity and quality of randomness have an inverse relation: The more randomness you consume, the better it has to be [1]. The quality issue arises because simulations only approximate randomness by generating a stream of deterministic numbers, named pseudorandom numbers (PRNs), with successive calls to a pseudorandom number generator (PRNG). Considering the ever increasing computing power, however, the quality of PRNGs is a moving target. Simulations that consume 10^10 pseudorandom numbers were considered large-scale Monte Carlo simulations a few years ago. On a present-day desktop machine this is a ten minute run.

More and more large scale simulations are run on parallel systems like networked workstations, clusters, or "the grid." In a parallel environment the quality of a PRNG is even more important, to some extent because feasible sample sizes are easily 10 to 10^5 times larger than on a sequential machine. The main problem is the parallelization of the PRNG itself, however. Some good generators are not parallelizable at all, others lose their efficiency, their quality, or even both when being parallelized [2, 3].

In this contribution we discuss linear recurrences in finite fields. They excel as sources of pseudorandomness because they have a solid mathematical foundation and they can easily be parallelized. Their linear structure, however, may cause problems in experiments that sample random points in high dimensional space. This is a known issue ("Random numbers fall mainly in the planes" [4]), but it can be overcome by a proper delinearization of the sequence.

The paper is organized as follows. In section 2 we briefly review some properties of linear recurrences in finite fields, and we motivate their use as PRNGs. In section 3 we discuss various techniques for parallelizing a PRNG. We will see that only two of these techniques allow for parallel simulations whose results do not depend on the number of processes, and we will see how linear recurrences support these parallelization techniques. Section 4 addresses the problem of linear sequences to sample points in high dimensions, measured by the spectral test. We show that with a proper transformation (delinearization) we can generate a new sequence that shares all nice properties with the original sequence (like parallelizability, equidistribution properties, etc.), but passes many statistical tests that are sensitive to the hyperplane structure of points of linear sequences in high dimensions. In the last two sections we discuss the quality of the linear and nonlinear sequences in parallel Monte Carlo applications and how to implement these generators efficiently.

2. Recurrent randomness

Almost all PRNGs produce a sequence (r) = r_1, r_2, ... of pseudorandom numbers by a recurrence

    r_i = f(r_{i-1}, r_{i-2}, ..., r_{i-k}),    (1)

and the art of random number generation lies in the design of the function f. Numerous recipes for f have been discussed in the literature [5, 6]. One of the oldest and most popular PRNGs is the linear congruential generator (LCG)

    r_i = a · r_{i-1} + b mod m    (2)

introduced by Lehmer [7]. The additive constant b may be zero. Knuth [8] proposed a generalization of Lehmer's method known as the multiple recursive generator (MRG) [6, 9, 10]

    r_i = a_1 · r_{i-1} + a_2 · r_{i-2} + ... + a_n · r_{i-n} mod m.    (3)
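As a concrete illustration of the recurrence (3), the following self-contained C++ sketch implements an MRG of order three. The modulus is the Mersenne prime 2^31 - 1 mentioned later in the paper, but the coefficients are placeholders chosen only for illustration; they are not the TRNG parameters and have not been checked against the primitivity condition of Theorem 2.

```cpp
#include <cstdint>
#include <iostream>

// Illustrative parameters only; NOT the coefficients used by TRNG and not
// verified to yield a primitive characteristic polynomial.
constexpr std::int64_t m  = 2147483647;   // 2^31 - 1
constexpr std::int64_t a1 = 123456789;
constexpr std::int64_t a2 = 362436069;
constexpr std::int64_t a3 = 521288629;

class mrg3_sketch {
public:
  mrg3_sketch(std::int64_t s1, std::int64_t s2, std::int64_t s3)
      : r1(s1 % m), r2(s2 % m), r3(s3 % m) {}
  std::int64_t operator()() {
    // r_i = a1*r_{i-1} + a2*r_{i-2} + a3*r_{i-3} mod m;
    // each product of two 31-bit values fits into a 64-bit register
    std::int64_t t = (a1 * r1) % m;
    t = (t + (a2 * r2) % m) % m;
    t = (t + (a3 * r3) % m) % m;
    r3 = r2;  r2 = r1;  r1 = t;
    return t;
  }
private:
  std::int64_t r1, r2, r3;   // r_{i-1}, r_{i-2}, r_{i-3}
};

int main() {
  mrg3_sketch gen(12345, 67890, 54321);
  for (int i = 0; i < 5; ++i)
    std::cout << gen() << '\n';
}
```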

Surprisingly, on a computer all recurrences are essentially linear. This is due to the simple fact that computers operate on numbers of finite precision; let us say integers between 0 and r_max. This has two important consequences:

• Every recurrence (1) must be periodic.

• We can embed the sequence (r) into the finite field F_m, the field of integers with arithmetic modulo m, where m is a prime number larger than the largest value in the sequence (r).

The relevance of linear sequences is then based on the following fact [11, corollary 6.2.3]:

Theorem 1. All finite length sequences and all periodic sequences over a finite field F_m can be generated by a linear recurrence (3).

In the theory of finite fields, a sequence of type (3) is called a linear feedback shift register sequence, or LFSR sequence for short. Note that a LFSR sequence is fully determined by specifying n coefficients (a_1, a_2, ..., a_n) plus n initial values (r_1, r_2, ..., r_n).

According to Theorem 1 all finite length sequences and all periodic sequences are LFSR sequences. The linear complexity of a sequence is the minimum order of a linear recurrence that generates this sequence. The Berlekamp-Massey algorithm [11, 12] is an efficient procedure for finding this shortest linear recurrence. The linear complexity of a random sequence of length T on average equals T/2 [13]. As a consequence a random sequence cannot be compressed by coding it as a linear recurrence, because T/2 coefficients plus T/2 initial values have to be specified. Typically the linear complexity of a nonlinear sequence (1) is of the same order of magnitude as the period of the sequence. Since the designers of PRNGs strive for "astronomically" large periods, implementing nonlinear generators as a linear recurrence is not tractable. As a matter of principle, however, all we need to discuss are linear recurrences, and any nonlinear recurrence can be seen as an efficient implementation of a large order linear recurrence. The popular PRNG "Mersenne Twister" [14], for example, is an efficient implementation of a linear recurrence in F_2 of order 19937.

There is a wealth of rigorous results on LFSR sequences that can (and should) be used to construct a good PRNG. Here we only discuss a few but important facts without proofs. A detailed discussion of LFSR sequences including proofs can be found in [11, 15-19].

Since the all zero tuple (0, 0, ..., 0) is a fixed point of (3), the maximum period of a LFSR sequence cannot exceed m^n - 1. The following theorem tells us precisely how to choose the coefficients (a_1, a_2, ..., a_n) to achieve this period [5].

Theorem 2. The LFSR sequence (3) over F_m has period m^n - 1, if and only if the characteristic polynomial

    f(x) = x^n - a_1 x^{n-1} - a_2 x^{n-2} - ... - a_n    (4)

is primitive modulo m.

A monic polynomial f(x) of degree n over F_m is primitive modulo m, if and only if it is irreducible (i.e., cannot be factorized over F_m), and if it has a primitive element of the extension field F_{m^n} as one of its roots. The number of primitive polynomials of degree n modulo m is equal to φ(m^n - 1)/n = O(m^n/(n ln(n ln m))) [20], where φ(x) denotes Euler's totient function. As a consequence a random polynomial of degree n is primitive modulo m with probability ≃ 1/(n ln(n ln m)), and finding primitive polynomials reduces to testing whether a given polynomial is primitive. The latter can be done efficiently, if the factorization of m^n - 1 is known [11], and most computer algebra systems offer a procedure for this.

Theorem 3. Let (r) be an LFSR sequence (3) with a primitive characteristic polynomial. Then each k-tuple (r_{i+1}, ..., r_{i+k}) ∈ {0, 1, ..., m-1}^k occurs m^{n-k} times per period for k ≤ n (except the all zero tuple for k = n, which never occurs).

From this theorem it follows that, if a tuple of k ≤ n successive terms is drawn from a random position of a LFSR sequence, the outcome is uniformly distributed over all possible k-tuples in F_m. This is exactly what one would expect from a truly random sequence. In terms of Compagner's ensemble theory, tuples of length k ≤ n of a LFSR sequence with primitive characteristic polynomial are indistinguishable from truly random tuples [21, 22].

Theorem 4. Let (r) be an LFSR sequence (3) with period T = m^n - 1, and let α be a complex mth root of unity and ᾱ its complex conjugate. Then

    C(h) := Σ_{i=1}^{T} α^{r_i} · ᾱ^{r_{i+h}} = { T    if h = 0 mod T
                                                 { -1   if h ≠ 0 mod T.    (5)

C(h) can be interpreted as the autocorrelation function of the sequence, and Theorem 4 tells us that LFSR sequences with maximum period have autocorrelations that are very similar to the autocorrelations of a random sequence with period T. Together with the nice equidistribution properties (Theorem 3) this qualifies LFSR sequences with maximum period as pseudonoise sequences, a term originally coined by Golomb for binary sequences [11, 15].

3. Parallelization

3.1. General parallelization techniques

In parallel applications we need to generate streams t_{j,i} of random numbers, where j = 0, 1, ..., p-1 numbers the streams for each of the p processes. We require statistical independence of the t_{j,i} within each stream and between the streams. Four different parallelization techniques are used in practice:

Random seeding: All processes use the same PRNG but a different "random" seed. The hope is that they will generate nonoverlapping and uncorrelated subsequences of the original PRNG. This hope, however, has no theoretical foundation.


Figure 1: Parallelization by block splitting.

Figure 2: Parallelization by leapfrogging.

Random seeding is a violation of Donald Knuth's advice, "Random number generators should not be chosen at random" [5].

Parameterization: All processes use the same type of generator but with different parameters for each processor. Example: linear congruential generators with an additive constant b_j for the jth stream [23]:

    t_{j,i} = a · t_{j,i-1} + b_j mod m,    (6)

where the b_j's are different prime numbers just below √(m/2). Another variant uses different multipliers a for different streams [24]. The theoretical foundation of these methods is weak, and empirical tests have revealed serious correlations between streams [25]. On a massively parallel system you may need thousands of parallel streams, and it is not trivial to find a type of PRNG with thousands of "well tested" parameter sets.

Block splitting: Let M be the maximum number of calls to a PRNG by each processor, and let p be the number of processes. Then we can split the sequence (r) of a sequential PRNG into consecutive blocks of length M such that

    t_{0,i}   = r_i
    t_{1,i}   = r_{i+M}
    ...                                    (7)
    t_{p-1,i} = r_{i+M(p-1)}.

This method works only if we know M in advance or can at least safely estimate an upper bound for M. To apply block splitting it is necessary to jump from the ith random number to the (i+M)th number without calculating the numbers in-between, which cannot be done efficiently for many PRNGs. A potential disadvantage of this method is that long range correlations, usually not observed in sequential simulations, may become short range correlations between substreams [26, 27]. Block splitting is illustrated in Figure 1.

Leapfrog: The leapfrog method distributes a sequence (r) of random numbers over p processes by decimating this base sequence such that

    t_{0,i}   = r_{pi}
    t_{1,i}   = r_{pi+1}
    ...                                    (8)
    t_{p-1,i} = r_{pi+(p-1)}.

Leapfrogging is illustrated in Figure 2. It is the most versatile and robust method for parallelization and it does not require an a priori estimate of how many random numbers will be consumed by each processor. An efficient implementation requires a PRNG that can be modified to generate directly only every kth element of the original sequence. Again this excludes many popular PRNGs.

Note that for a periodic sequence (r) the substreams derived from block splitting are cyclic shifts of the original sequence (r), and for p not dividing the period of (r), the leapfrog sequences are cyclic shifts of each other. Hence the leapfrog method is equivalent to block splitting on a different base sequence.
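The following C++ fragment shows, purely schematically, how block splitting (7) and leapfrogging (8) carve substreams out of a common base sequence. The base generator is a toy LCG chosen only for illustration; it is not one of the TRNG generators, and a production implementation would jump ahead with the matrix powers of section 3.3 instead of discarding numbers one by one.

```cpp
#include <cstdint>
#include <iostream>

// Toy multiplicative LCG used only to have some base sequence (r).
struct toy_lcg {
  std::uint64_t state = 1;
  std::uint64_t operator()() {
    state = (279470273ULL * state) % 2147483647ULL;
    return state;
  }
};

int main() {
  const int p = 4;        // number of processes
  const int M = 1000;     // block length for block splitting

  // Block splitting: process j uses r_{jM}, r_{jM+1}, ...
  for (int j = 0; j < p; ++j) {
    toy_lcg g;
    for (int skip = 0; skip < j * M; ++skip) g();      // naive jump ahead
    std::cout << "block stream " << j << " starts with " << g() << '\n';
  }

  // Leapfrogging: process j uses r_j, r_{j+p}, r_{j+2p}, ...
  for (int j = 0; j < p; ++j) {
    toy_lcg g;
    for (int skip = 0; skip < j; ++skip) g();          // advance to offset j
    std::cout << "leapfrog stream " << j << ":";
    for (int i = 0; i < 3; ++i) {
      std::cout << ' ' << g();
      for (int skip = 0; skip < p - 1; ++skip) g();    // discard p-1 numbers
    }
    std::cout << '\n';
  }
}
```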
Thereafter· each process can skip Theorem 5 Let r be a LFSR sequence based on a primitive p 1 lattices, i. e., (p 1) N PRNs and continue with the next ( ) − − · polynomial of degree n with period mn 1 (pseudonoise se- lattice. F − Organizing simulation algorithms such that they play fair is quence) over m, and let (t) be the decimated sequence with not always as easy as in the above example, but with a little lag p > 0 and offset j, e.g., effort one can achieve fair play in more complicated situations, t r (12) too. This may require the combination of block splitting and j,i = pi+ j . the leapfrog method, or iterated leapfrogging. Sometimes it Then t is a LFSR sequence based on a primitive polynomial is also necessary to use more than one stream of PRNs per ( j) of degree n, too, if and only if p and mn 1 are coprime, e.g., process, e.g., in the Swendsen Wang cluster algorithm one greatest common divisor gcd mn 1 p − 1. In addition, r may use one PRNG to construct the bond percolation clusters ( , )= ( ) and t are not just cyclic shifts of− each other, except when and another PRNG to decide to flip the cluster. ( j) p = mh (13)

3.3. Parallelization of LFSR sequences

As a matter of fact, LFSR sequences do support leapfrog and block splitting very well. Block splitting means basically jumping ahead in a PRN sequence. In the case of LFSR sequences this can be done quite efficiently. Note that by introducing a companion matrix A the linear recurrence (3) can be written as a vector-matrix product [28],

    (r_{i-(n-1)}, ..., r_{i-1}, r_i)^T = A · (r_{i-n}, ..., r_{i-2}, r_{i-1})^T mod m,    (9)

where A is the n×n companion matrix

    A = ( 0     1        ...  0
          ...   ...      ...  ...
          0     0        ...  1
          a_n   a_{n-1}  ...  a_1 ).

From this formula it follows immediately that the M-fold successive iteration of (3) may be written as

    (r_{i-(n-1)}, ..., r_{i-1}, r_i)^T = A^M · (r_{i-M-(n-1)}, ..., r_{i-M-1}, r_{i-M})^T mod m.    (10)

Matrix exponentiation can be accomplished in O(n^3 ln M) steps via binary exponentiation (also known as exponentiation by squaring).
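The sketch below illustrates the jump-ahead of Eq. (10): the companion matrix A is raised to the Mth power modulo m by binary exponentiation. It is a generic illustration rather than TRNG's internal code; the helper names are ours.

```cpp
#include <cstdint>
#include <vector>

using matrix = std::vector<std::vector<std::uint64_t>>;

// c = a*b mod m; entries are assumed to be < m < 2^31, so products fit in 64 bit.
matrix mat_mul(const matrix& a, const matrix& b, std::uint64_t m) {
  std::size_t n = a.size();
  matrix c(n, std::vector<std::uint64_t>(n, 0));
  for (std::size_t i = 0; i < n; ++i)
    for (std::size_t k = 0; k < n; ++k)
      for (std::size_t j = 0; j < n; ++j)
        c[i][j] = (c[i][j] + a[i][k] * b[k][j]) % m;
  return c;
}

// a^e mod m via binary exponentiation, O(n^3 log e) word operations.
matrix mat_pow(matrix a, std::uint64_t e, std::uint64_t m) {
  std::size_t n = a.size();
  matrix r(n, std::vector<std::uint64_t>(n, 0));
  for (std::size_t i = 0; i < n; ++i) r[i][i] = 1;   // identity matrix
  while (e > 0) {
    if (e & 1) r = mat_mul(r, a, m);
    a = mat_mul(a, a, m);
    e >>= 1;
  }
  return r;
}

// Build the companion matrix A of Eq. (9) from the coefficients (a_1, ..., a_n).
matrix companion_matrix(const std::vector<std::uint64_t>& coeff) {
  std::size_t n = coeff.size();
  matrix a(n, std::vector<std::uint64_t>(n, 0));
  for (std::size_t k = 0; k + 1 < n; ++k) a[k][k + 1] = 1;
  for (std::size_t j = 0; j < n; ++j) a[n - 1][j] = coeff[n - 1 - j];  // (a_n, ..., a_1)
  return a;
}
```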

Implementing leapfrogging efficiently is less straightforward. Calculating t_{j,i} = r_{pi+j} via

    (r_{pi+j-(n-1)}, ..., r_{pi+j-1}, r_{pi+j})^T = A^p · (r_{p(i-1)+j-(n-1)}, ..., r_{p(i-1)+j-1}, r_{p(i-1)+j})^T mod m    (11)

is no option, because A^p is usually a dense matrix, in which case calculating a new element from the leapfrog sequence requires O(n^2) operations instead of O(n) operations in the base sequence.

The following theorem assures that the leapfrog subsequences of LFSR sequences are again LFSR sequences [11]. This will provide us with a very efficient way to generate leapfrog sequences.

Theorem 5. Let (r) be a LFSR sequence based on a primitive polynomial of degree n with period m^n - 1 (pseudonoise sequence) over F_m, and let (t_j) be the decimated sequence with lag p > 0 and offset j, i.e.,

    t_{j,i} = r_{pi+j}.    (12)

Then (t_j) is a LFSR sequence based on a primitive polynomial of degree n, too, if and only if p and m^n - 1 are coprime, i.e., the greatest common divisor gcd(m^n - 1, p) = 1. In addition, (r) and (t_j) are not just cyclic shifts of each other, except when

    p = m^h    (13)

for some 0 ≤ h < n. If gcd(m^n - 1, p) > 1 the sequence (t_j) is still a LFSR sequence, but not a pseudonoise sequence.

It is not hard to find prime numbers m such that m^n - 1 has very few (and large) prime factors. For such numbers, the leapfrog method yields pseudonoise sequences for any reasonable number of parallel streams, see also section 3.4 and appendix A.

While Theorem 5 ensures that leapfrog sequences are not just cyclic shifts of the base sequence (unless (13) holds), the leapfrog sequences are cyclic shifts of each other. This is true for general sequences, not just LFSR sequences. Consider an arbitrary sequence (r) with period T. If gcd(T, p) = 1, all leapfrog sequences (t_1), (t_2), ..., (t_p) are cyclic shifts of each other, i.e., for every pair of leapfrog sequences (t_{j_1}) and (t_{j_2}) of a common base sequence (r) with period T there is a constant s, such that t_{j_1,i} = t_{j_2,i+s} for all i, and s is at least ⌊T/p⌋. Furthermore, if gcd(T, p) = d > 1, the period of each leapfrog sequence equals T/d and there are d classes of leapfrog sequences. Within a class of leapfrog sequences there are p/d sequences, each sequence is just a cyclic shift of another and the size of the shift is at least ⌊T/p⌋.

Theorem 5 tells us that all leapfrog sequences of a LFSR sequence of degree n can be generated by another LFSR of degree n or less. The following theorem [11, Theorem 6.6.1] gives us a recipe to calculate the coefficients (b_1, b_2, ..., b_n) of the corresponding leapfrog feedback polynomial:

Theorem 6. Let (t) be a (periodic) LFSR sequence over the field F_m and f(x) its characteristic polynomial of degree n. Then the coefficients (b_1, b_2, ..., b_n) of f(x) can be computed from 2n successive elements of (t) by solving the linear system

    (t_{i+n}, t_{i+n+1}, ..., t_{i+2n-1})^T = T · (b_1, b_2, ..., b_n)^T    (14)

over F_m, where T is the n×n matrix whose kth row reads (t_{i+n+k-2}, t_{i+n+k-3}, ..., t_{i+k-1}).

Starting from the base sequence we determine 2n values of the sequence (t) by applying the leapfrog rule. Then we solve (14) by Gaussian elimination to get the characteristic polynomial for a new LFSR generator that yields the elements of the leapfrog sequence directly with each call. If the matrix in (14) is singular, the linear system has more than one solution, and it is sufficient to pick one of them. In this case it is always possible to generate the leapfrog sequence by a LFSR of degree less than the degree of the original sequence.

3.4. Choice of modulus

LFSR sequences over F_2 with sparse feedback polynomials are popular sources of PRNs [5, 29, 30] and generators of this type can be found in various software libraries. This is due to the fact that multiplication in F_2 is trivial, addition reduces to exclusive-or and the mod-operation comes for free. As a result, generators that operate in F_2 are extremely fast. Unfortunately, these generators suffer from serious statistical defects [30-33] that can be blamed quite generally on the small size of the underlying field [34], see also section 6.1. In parallel applications we have the additional drawback that, if the leapfrog method is applied to a LFSR sequence with sparse characteristic polynomial, the new sequence will have a dense polynomial. The computational complexity of generating values of the LFSR sequence grows from O(1) to O(n). Remember that for generators in F_2, n is typically of order 1000 or even larger to get a long period 2^n - 1.

The theorems and parallelization techniques we have presented so far do apply to LFSR sequences over any finite field F_m. Therefore we are free to choose the prime modulus m. In order to get maximum entropy on the macrostate level [35], m should be as large as possible. A good choice is to set m to a value that is of the order of the largest representable integer of the computer. If the computer deals with e-bit registers, we may write the modulus as m = 2^e - k, with k reasonably small. In fact if k(k+2) ≤ m, modular reduction can be done reasonably fast by a few bit-shifts, additions, and multiplications, see section 5.1. Furthermore a large modulus allows us to restrict the degree of the LFSR to rather small values, e.g., n ≈ 4, while the PRNG has a large period and good statistical properties.

In accordance with Theorem 5 a leapfrog sequence of a pseudonoise sequence is a pseudonoise sequence, too, if and only if its period m^n - 1 and the lag p are coprime. For that reason m^n - 1 should have a small number of prime factors [36]. It can be shown that m^n - 1 has at least three prime factors and if the number of prime factors does not exceed three, then m is necessarily a Sophie-Germain prime and n a prime larger than two, see appendix A for details.

To sum up, the modulus m of a LFSR sequence should be a Sophie-Germain prime, such that m^n - 1 has no more than three prime factors and such that m = 2^e - k and k(k+2) ≤ m for some integers e and k.

4. Delinearization

LFSR sequences over prime fields with a large prime modulus seem to be ideally suited as PRNGs, especially in parallel environments. They have, however, a well known weakness. When used to sample coordinates in d-dimensional space, pseudonoise sequences cover every point for d < n, and every point except (0, 0, ..., 0) for d = n. For d > n the set of positions generated is obviously sparse, and the linearity of the production rule (3) leads to the concentration of the sampling points on n-dimensional hyperplanes [37, 38], see also the top of Figure 3. This phenomenon, first noticed by Marsaglia in 1968 [4], motivates one of the well known tests for PRNGs, the so-called spectral test [5]. The spectral test checks the behavior of a generator when its outputs are used to form d-tuples. LFSR sequences can fail the spectral test in dimensions larger than the register size n. Closely related to this mechanism are the observed correlations in other empirical tests such as the birthday spacings test and the collision test [39, 40]. Nonlinear generators do quite well in all these tests, but compared to LFSR sequences they have much less nice and provable properties and they are not suited for fair playing parallelization.

To get the best of both worlds we propose a delinearization that preserves all the nice properties of linear pseudonoise sequences:

Theorem 7. Let (q) be a pseudonoise sequence in F_m, and let g be a generating element of the multiplicative group F_m^*. Then the sequence (r) with

    r_i = { g^{q_i} mod m   if q_i > 0
          { 0               if q_i = 0    (15)

is a pseudonoise sequence, too.

The proof of this theorem is trivial: since g is a generator of F_m^*, the map (15) is bijective. We call delinearized generators based on Theorem 7 YARN generators (yet another random number).

The linearity is completely destroyed by the map (15), see Figure 3. Let L_{(r)}(l) denote the linear complexity of the subsequence (r_1, r_2, ..., r_l). This function is known as the linear complexity profile of (r). For a truly random sequence it grows on average like l/2. Figure 4 shows the linear complexity profile L_{(r)}(l) of a typical YARN sequence. It shows the same growth rate as a truly random sequence up to the point where more than 99% of the period has been considered. Sharing the linear complexity profile with a truly random sequence, we may say that the YARN generator is as nonlinear as it can get.


Figure 4: Linear complexity profile L_{(r)}(l) of a YARN sequence, produced by the recurrence q_i = 173·q_{i-1} + 219·q_{i-2} mod 317 and r_i = 151^{q_i} mod 317. The period of this sequence equals T = 317^2 - 1.
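The small YARN generator quoted in the caption of Figure 4 can be written down in a few lines. The sketch below implements Eq. (15) with those parameters and plain binary exponentiation; a production implementation would use the table lookup of section 5.2 instead.

```cpp
#include <cstdint>
#include <iostream>

// g^e mod m by binary exponentiation.
std::uint64_t pow_mod(std::uint64_t g, std::uint64_t e, std::uint64_t m) {
  std::uint64_t result = 1;
  g %= m;
  while (e > 0) {
    if (e & 1) result = (result * g) % m;
    g = (g * g) % m;
    e >>= 1;
  }
  return result;
}

int main() {
  const std::uint64_t m = 317, g = 151;       // parameters from Figure 4
  std::uint64_t q1 = 1, q2 = 1;               // q_{i-1}, q_{i-2}; arbitrary seed
  for (int i = 0; i < 10; ++i) {
    std::uint64_t q = (173 * q1 + 219 * q2) % m;        // linear (LFSR) part
    std::uint64_t r = (q > 0) ? pow_mod(g, q, m) : 0;   // delinearization, Eq. (15)
    q2 = q1;  q1 = q;
    std::cout << r << '\n';
  }
}
```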

Figure 3: Exponentiation of a generating element in a prime field is an effective way to destroy the linear structures of LFSR sequences. Both pictures show the full period of the generator, plotting the pairs (r_i/1999, r_{i+1}/1999). Top: r_i = 95·r_{i-1} mod 1999. Bottom: r_i = 1099^{q_i} mod 1999 with q_i = 95·q_{i-1} mod 1999.

5. Implementation and efficiency

LFSR sequences over prime fields larger than F_2 have been proposed as PRNGs in the literature [5, 28, 37]. An efficient implementation of these recurrences requires some care, however.

We assume that all integer arithmetic is done in w-bit registers and m < 2^{w-1}. Under this condition addition modulo m can be done without overflow problems. But multiplying two (w-1)-bit integers modulo m is not straightforward because the intermediate product has 2(w-1) significant bits and cannot be stored in a w-bit register. For the special case a_k < √m Schrage [41] showed how to calculate a_k · r_{i-k} mod m without overflow. Based on this technique a portable implementation of LFSR sequences with coefficients a_k < √m is presented in [10]. For parallel PRNGs these methods do not apply because the leapfrog method may yield coefficients that violate this condition. Knuth [5, section 3.2.1.1] proposed a generalization of Schrage's method for arbitrary positive factors less than m, but this method requires up to twelve multiplications and divisions and is therefore not very efficient.

The only way to implement (3) without additional measures to circumvent overflow problems is to restrict m to m < 2^{w/2}. On machines with 32-bit registers, 16 random bits per number is not enough for some applications. Fortunately today's C compilers provide fast 64-bit arithmetic even on 32-bit CPUs, and genuine 64-bit CPUs become more and more common. This allows us to increase m to values close to 2^31.

5.1. Efficient modular reduction

Since the modulo operation in (3) is usually slower than other integer operations such as addition, multiplication, boolean operations or shifting, it has a significant impact on the total performance of PRNGs based on LFSR sequences. If the modulus is a Mersenne prime m = 2^e - 1, however, the modulo operation can be done using only a few additions, boolean operations and shift operations [42].

A summand s = a_k · r_{i-k} in (3) will never exceed (m-1)^2 = (2^e - 2)^2, and for each positive integer s ∈ [0, (2^e - 1)^2] there is a unique decomposition of s into

    s = r · 2^e + q   with 0 ≤ q < 2^e.    (16)

Table I: Generators of the TRNG library. All PRNGs denoted by trng::mrgn{s} are pseudonoise sequences over F_m with n feedback terms, trng::yarnn{s} denotes its delinearized counterpart. The prime modulus 2^31 - 1 is a Mersenne prime, whereas all other moduli are Sophie-Germain primes.

    name                         m               period
    trng::mrg2,  trng::yarn2     2^31 - 1        ≈ 2^62
    trng::mrg3,  trng::yarn3     2^31 - 1        ≈ 2^93
    trng::mrg3s, trng::yarn3s    2^31 - 21069    ≈ 2^93
    trng::mrg4,  trng::yarn4     2^31 - 1        ≈ 2^124
    trng::mrg5,  trng::yarn5     2^31 - 1        ≈ 2^155
    trng::mrg5s, trng::yarn5s    2^31 - 22641    ≈ 2^155

From this decomposition we conclude

    s - r·2^e = q
    s - r(2^e - 1) = q + r
    s mod (2^e - 1) = (q + r) mod (2^e - 1),

and r and q are bounded from above by

    q < 2^e   and   r ≤ ⌊(2^e - 2)^2 / 2^e⌋ < 2^e - 2,

respectively, and therefore

    q + r < 2^e + 2^e - 2 = 2m.

So if m = 2^e - 1 and s ≤ (m-1)^2, x = s mod m can be calculated solely by shift operations, boolean operations and addition, viz

    x = (s mod 2^e) + ⌊s/2^e⌋.    (17)

If (17) yields a value x ≥ m we simply subtract m.
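A sketch of the reduction (17) for the Mersenne prime m = 2^31 - 1 is shown below. It is a generic illustration of the idea, not a copy of TRNG's internal routine.

```cpp
#include <cstdint>
#include <cassert>

constexpr std::uint64_t e = 31;
constexpr std::uint64_t m = (1ULL << e) - 1;     // 2^31 - 1, a Mersenne prime

// Reduction of Eq. (17); valid for s <= (m - 1)^2.
std::uint64_t mod_mersenne(std::uint64_t s) {
  std::uint64_t x = (s & m) + (s >> e);          // q + r, which is less than 2m
  if (x >= m) x -= m;                            // single conditional subtraction
  return x;
}

int main() {
  std::uint64_t a = 1403580, r = 2096730329;     // arbitrary test values below m
  assert(mod_mersenne(a * r) == (a * r) % m);    // agrees with the plain % operator
}
```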

From a computational point of view Mersenne prime moduli are optimal and we propose to choose the modulus m = 2^31 - 1. This is the largest positive integer that can be represented by a signed 32-bit integer variable, and it is also a Mersenne prime. On the other hand our theoretical considerations favor Sophie-Germain prime moduli, for which (17) does not apply directly. But one can generalize (17) to moduli 2^e - k [43]. Again we start from a decomposition of s into

    s = r · 2^e + q   with 0 ≤ q < 2^e,    (18)

and conclude

    s - r·2^e = q
    s - r(2^e - k) = q + kr
    s mod (2^e - k) = (q + kr) mod (2^e - k).

The sum s' = q + kr exceeds the modulus at most by a factor k+1, because by applying

    q < 2^e   and   r ≤ ⌊(2^e - k - 1)^2 / 2^e⌋ < 2^e - k - 1,

we get the bound

    q + kr < 2^e + k(2^e - k - 1) = (k+1)m.

In addition, by the decomposition of s' = q + kr

    s' = r' · 2^e + q'   with 0 ≤ q' < 2^e,

it follows

    s mod (2^e - k) = s' mod (2^e - k) = (q' + kr') mod (2^e - k),

and this time the bounds

    q' < 2^e   and   r' ≤ ⌊(k+1)(2^e - k)/2^e⌋ < k + 1

and

    q' + kr' < 2^e + k(k+1) = m + k(k+2)

hold. Therefore if m = 2^e - k, s ≤ (m-k)^2 and k(k+2) ≤ m, x = s mod m can be calculated solely by shift operations, boolean operations and addition, viz

    s' = (s mod 2^e) + k⌊s/2^e⌋
    x  = (s' mod 2^e) + k⌊s'/2^e⌋.    (19)

If (19) yields a value x ≥ m, a single subtraction of m will complete the modular reduction. To carry out (19) twice as many operations as for (17) are needed. But (19) applies for all moduli m = 2^e - k with k(k+2) ≤ m. Note that on systems with big enough word size (such as 64-bit architectures), just doing the mod-operation might well be faster than using (19).

5.2. Fast delinearization

YARN generators hide linear structures of LFSR sequences by raising a generating element g to the power q_i: g^{q_i} mod m. This can be done efficiently by binary exponentiation, which takes O(log m) steps. But considering LFSR sequences with only a few feedback taps (n ≤ 6) and m ≈ 2^31, even fast exponentiation is significantly more expensive than a single iteration of (3). Therefore we propose to implement exponentiation by table lookup. If m is a 2e'-bit number we apply the decomposition

    q_i = q_{i,1} · 2^{e'} + q_{i,0}   with   q_{i,1} = ⌊q_i / 2^{e'}⌋,  q_{i,0} = q_i mod 2^{e'},    (20)

and use the identity

    r_i = g^{q_i} mod m = (g^{2^{e'}})^{q_{i,1}} · g^{q_{i,0}} mod m    (21)

to calculate g^{q_i} mod m by two table lookups and one multiplication modulo m. If m < 2^31 the tables for (g^{2^{e'}})^{q_{i,1}} mod m and g^{q_{i,0}} mod m have 2^16 and 2^15 entries, respectively. These 384 kbytes of data easily fit into the cache of modern CPUs.
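The following class sketches the table lookup of Eqs. (20) and (21): one table holds g^q for the low part of the exponent, a second table holds (g^{2^E})^q for the high part, and a single modular multiplication combines them. The class name, the split position E and the way both tables are sized equally are our simplifications; this is not TRNG's implementation.

```cpp
#include <cstdint>
#include <vector>

class pow_table {
public:
  // Precompute lookup tables for exponents q < 2^(2E).
  pow_table(std::uint64_t g, std::uint64_t m, unsigned E)
      : m_(m), E_(E), mask_((1ULL << E) - 1),
        low_(1ULL << E), high_(1ULL << E) {
    low_[0] = 1;
    for (std::uint64_t q = 1; q < low_.size(); ++q)
      low_[q] = (low_[q - 1] * g) % m_;                    // g^q mod m
    std::uint64_t g_hi = (low_[low_.size() - 1] * g) % m_; // g^(2^E) mod m
    high_[0] = 1;
    for (std::uint64_t q = 1; q < high_.size(); ++q)
      high_[q] = (high_[q - 1] * g_hi) % m_;               // (g^(2^E))^q mod m
  }
  // Returns g^q mod m by two lookups and one multiplication, Eq. (21).
  std::uint64_t operator()(std::uint64_t q) const {
    return (high_[q >> E_] * low_[q & mask_]) % m_;
  }
private:
  std::uint64_t m_;
  unsigned E_;
  std::uint64_t mask_;
  std::vector<std::uint64_t> low_, high_;
};

// Usage sketch: pow_table yarn_pow(151, 317, 5);  yarn_pow(q) == 151^q mod 317.
```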

Table II: Performance of various PRNGs from the TRNG library version 4.0 [44] and the GNU Scientific Library (GSL) version 1.8 [45]. The test program was compiled using the Intel C++ compiler version 9.1 with high optimization (option -O3) and was executed on a Pentium IV 3 GHz.

    TRNG generator    PRNs per sec.      GSL generator     PRNs per sec.
    trng::mrg2        48 · 10^6          mt19937 (a)       41 · 10^6
    trng::mrg3        43 · 10^6          r250 (b)          85 · 10^6
    trng::mrg3s       36 · 10^6          gfsr4 (c)         77 · 10^6
    trng::mrg4        38 · 10^6          ran2 (d)          27 · 10^6
    trng::mrg5        38 · 10^6          ranlux (e)         8 · 10^6
    trng::mrg5s       23 · 10^6          ranlux389 (f)      5 · 10^6
    trng::yarn2       26 · 10^6
    trng::yarn3       23 · 10^6
    trng::yarn3s      16 · 10^6
    trng::yarn4       20 · 10^6
    trng::yarn5       22 · 10^6
    trng::yarn5s      13 · 10^6

    (a) Mersenne Twister [14]
    (b) shift-register sequence over F_2, r_i = r_{i-103} ⊕ r_{i-250} [29], see also section 6.1
    (c) shift-register sequence over F_2, r_i = r_{i-471} ⊕ r_{i-1586} ⊕ r_{i-6988} ⊕ r_{i-9689} [30]
    (d) Numerical Recipes generator ran2 [46]
    (e) RANLUX generator with default luxury level [47]
    (f) RANLUX generator with highest luxury level [47]

5.3. TRNG

LFSR sequences and their nonlinear counterparts have been implemented in the TRNG software package, a portable and highly optimized library of parallelizable PRNGs. Parallelization by block splitting as well as by leapfrogging are supported by all generators. TRNG is publicly available for download [44]. Its design is based on a proposal for the next revision of the ISO C++ standard [48]. TRNG uses 64-bit arithmetic, fast modular reduction (17) and (19), and exponentiation by table lookup (21) to implement PRNGs based on LFSR sequences over prime fields, with Mersenne or Sophie-Germain prime modulus. Table I describes the generators of the software package.

Table II shows some benchmark results. Apparently the performance of both types of PRNGs competes well with popular PRNGs such as the Mersenne Twister or RANLUX. Absolute as well as relative timings in Table II depend on compiler and CPU architecture, but the table gives a rough performance measure of our PRNGs.

6. Quality

It is not possible to prove that a given PRNG will work well in any simulation, but one can subject a PRNG to a battery of tests that mimic typical applications. One distinguishes empirical and theoretical test procedures. Empirical tests take a finite sequence of PRNs and compute certain statistics to judge the generator as "random" or not. While empirical tests focus only on the statistical properties of a finite stream of PRNs and ignore all the details of the underlying PRNG algorithm, theoretical tests analyze the PRNG algorithm itself by number-theoretic methods and establish a priori characteristics of the PRN sequence. These a priori characteristics may be used to choose good parameter sets for certain classes of PRNGs. The parameters of the LFSR sequences used in TRNG, for example, were found by extensive computer search [38] to give good results in the spectral test [5].

In this section we apply statistical tests to the generators shown in Table I to investigate their statistical properties and those of their leapfrog substreams. The test procedures are motivated by problems that occur frequently in physics, and that are known as sensitive tests for PRNGs, namely the cluster Monte Carlo simulation of the two-dimensional Ising model and random walks. Furthermore we analyze the effect of the exponentiation step of the YARN generator in the birthday spacings test. Results of additional empirical tests are presented in the documentation of the software library TRNG [44].

6.1. Cluster Monte Carlo simulations

In [31] it is reported that some "high quality" generators produce systematically wrong results in a simulation of the two-dimensional Ising model at the critical temperature of the infinite system by the Wolff cluster flipping algorithm [49]. Deviations are due to the bad mixing properties of the generators (low macrostate entropy) [35]. An infamous example for such a bad PRNG is the r250 generator [29]. It combines e LFSR sequences over F_2 to produce a stream of e-bit integers via

    r_i = r_{i-103} ⊕ r_{i-250},    (22)

where ⊕ denotes the bitwise exclusive-or operation (addition in F_2).

Figure 5 shows the results of cluster Monte Carlo simulations of the Ising model on a 16×16 square lattice with cyclic boundary conditions for the generator r250 and its leapfrog substreams. Each simulation was done ten times at the critical temperature applying the Wolff cluster algorithm. The mean of the internal energy E_MC, the specific heat c_MC, and the empirical standard deviation of both quantities have been measured. The gray scale in Figure 5 codes the deviation from the exact values [50]. Black bars indicate deviations larger than four standard deviations from the exact values. The original r250 sequence and its leapfrog sequences yield bad results if the number of processes is a power of two. Since the statistical error decreases as the number of Monte Carlo updates increases, these systematic deviations become more and more significant and a PRNG with bad statistical properties will reveal its defects. Figure 5 visualizes the inverse relation between quantity and quality of streams of PRNs mentioned in the introduction.

Figure 5: Results of Monte Carlo simulations of the two-dimensional Ising model at the critical temperature using the Wolff cluster flipping algorithm and the generator r250 (top) and trng::mrg3s (bottom). Only each pth random number of the sequence was used during the simulations. See the text for details.

Figure 6: The number of sites visited by N random walkers after t time steps is S_N(t) ∼ t^γ. The plot shows the exponent γ against time for generator trng::yarn3.

Figure 7: Results of the S_N test for generator trng::mrg5; corresponding results for other generators in Table I look similar. See the text for details.

The apparent dependency of the quality on the leapfrog parameter p is easily understood: If p is a power of two, the leapfrog sequence is just a shifted version of the original sequence [15, 30]. On the other hand, if p is not a power of two, the leapfrog sequence corresponds to a LFSR sequence with a denser characteristic polynomial. For p = 3, e.g., the recursion reads

    r_i = r_{i-103} ⊕ r_{i-152} ⊕ r_{i-201} ⊕ r_{i-250},

and for p = 7,

    r_i = r_{i-103} ⊕ r_{i-124} ⊕ r_{i-145} ⊕ r_{i-166} ⊕ r_{i-187} ⊕ r_{i-208} ⊕ r_{i-229} ⊕ r_{i-250},

respectively. The quality of a LFSR generator usually increases with the number of nonzero coefficients in its characteristic polynomial [9, 51, 52], a phenomenon that can be easily explained theoretically [34], and that also holds for LFSR sequences over general prime fields [35].

While LFSR sequences over F_2 with sparse characteristic polynomial are known to fail in cluster Monte Carlo simulations, LFSR sequences over large prime fields with dense characteristic polynomial perform very well in these tests. This can be seen in the lower half of Figure 5 for the trng::mrg3s generator; other generators from Table I perform likewise.

6.2. Random walks

The simulation of the two-dimensional Ising model in the previous section measured the quality within distinct substreams. Potential interstream correlations are not taken into account. In [53] some tests are proposed that are designed to detect correlations between substreams. One of them is the so-called S_N test.

For the S_N test N random walkers are considered. They move simultaneously and independently on the one-dimensional lattice, i.e., on the set of all integers. At each time step t each walker moves one step to the left or to the right with equal probability. For large times t the expected number S_N(t) of sites that have been visited by at least one walker reaches the asymptotic form S_N(t) ∼ t^γ with γ = 1/2. In the S_N test the exponent γ is measured and compared to the asymptotic value γ = 1/2.

Let p be the number of parallel streams or walkers. The motion of each random walker is determined by a leapfrog stream t_{j,i} based on the generators from Table I.

The exponent γ is determined for up to 16 simultaneous random walkers as a function of time t. After about 1000 time steps the exponent has converged and fluctuates around its mean value.

Figure 6 shows some typical plots of the exponent γ against time t for generator trng::yarn3. The exponent γ was determined by averaging over 10^8 random walks. Results for the other generators in Table I look similar and are omitted here. Figure 7 presents the asymptotic values of γ as a function of the number of walkers for generator trng::mrg5.

The mean value of γ averaged over the interval 1500 < t < 2500 is plotted and the error bars indicate the maximum and the minimum value of γ in 1500 < t < 2500. Almost all numerical results are in good agreement with the theoretical prediction γ = 1/2, and for other generators we get qualitatively the same results. Therefore we conclude that there are no interstream correlations detectable by the S_N test.

6.3. Birthday spacings test

Figure 8: Due to their lattice structure LFSR generators fail the birthday spacings test. The nonlinear mapping of YARN sequences is an effective technique to destroy these lattice structures. Top: mean number ⟨R⟩ of equal spacings between M randomly chosen birthdays. For a good PRNG ⟨R⟩/λ = 1 is expected. Bottom: gray coded p-value of a chi-square test for the distribution of R.

The birthday spacings test is an empirical test procedure that was proposed by Marsaglia [5, 54]. It is specifically designed to detect the hyperplane structures of LFSR sequences [39]. For the birthday spacings test N days 0 ≤ d_i < M are uniformly and randomly chosen in a year of M days. After sorting these independent identically distributed random numbers into nondecreasing order b_(1), b_(2), ..., b_(N), the N spacings

    s_1 = b_(2) - b_(1),
    s_2 = b_(3) - b_(2),
    ...
    s_N = b_(1) + M - b_(N)

are defined. The birthday spacings test examines the distribution of these spacings by sorting them into nondecreasing order s_(1), s_(2), ..., s_(N) and counting the number R of equal spacings. More precisely, R is the number of indices j such that 1 ≤ j < N and s_(j) = s_(j+1). The test statistic R is a random number with distribution

    p_R(r) = [N^3/(4M)]^r / (e^{N^3/(4M)} r!) + O(1/N).    (23)

For N → ∞ and M → ∞ such that N^3/(4M) → λ = const., p_R(r) converges to a Poisson distribution with mean λ. After repeating the birthday spacings test several times one can apply a chi-square test to compare the empirical distribution of R to the correct distribution (23). If the p-value of the chi-square test is very close to one or zero, the birthday spacings test has to be considered failed [5].

In order to make the birthday spacings test sensitive to the hyperplane structure of linear sequences, the birthdays are arranged in a d-dimensional cube of length l, i.e., M = l^d. Each day b_i is determined by a d-tuple of consecutive PRNs (r_{di}, r_{di+1}, ..., r_{di+d-1}),

    b_i = Σ_{j=0}^{d-1} ⌊ l·r_{di+j} / m ⌋ · l^j.    (24)

This bijective mapping transforms points in the d-dimensional space [0, 1, ..., l-1]^d to the linear space [0, 1, ..., M-1], and points on regular hyperplanes are transformed into regular spacings between points in [0, 1, ..., M-1].
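A compact sketch of the procedure just described is given below: birthdays are formed from d-tuples of PRNs via Eq. (24), sorted, and the number R of equal spacings is counted. The generator type and the parameters N, l, d are illustrative; the routine assumes the generator returns integers below the modulus m.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Returns the statistic R for one realization of the birthday spacings test.
template <class Generator>
std::uint64_t count_equal_spacings(Generator& gen, std::uint64_t N,
                                   std::uint64_t l, unsigned d,
                                   std::uint64_t m) {
  std::uint64_t M = 1;                          // year length M = l^d
  for (unsigned j = 0; j < d; ++j) M *= l;
  std::vector<std::uint64_t> b(N);
  for (std::uint64_t i = 0; i < N; ++i) {       // Eq. (24): one birthday per d-tuple
    std::uint64_t day = 0, lj = 1;
    for (unsigned j = 0; j < d; ++j) {
      day += (l * gen() / m) * lj;              // floor(l*r/m) lies in {0, ..., l-1}
      lj *= l;
    }
    b[i] = day;
  }
  std::sort(b.begin(), b.end());
  std::vector<std::uint64_t> s(N);
  for (std::uint64_t i = 0; i + 1 < N; ++i) s[i] = b[i + 1] - b[i];
  s[N - 1] = b[0] + M - b[N - 1];               // wrap-around spacing
  std::sort(s.begin(), s.end());
  std::uint64_t R = 0;
  for (std::uint64_t i = 0; i + 1 < N; ++i)
    if (s[i] == s[i + 1]) ++R;                  // count equal adjacent spacings
  return R;
}
```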
To demonstrate the failure of LFSR sequences in the birthday spacings test we applied the test to the LFSR sequence

    r_i = 17384 · r_{i-1} + 12391 · r_{i-2} mod 65521    (25)

and its YARN counterpart

    r_i = { 20009^{q_i} mod 65521   if q_i > 0
          { 0                       if q_i = 0     with   q_i = 17384 · q_{i-1} + 12391 · q_{i-2} mod 65521,    (26)

with the test parameters λ = N^3/(4M) ≈ 1 and d = 6. Beyond a certain value of M the LFSR sequence starts to fail the birthday spacings test due to its hyperplane structure, see Figure 8. The hyperplane structure of LFSR sequences gives rise to birthday spacings that are much more regular than randomly chosen birthdays. As a consequence we observe a value ⟨R⟩ that is too large, see the top of Figure 8. The nonlinear transformation (15) in the YARN sequences destroys the hyperplane structure. In fact, even those YARN sequences whose underlying linear part fails the birthday spacings test pass the same test with flying colors (Figure 8). YARN generators start to fail the birthday spacings test only for trivial reasons, i.e., when M gets close to the period of the generator. For the experiment shown in Figure 8 we averaged over 5000 realizations of the M birthdays, and each birthday was determined by d = 6 random numbers. Then we expect that the YARN generator starts to fail if M · 6 · 5000 ≳ 65521^2 - 1, i.e., M ≳ 2^17.

Of course the period of the PRNGs (25) and (26) is artificially small, and the birthday spacings test was carried out with these particular PRNGs to make the effect of hyperplane structures and its elimination by delinearization as explicit as possible. "Industrially sized" LFSR and YARN generators like those in Table I share the same structural features, however. So the results for PRNGs (25) and (26) can be assumed to present generic results for LFSR and YARN sequences, respectively.

7. Conclusions

We have reviewed the problem of generating pseudorandom numbers in parallel environments. Theoretical and practical considerations suggest to focus on linear recurrences in prime number fields, also known as LFSR sequences. These sequences can be efficiently parallelized by block splitting and leapfrogging, and both methods enable Monte Carlo simulations to play fair, i.e., to yield results that are independent of the degree of parallelization. We also discussed theoretical and practical criteria for the choice of parameters in LFSR sequences. We then introduced YARN sequences, which are derived from LFSR sequences by a bijective nonlinear mapping. A YARN sequence inherits the pseudonoise properties from its underlying LFSR sequence, but its linear complexity is that of a true random sequence. YARN sequences share all the advantages of LFSR sequences, but they pass all tests that LFSR sequences tend to fail due to their low linear complexity.

All our results have been incorporated into TRNG, a publicly available software library of portable, parallelizable pseudorandom number generators. TRNG complies with the proposal for the next revision of the ISO C++ standard [48].

Acknowledgments

This work was sponsored by the European Community's FP6 Information Society Technologies programme under contract IST-001935, EVERGROW, and by the German Science Council DFG under grant ME2044/1-1.

A. Sophie-Germain prime moduli

The maximal period of a LFSR sequence (3) over the prime field F_m is given by m^n - 1. We are looking for primes m such that m^n - 1 has for a fixed n a small number of prime factors. For m > 2 the factor m - 1 is even and the smallest possible number of prime factors of m^n - 1 is three,

    m^n - 1 = 2 · (m-1)/2 · (1 + m + m^2 + ... + m^{n-1}).    (A1)

If (m-1)/2 is also prime, m is called a Sophie-Germain prime or safe prime. For m = 2 there exist values for n such that 2^n - 1 is also prime. Primes of the form 2^n - 1 are called Mersenne primes, e.g., 2^2 - 1 and 2^86243 - 1 are prime.

If n = kl is composite and m > 2 the period T is a product of at least four prime factors,

    m^{kl} - 1 = (m^k - 1) · [(m^k)^{l-1} + (m^k)^{l-2} + ... + 1]
               = (m - 1) · (m^{k-1} + m^{k-2} + ... + 1) · [(m^k)^{l-1} + (m^k)^{l-2} + ... + 1].    (A2)

Actually, the number of prime factors for composite n is even larger. The factorization of m^n - 1 is related to the factorization of the polynomial

    f_n(x) = x^n - 1    (A3)

over Z. This polynomial can be factored into cyclotomic polynomials Φ_k(x) [55],

    f_n(x) = x^n - 1 = ∏_{d | n} Φ_d(x).    (A4)

The coefficients of the cyclotomic polynomials are all in Z; so the number of prime factors of the factorization of m^n - 1 is bounded from below by the number of cyclotomic polynomials in the factorization of x^n - 1. This number equals the number of natural numbers that divide n. If n is prime then f_n(x) is just the product Φ_1(x) · Φ_n(x).

If m is a prime larger than two, from Φ_1(x) = x - 1 it follows that m^n - 1 can be factorized into at least three factors, namely

    m^n - 1 = 2 · (m-1)/2 · ∏_{d>1; d | n} Φ_d(m).    (A5)

The period of a maximal period LFSR sequence with linear complexity n over F_m is a product of exactly three factors, if and only if m is a Sophie-Germain prime, n is prime, and Φ_n(m) is prime also. Note that Φ_2(m) = m + 1 is never prime, if m is an odd prime. Let us investigate some special cases in more detail.

Case n = 2: The period m^2 - 1 is a product of 2^3 · 3 and at least two other factors. Using the sieve of Eratosthenes modulo 12 it can be shown that each large enough prime m can be written as m = 12·k + c, where k and c are integers such that gcd(12, c) = 1. Factoring the period m^2 - 1 = (12·k + c)^2 - 1 we find

    (12k + 1)^2 - 1  = 2^3 · 3 · k · (6k + 1)
    (12k + 5)^2 - 1  = 2^3 · 3 · (2k + 1) · (3k + 1)
    (12k + 7)^2 - 1  = 2^3 · 3 · (3k + 2) · (2k + 1)
    (12k + 11)^2 - 1 = 2^3 · 3 · (k + 1) · (6k + 5).    (A6)

Case n = 4: The period m^4 - 1 is a product of 2^4 · 3 · 5 and at least three other factors. Using the sieve of Eratosthenes modulo 60 it can be shown that each large enough prime m can be written as m = 60·k + c, where k and c are integers such that gcd(60, c) = 1. Factoring the period m^4 - 1 = (60·k + c)^4 - 1 we find

    (60k + 1)^4 - 1  = 2^4 · 3 · 5 · k · (30k + 1) · (1800k^2 + 60k + 1)
    (60k + 7)^4 - 1  = 2^4 · 3 · 5 · (10k + 1) · (15k + 2) · (360k^2 + 84k + 5)
    (60k + 11)^4 - 1 = 2^4 · 3 · 5 · (5k + 1) · (6k + 1) · (1800k^2 + 660k + 61)
    (60k + 13)^4 - 1 = 2^4 · 3 · 5 · (5k + 1) · (30k + 7) · (360k^2 + 156k + 17)    (A7)

and so on.

Case n = 6: This case is similar to the n = 2 and n = 4 cases. Applying the sieve of Eratosthenes modulo 84 it can be shown that m^6 - 1 is a product of 2^3 · 3^2 · 7 and at least four other factors.

Cases n = 3, n = 5, n = 7: Here n is prime, and therefore the period m^n - 1 has at least three factors. The number of factors of m^n - 1 will not exceed three, if m is a Sophie-Germain prime and Φ_n(m) is prime, too.

In Table III we present a collection of Sophie-Germain primes m, for which m^n - 1 has a minimal number of prime factors. For n prime an extended table can be found in [56]. If n is prime, its factorization can be found by (A4), and if n = 2 or n = 4 by (A6) and (A7), respectively. All these primes are good candidates for moduli of LFSR sequences as PRNGs in parallel applications. Note that the knowledge of the factorization of m^n - 1 is essential for an efficient test of the primitivity of characteristic polynomials.

Table III: A collection of Sophie-Germain primes m for which m^n - 1 has a minimal number of prime factors.

    n   m                        n   m
    1   2^31 - 525               5   2^31 - 46365
        2^31 - 69                    2^31 - 22641
        2^63 - 5781                  2^63 - 594981
        2^63 - 4569                  2^63 - 19581
    2   2^31 - 37485             6   2^31 - 4398621
        2^31 - 2085                  2^31 - 1120941
        2^63 - 927861                2^63 - 122358381
        2^63 - 156981                2^63 - 29342085
    3   2^31 - 43725             7   2^31 - 50949
        2^31 - 21069                 2^31 - 6489
        2^63 - 275025                2^63 - 92181
        2^63 - 21129                 2^63 - 52425
    4   2^31 - 305829
        2^31 - 119565
        2^63 - 3228621
        2^63 - 156981
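A minimal sketch of the basic check behind Table III, using the paper's definition that m is a Sophie-Germain (safe) prime when both m and (m-1)/2 are prime. Trial division is used only for illustration; for 63-bit candidates a probabilistic test such as Miller-Rabin would be used instead, and the full search also requires the factorization of m^n - 1 and a primitivity test as noted above.

```cpp
#include <cstdint>

bool is_prime(std::uint64_t x) {            // naive trial division, for illustration
  if (x < 2) return false;
  for (std::uint64_t d = 2; d * d <= x; ++d)
    if (x % d == 0) return false;
  return true;
}

bool is_safe_prime(std::uint64_t m) {       // m and (m-1)/2 both prime
  return (m > 2) && (m % 2 == 1) && is_prime(m) && is_prime((m - 1) / 2);
}

// Example: is_safe_prime((1ULL << 31) - 21069) should return true if
// 2^31 - 21069 is indeed one of the Sophie-Germain primes listed in Table III.
```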

[1] B. Hayes, American Scientist 89, 300 (2001).
[2] P. D. Coddington, NHSE Review (1996), http://nhse.cs.rice.edu/NHSEreview/RNG/.
[3] S. L. Anderson, SIAM Review 32, 221 (1990).
[4] G. Marsaglia, Proceedings of the National Academy of Sciences 61, 25 (1968).
[5] D. E. Knuth, The Art of Computer Programming, vol. 2: Seminumerical Algorithms (Addison Wesley Professional, 1998), 3rd ed.
[6] P. L'Ecuyer, in Handbook of Computational Statistics, edited by J. E. Gentle, W. Härdle, and Y. Mori (Springer, 2004), ISBN 3-540-40464-3.
[7] D. H. Lehmer, in Proc. 2nd Symposium on Large-Scale Digital Calculating Machinery, Cambridge, MA (Harvard University Press, 1951), pp. 141-146.
[8] D. E. Knuth, The Art of Computer Programming, vol. 2: Seminumerical Algorithms (Addison Wesley Professional, 1969), 1st ed.
[9] S. Tezuka, Uniform Random Numbers: Theory and Practice (Kluwer Academic Publishers, Boston, 1995).
[10] P. L'Ecuyer, F. Blouin, and R. Couture, ACM Transactions on Modeling and Computer Simulations 3, 87 (1993).
[11] D. Jungnickel, Finite Fields: Structure and Arithmetics (Bibliographisches Institut, 1993), ISBN 3-411-16111-6.
[12] J. Massey, IEEE Transactions on Information Theory 15, 122 (1969).
[13] A. J. Menezes, Handbook of Applied Cryptography, The CRC Press Series on Discrete Mathematics and its Applications (CRC Press Inc., 1996).
[14] M. Matsumoto and T. Nishimura, ACM Transactions on Modeling and Computer Simulation 8, 3 (1998).
[15] S. W. Golomb, Shift Register Sequences (Aegean Park Press, Laguna Hills, CA, 1982), revised ed.
[16] R. Lidl and H. Niederreiter, Introduction to finite fields and their applications (Cambridge University Press, 1994), 2nd ed.
[17] R. Lidl and H. Niederreiter, Finite Fields, vol. 20 of Encyclopedia of Mathematics and its Applications (Cambridge University Press, 1997), 2nd ed.

[18] J. Fillmore and M. Marx, SIAM Review 10, 342 (1968).
[19] N. Zierler, J. Soc. Indust. Appl. Math. 7, 31 (1959).
[20] Z.-X. Wan, Lectures on Finite Fields and Galois Rings (World Scientific, 2003).
[21] A. Compagner, American Journal of Physics 59, 700 (1991).
[22] A. Compagner, Journal of Statistical Physics 63, 883 (1991).
[23] O. E. Percus and M. H. Kalos, Journal of Parallel and Distributed Computing 6, 477 (1989).
[24] M. Mascagni, Parallel Computing 24, 923 (1998).
[25] A. D. Matteis and S. Pagnutti, Parallel Computing 13, 193 (1990).
[26] A. D. Matteis and S. Pagnutti, Parallel Computing 14, 207 (1990).
[27] J. Eichenauer-Herrmann and H. Grothe, Numerische Mathematik 56, 609 (1989).
[28] P. L'Ecuyer, Communications of the ACM 33, 85 (1990).
[29] S. Kirkpatrick and E. P. Stoll, Journal of Computational Physics 40, 517 (1981).
[30] R. M. Ziff, Computers in Physics 12, 385 (1998).
[31] A. M. Ferrenberg, D. P. Landau, and Y. J. Wong, Physical Review Letters 69, 3382 (1992).
[32] P. Grassberger, Physics Letters A 181, 43 (1993).
[33] L. N. Shchur, J. R. Heringa, and H. W. J. Blöte, Physica A 241, 579 (1997).
[34] H. Bauke and S. Mertens, Journal of Statistical Physics 114, 1149 (2004).
[35] S. Mertens and H. Bauke, Physical Review E 69, 055702(R) (2004).
[36] P. L'Ecuyer, Operations Research 47, 159 (1999).
[37] A. Grube, Zeitschrift für angewandte Mathematik und Mechanik 53, T223 (1973).
[38] P. L'Ecuyer, ACM Transactions on Modeling and Computer Simulation 3, 87 (1993).
[39] P. L'Ecuyer, in Proceedings of the 2001 Winter Simulation Conference, IEEE (IEEE Press, Washington, DC, 2001), pp. 95-105.
[40] P. L'Ecuyer and P. Hellekalek, in Random and Quasi-Random Point Sets (Springer, 1998), vol. 138 of Lecture Notes in Statistics, pp. 223-266.
[41] L. Schrage, ACM Transactions on Mathematical Software 5, 132 (1979).
[42] W. H. Payne, J. R. Rabung, and T. P. Bogyo, Communications of the ACM 12, 85 (1969).
[43] M. Mascagni and H. Chi, Parallel Computing 30, 1217 (2004).
[44] Tina's Random Number Generator Library, http://tina.nat.uni-magdeburg.de/TRNG/.
[45] GNU Scientific Library, http://www.gnu.org/software/gsl/.
[46] W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery, Numerical Recipes in C (Cambridge University Press, Cambridge, 1992), 2nd ed.
[47] M. Lüscher, Computer Physics Communications 79, 100 (1994).
[48] W. E. Brown, M. Fischler, J. Kowalkowski, and M. Paterno, Random Number Generation in C++0X: A Comprehensive Proposal, version 2 (2006), http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2006/n2032.pdf.
[49] U. Wolff, Physical Review Letters 62, 361 (1989).
[50] A. E. Ferdinand and M. E. Fisher, Physical Review 185, 832 (1969).
[51] D. Wang and A. Compagner, Mathematics of Computation 60, 363 (1993).
[52] M. Matsumoto and Y. Kurita, ACM Transactions on Modeling and Computer Simulation 6, 99 (1996).
[53] I. Vattulainen, Physical Review E 59, 7200 (1999).
[54] G. Marsaglia, in Computer Science and Statistics: Proceedings of the Symposium on the Interface, 16th, Atlanta, Georgia, March 1984, edited by L. Billard (1985).
[55] Z.-X. Wan, Lectures on Finite Fields and Galois Rings (World Scientific, Singapore, 2003).
[56] P. L'Ecuyer, Operations Research 47, 159 (1999).