Fast Splittable Pseudorandom Number Generators
Guy L. Steele Jr., Oracle Labs
Doug Lea, SUNY Oswego
Christine H. Flood, Red Hat Inc

Abstract

We describe a new algorithm SPLITMIX for an object-oriented and splittable pseudorandom number generator (PRNG) that is quite fast: 9 64-bit arithmetic/logical operations per 64 bits generated. A conventional linear PRNG object provides a generate method that returns one pseudorandom value and updates the state of the PRNG, but a splittable PRNG object also has a second operation, split, that replaces the original PRNG object with two (seemingly) independent PRNG objects, by creating and returning a new such object and updating the state of the original object. Splittable PRNG objects make it easy to organize the use of pseudorandom numbers in multithreaded programs structured using fork-join parallelism. No locking or synchronization is required (other than the usual memory fence immediately after object creation). Because the generate method has no loops or conditionals, it is suitable for SIMD or GPU implementation.

We derive SPLITMIX from the DOTMIX algorithm of Leiserson, Schardl, and Sukha by making a series of program transformations and engineering improvements. The end result is an object-oriented version of the purely functional API used in the Haskell library for over a decade, but SPLITMIX is faster and produces pseudorandom sequences of higher quality; it is also far superior in quality and speed to java.util.Random, and has been included in Java JDK8 as the class java.util.SplittableRandom.

We have tested the pseudorandom sequences produced by SPLITMIX using two standard statistical test suites (DieHarder and TestU01) and they appear to be adequate for “everyday” use, such as in Monte Carlo algorithms and randomized data structures where speed is important.

Categories and Subject Descriptors G.3 [Mathematics of Computing]: Random number generation; D.1.3 [Software]: Programming techniques - Concurrent programming

General Terms Algorithms, Performance

Keywords collections, determinism, Java, multithreading, nondeterminism, object-oriented, parallel computing, pedigree, pseudorandom, random number generator, recursive splitting, Scala, spliterator, splittable data structures, streams

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
OOPSLA ’14, October 20–24, 2014, Portland, OR, USA. Copyright © 2014 ACM 978-1-4503-2585-1/14/10. $15.00. http://dx.doi.org/10.1145/2660193.2660195

1. Introduction and Background

Many programming language environments provide a pseudorandom number generator (PRNG), a deterministic algorithm to generate a sequence of values that is likely to pass certain statistical tests for “randomness.” Such an algorithm is typically described as a finite-state machine defined by a transition function $\tau$, an output function $\mu$, and an initial state (or seed) $s_0$; the generated sequence is $z_j$ where $s_j = \tau(s_{j-1})$ and $z_j = \mu(s_j)$ for all $j \ge 1$. (An alternate formulation is $s_j = \tau(s_{j-1})$ and $z_j = \mu(s_{j-1})$ for all $j \ge 1$, which may be more efficient when implemented on a superscalar processor because it allows $\tau(s_{j-1})$ and $\mu(s_{j-1})$ to be computed in parallel.) Because the state is finite, the sequence of generated states, and therefore also the sequence of generated values, will necessarily repeat, falling into a cycle (possibly after some initial subsequence that is not repeated, but in practice PRNG algorithms are designed so as not to “waste state,” so that in fact any given initial state $s_0$ will recur). The length $L$ of the cycle is called the period of the PRNG, and $s_i = s_j$ iff $i \equiv j \pmod{L}$.

Characteristics of PRNG algorithms and their implementations that are of practical interest include period, speed, quality (ability to pass statistical tests), size of code, size of data (some algorithms require large tables or arrays), ease of jumping forward (skipping over intermediate states), partitionability or splittability (for use by multiple threads operating in parallel), reproducibility (the ability to run a program twice from a specified starting state and get exactly the same results, even if parallel computation is involved), and unpredictability (how difficult it is to predict the next output given preceding outputs). These characteristics trade off, and so different applications typically require different algorithms.

A linear congruential generator (LCG) [13, §3.2.1] has as its state a nonnegative integer less than some modulus $M$ (which is typically either a prime number or a power of 2); the transition function is $\tau(s) = (a \cdot s + c) \bmod M$ and the output function is just the identity function $\mu(s) = s$. The Unix library function rand48 uses this algorithm with $a = 25214903917$ = 0x5DEECE66Dull, $c = 11$ = 0xB, and $M = 2^{48}$. Ever since the original Java™ Language Specification [10] in 1995, the class java.util.Random has been specified to use this same algorithm (see Figure 1). This code is simple, concise, and adequate for its originally intended purpose: to supply pseudorandomly generated values to be used relatively infrequently by a smallish number of concurrent threads running within a set-top box or web browser. It has a number of drawbacks, however, that make it inappropriate for use in “serious” applications: (1) Its short period (consider that $2^{48}$ nanoseconds is less than 80 hours). (2) While the rand48 algorithm was considered to be of relatively high quality when introduced in 1982 [28], more recent tests [17] have uncovered significant flaws in its statistical behavior. (3) While it is thread-safe, thanks to its use of synchronization locks, it is not thread-efficient: if many threads share a single instance of Random, then contention for the lock may become a performance bottleneck. (4) On the other hand, if many instances of class Random are created, one for each thread, there is no guarantee that the values collectively generated by those many instances will be as statistically pseudorandom as if they had been generated by a single instance of class Random. (If 256 threads each have a Random object with its initial seed chosen at random, and then each thread generates $2^{32}$ values, it is more likely than not, thanks to the Birthday Paradox, that two of the generated sequences will have some overlap.)

    public class Random {
      protected long seed;
      public Random() {
        this(System.currentTimeMillis()); }
      public Random(long seed) { setSeed(seed); }
      synchronized public void setSeed(long seed) {
        this.seed = (seed ^ 0x5DEECE66DL)
                    & ((1L << 48) - 1); }
      synchronized protected int next(int bits) {
        seed = (seed * 0x5DEECE66DL + 0xBL)
               & ((1L << 48) - 1);
        return (int)(seed >>> (48 - bits)); }
      public int nextInt() { return next(32); }
      public long nextLong() {
        return ((long)next(32) << 32) + next(32); }
      public double nextDouble() {
        return (((long)next(26) << 27) + next(27))
               / (double)(1L << 53); }
    }

Figure 1. The essence of java.util.Random

Another kind of finite-state machine used for this purpose is the linear feedback shift register (LFSR), in which the state is a vector of bits contained in a shift register; the transition function shifts the register by one position, shifting in […]

    unsigned long xorwow() {
      static unsigned long
        x=123456789, y=362436069, z=521288629,
        w=88675123, v=5783321, d=6615241;
      unsigned long t=(x^(x>>2));
      x=y; y=z; z=w; w=v; v=(v^(v<<4))^(t^(t<<1));
      return (d+=362437)+v; }

The period of XORWOW is pretty good ($2^{32}(2^{160} - 1) = 2^{192} - 2^{32}$), and generation of one 32-bit value requires just 9 arithmetic operations: 3 shifts, 4 XORs, and 2 adds. (There are also a number of assignment operations, which could be optimized away in the context of a suitably unrolled loop.) Unfortunately, while XORWOW is also fairly fast, very recent testing [29] has also revealed statistical flaws.

There are other PRNG algorithms with much longer period that produce sequences of much higher quality, but they are slower. These include the Mersenne Twister [23], which is related to LFSR algorithms in relying on bit-shifting and bitwise operations, and MRG32k3a [9], which is related to LCG algorithms in relying on computation of linear formulae modulo prime numbers. There are also many PRNG algorithms, such as those in java.security.SecureRandom, that produce pseudorandom sequences of superb quality, suitable for use in cryptographic and security applications, but their algorithms are substantially slower.

The algorithms we have described so far are sequential (single-threaded), but there is a need for algorithms that are effective in a parallel (SIMD or multi-threaded) computing environment. One approach is to partition the cycle of an otherwise sequential PRNG algorithm. This is simple if there is a cheap way to “jump forward” $n$ steps from any given state $s$ by using some computational shortcut to compute $\tau^n(s)$. (For example, if $\tau(s) = (a \cdot s) \bmod M$ then $\tau^n(s) = ((a^n \bmod M) \cdot s) \bmod M$, and it may be possible to compute (or precompute) $a^n \bmod M$ cheaply.) This approach works well when the number of threads $N$ to be used is known at the start of a computation: if a PRNG has period $L$, then a global initial state $s_0$ is chosen and then the PRNG for thread $i$ is given initial state $\tau^{i \lfloor L/N \rfloor}(s_0)$. The MRG32k3a algorithm does have a good shortcut for jumping ahead (it requires precomputation of two 3×3 matrices), and there is a software package RngStream that uses such jumping ahead for partitioning its cycle into streams and substreams [19].
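
The jump-ahead shortcut for a multiplicative generator with transition (a · s) mod M, and the cycle partitioning built on it, can be illustrated with a minimal Java sketch. The parameters here are assumptions chosen for the illustration, not taken from the text: the Park-Miller "minimal standard" generator (a = 16807, M = 2^31 − 1, period M − 1), picked because every product fits in a 64-bit long. The class and method names (LcgJumpAhead, step, powMod, jump, partition) are hypothetical.

```java
// Illustrative sketch only: jump-ahead and cycle partitioning for a
// multiplicative LCG tau(s) = (A * s) mod M.
class LcgJumpAhead {
    // Assumed parameters (not from the text): Park-Miller "minimal standard".
    static final long M = 2147483647L;   // 2^31 - 1, a prime modulus
    static final long A = 16807L;        // a primitive root modulo M
    static final long PERIOD = M - 1;    // cycle length for any nonzero seed

    // One ordinary step of the transition function tau.
    static long step(long s) {
        return (A * s) % M;              // operands < 2^31, so no overflow
    }

    // Computes (base^exp) mod mod by repeated squaring, O(log exp) multiplies.
    static long powMod(long base, long exp, long mod) {
        long result = 1 % mod;
        base %= mod;
        while (exp > 0) {
            if ((exp & 1) == 1) {
                result = (result * base) % mod;
            }
            base = (base * base) % mod;
            exp >>>= 1;
        }
        return result;
    }

    // tau^n(s) = ((A^n mod M) * s) mod M, without visiting intermediate states.
    static long jump(long s, long n) {
        return (powMod(A, n, M) * s) % M;
    }

    // Gives thread i the initial state tau^(i * floor(PERIOD / threads))(s0).
    static long[] partition(long s0, int threads) {
        long stride = PERIOD / threads;
        long[] seeds = new long[threads];
        for (int i = 0; i < threads; i++) {
            seeds[i] = jump(s0, i * stride);
        }
        return seeds;
    }

    public static void main(String[] args) {
        long[] seeds = partition(1L, 4);
        for (int i = 0; i < seeds.length; i++) {
            System.out.println("thread " + i + " starts at state " + seeds[i]);
        }
    }
}
```

Each thread can then call step on its own state with no sharing or locking; the substreams do not overlap as long as no thread draws more than ⌊L/N⌋ values. As the text notes, this only works when the number of threads is known up front.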