A Comparison of Cpus, Gpus, Fpgas, and Massively Parallel Processor Arrays for Random Number Generation

A Comparison of CPUs, GPUs, FPGAs, and Massively Parallel Processor Arrays for Random Number Generation David B. Thomas Lee Howes Wayne Luk Imperial College London Imperial College London Imperial College London [email protected] [email protected] [email protected] ABSTRACT offered by parallel processing systems is still increasing with The future of high-performance computing is likely to rely each generation, through the introduction of multi-core ver- on the ability to efficiently exploit huge amounts of paral- sions of conventional CPUs, and through “unconventional” lelism. One way of taking advantage of this parallelism is computing platforms, such as FPGAs (Field Programmable to formulate problems as “embarrassingly parallel” Monte- Gate Arrays), GPUs (Graphics Processor Units), and MP- Carlo simulations, which allow applications to achieve a lin- PAs (Massively Parallel Processor Arrays). ear speedup over multiple computational nodes, without re- One application domain where it is often possible to both quiring a super-linear increase in inter-node communication. easily describe applications, and to take advantage of par- However, such applications are reliant on a cheap supply allel computational power, is Monte-Carlo simulation. This of high quality random numbers, particularly for the three is one of a class of “embarrassingly parallel” application domain maximum entropy distributions: uniform, used as a mains, where as the number of computational elements in- general source of randomness; Gaussian, for discrete-time creases, the communication required between elements does simulations; and exponential, for discrete-event simulations. not increase super-linearly. We know that Monte-Carlo ap- In this paper we look at four different types of platform: plications are able to scale (approximately) linearly across conventional multi-core CPUs (Intel Core2); GPUs (NVidia parallel architectures, but this doesn’t necessarily mean that GTX 200); FPGAs (Xilinx Virtex-5); and Massively Paral- the new generation of parallel architectures will be ideal for lel Processor Arrays (Ambric AM2000). For each platform Monte-Carlo; we also need to know that they can efficiently we determine the most appropriate algorithm for generat- support the types of computation found in such applica- ing each type of number, then calculate the peak generation tions. This includes standard building-blocks such as addi- rate and estimated power efficiency for each device. tion, multiplication, and division, but also requires Random Number Generators (RNGs). The importance of efficient RNGs is demonstrated by the Categories and Subject Descriptors enormous effort applied by researchers to traditional soft- B.8.2 [Hardware]: Performance and Reliability—Perfor- ware techniques, both to improve the efficiency of genera- mance Analysis and Design Aids tion techniques, and to modify simulations to reduce the number of random numbers consumed per run. Such is General Terms the cost of random numbers in software, that in the Gib- son method for simulating discrete-time biochemistry it is Algorithms, Design, Performance preferable to perform four floating-point operations (includ- ing a divide) just so that a previous random number can Keywords be re-used, rather than the simpler approach of generating a new random number. However, in an FPGA accelerated Monte-Carlo, Random Numbers, FPGA, GPU, MPPA simulation the economics change: it is actually cheaper to generate a new random number, with the added benefit that 1. INTRODUCTION the simulation code is easier to write and understand. The power of conventional super-scalar CPUs has steadily To understand how new parallel architectures can and increased for many decades, but we have now reached the should be used to accelerate Monte-Carlo, we need to prop- point where processor performance does not scale geomet- erly understand the costs of RNG primitives. In this pa- rically with passing generations. However, the performance per we consider how best to implement RNGs across four platforms, and then calculate the peak generation rate and power efficiency of each device. Our contributions are: An examination of random number generation on four Permission to make digital or hard copies of all or part of this work for • personal or classroom use is granted without fee provided that copies are different architectures: CPU, GPU, MPPA, and FPGA. not made or distributed for profit or commercial advantage and that copies Selection of RNGs for the uniform, Gaussian, and ex- bear this notice and the full citation on the first page. To copy otherwise, to • republish, to post on servers or to redistribute to lists, requires prior specific ponential distribution for each platform. permission and/or a fee. A comparison of the four platform’s potential for Monte- FPGA’09, February 22–24, 2009, Monterey, California, USA. • Copyright 2009 ACM 978-1-60558-410-2/09/02 ...$5.00. Carlo applications, both in terms of raw generation speed, and estimated power efficiency. tical randomness provided by TestU01, we also require that the empirical distribution is extremely accurate. To test the 2. REQUIREMENTS FOR RNGS marginal distribution, we require that the generator is able to pass a χ2 test with 216 equal probability buckets, for at Highly parallel architectures present challenges for RNG least 236 samples. designers due to the enormous quantities of random numbers generated and consumed during each application run. No specific requirements are made on data-types: if a gen- To ensure that we compare solutions that meet a common erator can pass the empirical quality tests, then whatever standard, we use the following requirements, which all the data-type (i.e. fixed-point or floating-point) it produces is generators benchmarked in this paper adhere to. assumed to be acceptable. Deterministic and Pseudo-Random: In this paper we only consider deterministic generators. This class of gener- 3. OVERVIEW OF RNG METHODS ator uses a deterministic state transition function, operat- There are a huge variety of methods for generating random ing on a finite-size state, and hence will produce a pseudo- numbers, both uniform and non-uniform, thanks to a stream random sequence that repeats with a fixed period. The rep- of theoretical advances and practical improvements over the etition property is critical in many applications, as it al- last 60 years. Devroye’s book [6] on the subject provides lows applications to be repeatedly executed using the same detailed coverage of methods up to the mid 1980s; more re- sequence of numbers, allowing “surprising” simulation runs cent developments in uniform generation are surveyed by to be re-examined, or to use the same random stimulus to L’Ecuyer [11], and Gaussian generators by Thomas [25]. drive different models. We also require that generators are There appear to be no recent surveys of exponential RNGs, pseudo-random, i.e. they should be designed to produce but many of the best techniques for Gaussian generation can a sequence that looks as random as possible (see later re- also be applied to exponential numbers. quirements on statistical quality). These requirements rules out both True Random Number Generators (TRNG) which 3.1 Uniform Generation use some source of physical randomness; and Quasi-Random The purpose of uniform RNGs is to produce a sequence Number Generators (QRNG) which produce a sequence that that appears as random as possible. Each generator has covers some multi-dimensional space “evenly”, rather than a finite set of states S, a deterministic stateless transition pseudo-randomly. function t from state to state, and a mapping m from each Period: The period of each random sequence must be at state to an output value in the continuous range [0, 1]. least 2160, which allows the overall sequence to be split up 64 64 t : S S m : S [0, 1] (1) into 2 sub-streams of length 2 , with an extra “safety- 7→ 7→ 32 factor” of 2 . Larger periods are of course desirable, and On execution the generator is started in some state s0 S. are achievable in software due to the high-ratio of memory Repeated application of t produces an infinite sequence∈ of to processing elements, but in highly-parallel devices it is successor states s1,s2, ..., and also a sequence of pseudo- likely to be difficult to allocate the large amounts of mem- random samples x1,x2, ...: ory needed to hold the state of each generator per element. However, parallel devices should be able to substitute in- si = t(si−1) xi = m(si) (2) creased processing per output sample for a large state, and As t is deterministic and S is finite, the output sequence so maintain statistical quality. must eventually repeat itself. The period p of the generator Stream-splitting: When executing Monte-Carlo applica- is defined as the shortest cycle within the sequence: tions in parallel it is critical that each node has an indepen- min : i : si+p = si 1 p S (3) dent stream of random numbers, completely independent of p ∀ ≤ ≤ | | the numbers on all other nodes. To ensure this indepen- dence we require that it must be possible to partition the One of the measures of efficiency of an RNG is how close stream into (at least) 264 non-overlapping sub-sequences of p is to S. In practise S is a vector of w storage bits, so a w length (at least) 264. Alternatively, the generator must have period close to 2 is desirable. a period so large that if 264 random sub-sequences of length The difficulty when creating RNGs is to balance the com- 264 are chosen, then the probability of any pair overlapping peting concerns. On the one hand we want extremely long is less than 2−64. periods, and sequences that are statistically random looking (even over long sub-sequences); but on the other, we want Empirical statistical quality: It is difficult to prove any- the computational cost per generated number to be very low, thing about the theoretical randomness of generators, par- and the size of the RNG state should not be excessive.

A Comparison of Cpus, Gpus, Fpgas, and Massively Parallel Processor Arrays for Random Number Generation

2.5 Classification of Parallel Computers

A Massively-Parallel Mixed-Mode Computer Designed to Support

Massively Parallel Computing with CUDA

CS 677: Parallel Programming for Many-Core Processors Lecture 1

Introduction to High Performance Computing

1 Parallelism

Embarrassingly Parallel

Core Processors

Embarrassingly Parallel Jobs Are Not Embarrassingly Easy to Schedule on the Grid

Massively Parallel Computers: Why Not Prirallel Computers for the Masses?

Introduction to Parallel Computing

Parallel Computing Introduction Alessio Turchi