
Estimation of Sample Mean and Variance for Monte-Carlo Simulations

David B. Thomas and Wayne Luk
Imperial College London
{dt10,wl}@doc.ic.ac.uk

Abstract

Monte-Carlo simulations are able to provide estimates of solutions for problems that are otherwise intractable, by examining the aggregate behaviour of large numbers of random simulations. Because these simulations are independent and embarrassingly parallel, FPGAs are increasingly being used to implement Monte-Carlo applications. However, as the number of simulation runs increases, the problem of accurately accumulating aggregate statistics, such as the mean and variance, becomes very difficult. This paper examines three accumulation methods, adapting them for use in FPGA applications. In particular, we develop a mean and variance calculator based on cascading accumulators, which is able to process streams of floating-point data in one pass, while operating in fixed-point internally. This method has the advantage that it calculates the exact sample mean and an accurate, numerically stable sample variance, while using few logic resources and providing performance to match commercial floating-point operators: clock rates of 434MHz are achieved in Virtex-5, 1.46 times faster than a double-precision accumulator, while using one eighth of the resources.

1. Introduction

Monte-Carlo simulations are a class of applications that often map particularly well to FPGAs, due to the embarrassingly parallel nature of the computation. The huge number of independent simulation threads allows FPGA-based simulators to be heavily pipelined [6], and also allows multiple simulation instances to be placed within one FPGA, providing a roughly linear increase in performance as the size of the FPGA is increased. A second advantage is the relatively low IO required per simulator instance, meaning that low-performance buses can support very large amounts of computation without becoming a bottleneck [8].

A side-effect of high-performance FPGA Monte-Carlo simulators is that they generate huge numbers of results per second, and statistics on these results must be accurately accumulated to provide the overall answer. To keep up with the simulations this accumulation must happen at full speed, processing large streams of results in one pass, and using constant resources to process one result per clock-cycle. The accumulators must also be able to maintain accuracy over large numbers of samples, to ensure that numerical stability problems do not bias the overall simulation results.

This paper examines methods for the calculation of mean and variance over large data sets using FPGAs. Our contributions are:

• Identification of accumulation methods used in software that can be adapted to work in hardware.

• An adaptation of the cascading accumulators method for hardware, allowing floating-point data to be efficiently processed in fixed-point format.

• An empirical evaluation of the accuracy of five different hardware accumulation methods, showing that only the cascading accumulator and double-precision methods maintain sufficient accuracy.

• A comparison of area and speed, showing that the cascading accumulators method uses one eighth of the logic resources of the double-precision method, while operating 1.46 times faster.

2. Motivation

At an abstract level Monte-Carlo simulations can be described as a tuple (P, O, A, f, a, p, d0):

P : The set of possible simulation inputs, including environmental parameters and the starting state.

O : The set of possible simulation run outputs, containing the results produced by a single simulation run.

A : The set of accumulator states.

f : P → O : A stochastic simulation function which maps a simulation input to one possible result. This function must involve some randomness, so two executions using the same simulation input will not provide the same output.

a : A × O → A : A deterministic accumulation function that combines the current accumulator state with the result from a simulation run. The function should be order independent, i.e. a(a(d,e1),e2) = a(a(d,e2),e1).

p ∈ P : An element from the simulation input set giving the starting point for the application.

d0 ∈ A : The initial state of the accumulator.

The simulation is then executed by generating a series of simulation results, and accumulating the results:

    di = a(di−1, ei)    ei = f(p)    (1)

As i → ∞ the estimate accumulated into di will asymptotically converge on the average properties of f(p).
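To make the execution model concrete, the following minimal C sketch (ours; all type and function names are illustrative, not from the paper) folds a stream of simulation results into an accumulator state exactly as in Equation 1, without ever storing the individual results:

```c
/* Opaque accumulator state: the concrete contents depend on the
 * accumulation method chosen later. */
typedef struct { double s0, s1, s2; } accum_t;

/* d_i = a(d_{i-1}, e_i) with e_i = f(p): the host only ever sees the
 * evolving accumulator state, never the stream of results itself.  */
accum_t run_simulation(double  (*f)(const void *),        /* f : P -> O     */
                       accum_t (*a)(accum_t, double),     /* a : A x O -> A */
                       const void *p, accum_t d0, unsigned long runs)
{
    accum_t d = d0;
    while (runs-- > 0)
        d = a(d, f(p));
    return d;
}
```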

In a hardware implementation it is convenient to view each simulator instance as a black box, with one input that accepts p, and one output that produces a stream of results derived from independent simulation runs. The rate of result generation will vary between different simulations, and between different architectures implementing a given simulation. In the extreme case the simulator will generate one new result every cycle [4], while in other cases the simulator will on average produce less than one result per cycle.

Even when generating less than one result per cycle, the simulator may exhibit bursty behaviour, where the simulator produces one result per cycle for a short period of time. This may be somewhat predictable: if for example the simulator uses a circular pipeline of depth k, and each simulation run requires m circuits of the pipeline, then there will be a burst of k results followed by a gap of k(m−1) cycles [6, 9]. However, more complicated simulations may take a random number of cycles to complete, such that the length of each burst follows a distribution [7]. The tails of this distribution may be very long, making the probability of a very long burst small, but still non-zero.

In such a situation it is important not to drop any results in the burst, as this may bias the overall result of the simulation. For this reason it is necessary to design accumulators that can handle the worst case, so that they can accept a sustained rate of one input value per cycle. This has the desirable side-effect of making the interface between the simulation black-box and the accumulator modular, as the simulator can simply throw results out whenever they become ready, without needing to worry about whether the downstream accumulator is ready to accept them.

So far we have not specified exactly what the accumulator does, as there are a number of choices. In some applications it is necessary to estimate the whole empirical distribution function of the simulation results, requiring complex operations, but this is relatively uncommon. A slightly simpler case is when it is only necessary to estimate one quantile of the results; this occurs, for example, in Value-at-Risk calculations, where one wishes to estimate the 1% quantile of loss (how much one might lose every one in a hundred days). In this paper we consider the simplest and most common type of accumulator: the mean. This estimates the expected value of the simulation output as the sample count increases to infinity, and is used in many applications: for example, in pricing financial instruments one typically requires the average price over all possible future outcomes, while in Monte-Carlo integration it is necessary to estimate the average occupancy over the integration domain.

If we assume that the output of each simulation is a scalar value, then the sample mean (x̄) of the first n simulation runs is defined as:

    x̄ = (1/n) ∑_{i=1}^{n} xi    (2)

For example, if the samples were generated by a Monte-Carlo simulation for option pricing, where each value xi is the price of the option at the end of one simulation trial, then x̄ is an estimate of the true price of the option.

By itself the value of x̄ is not very helpful, as it gives no idea how much confidence can be placed in the estimate. The second key statistic is the sample variance (s²), which provides an estimate of how much the individual samples are spread around the mean:

    s² = (1/n) ∑_{i=1}^{n} (xi − x̄)²    (3)

An important related statistic is the sample standard deviation (s), which is simply the square root of the variance. The advantage of the standard deviation is that it is measured in the same units as the data samples, while the variance is measured in squared units.

The definitions of mean and variance are straightforward, but actually calculating the statistics is more complex, particularly for the large numbers of samples generated by Monte-Carlo simulations: an FPGA might be configured with 10 or more simulation cores, each generating one result per cycle at 500MHz, so even one second of execution provides over 2^32 samples, which imposes a number of requirements on the accumulators.

The first requirement is that the accumulator must be able to operate in a streaming fashion, using constant resources, and able to accept one new floating-point value per cycle at the same clock rate as the floating-point cores used to calculate the values. In the formal definition of variance (Equation 3) it is assumed that the mean has already been calculated. This is a problem for streaming data applications, as it requires all the intermediate samples to be stored between the calculation of the mean and the variance. This is a particular problem in an FPGA, as a RAM that is both large enough and fast enough to buffer all the samples is likely to be very expensive, so the accumulator must be able to operate in just one pass through the data.

A second requirement is that the estimates of the mean and variance are accurate, particularly when the results are floating-point values that may range over many orders of magnitude. Using the wrong method for accumulating the mean may lead to significant biases in the result, while some simple methods for calculating the variance can even lead to nonsensical results such as negative variance. Any realistic method needs to produce a very accurate mean, and must at least provide a tight upper bound on the variance.

A third requirement is that the IO requirements of the accumulator are very low. Most numerical Monte-Carlo simulations are organised in a system that contains both a CPU and an FPGA (or more than one of each). The CPU is responsible for managing jobs and transferring simulation inputs to the FPGA. Within the FPGA multiple simulation instances all work in parallel on the same simulation, producing an aggregate estimate. As the simulations proceed the CPU occasionally checks the state of each instance, combining the results to produce one overall result. Based on the accuracy of this result the CPU can then decide whether the answer is sufficiently accurate, or whether more simulation runs are needed.

Depending on the interface between the CPU and the FPGA, the bandwidth may be low and the latency high, making it expensive for the CPU to read each accumulator's state. This is compounded by the fact that the CPU has to manage multiple simulation instances within the FPGA, as well as performing other work such as network operations and other tasks. Because of this, it is critical that the accumulator is able to process data for a large number of samples between communications with the host. A practical minimum is that each accumulator should be able to operate independently for at least one second; assuming a maximum clock rate of 500MHz, that corresponds to the processing of 2^30 inputs.

The requirements for our accumulator are to:
1. calculate the mean and variance for a stream of scalar floating-point values.
2. sustain a rate of one input per cycle.
3. use constant resources (i.e. resources do not increase with the number of inputs processed).
4. provide accurate estimates for the mean and variance for sample sizes up to at least 2^30, without any intermediate IO.
5. operate at the same rate as floating-point units.
6. have low resource usage.

3. Existing Methods

In this section we identify a number of existing algorithms for calculating the mean and variance of data, and identify those that can be adapted for use in a hardware accumulator.

Textbook algorithm: By rearranging the definitions of mean and variance, it is possible to develop a one-pass algorithm by tracking three values: T, the sum of all the sample values; Q, the sum of all the squared sample values; and n, the total number of samples accumulated. The values are updated for each new datum xi:

    Ti ← Ti−1 + xi    Qi ← Qi−1 + xi²    (4)

Using these three values the statistics can be calculated for all the samples seen so far:

    x̄ = Tn/n    s² = (Qn − Tn²/n)/n    (5)

This method requires few operations per sample, but is also very numerically unstable when implemented using floating-point. The main problem is in the calculation of the variance in Equation 5, as the values of Qn and Tn²/n may be similar in magnitude if the variance is small compared to the mean, leading to large cancellation errors or even a negative estimated variance.
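As a reference point, here is a minimal software rendering of the textbook method (our C sketch of Equations 4 and 5; the helper names are ours). The subtraction Qn − Tn²/n in the last line is exactly where the cancellation described above occurs:

```c
typedef struct { double T, Q; unsigned long n; } textbook_t;  /* init to {0,0,0} */

static void textbook_add(textbook_t *s, double x)
{
    s->T += x;          /* Ti = Ti-1 + xi    (Equation 4) */
    s->Q += x * x;      /* Qi = Qi-1 + xi^2  (Equation 4) */
    s->n += 1;
}

static double textbook_mean(const textbook_t *s) { return s->T / s->n; }

static double textbook_variance(const textbook_t *s)
{
    /* Equation 5: when the variance is small relative to the mean, Q and
     * T^2/n nearly cancel, losing precision or even going negative.      */
    return (s->Q - (s->T * s->T) / s->n) / s->n;
}
```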
Updating algorithm: The instability can be treated by calculating the mean and variance using the updating (or online) method [1]. The updating algorithm tracks M, the sample mean, and S, the sum of squared differences about M:

    Mi ← Mi−1 + δ/i,  where δ = xi − Mi−1    (6)
    Si ← Si−1 + δ(xi − Mi)    (7)

The disadvantage of this method is that the computational cost is significantly higher: the number of additions has increased from two to four, and there is now also a division operation.
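In software the updating method is the familiar Welford-style loop; a minimal C sketch of Equations 6 and 7 (names ours). Note the loop-carried dependency through M and S, which is what later forces the C-Slow treatment in hardware:

```c
typedef struct { double M, S; unsigned long n; } updating_t;  /* init to {0,0,0} */

static void updating_add(updating_t *s, double x)
{
    s->n += 1;
    double delta = x - s->M;      /* delta = xi - Mi-1                 */
    s->M += delta / s->n;         /* Equation 6: Mi = Mi-1 + delta/i   */
    s->S += delta * (x - s->M);   /* Equation 7: uses the updated Mi   */
}

/* The mean is s->M; the variance of Equation 3 is s->S / s->n. */
```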
Pairwise summation: One method used in the general floating-point summation problem is pairwise summation, which allows the use of floating-point operations while retaining numerical stability. The central idea is to form a binary tree of intermediate sums and sums of squares, from the input samples at the leaves up to the final overall sum and sum of squares at the root. Because each node's children accumulate a similar number of samples, the chance of cancellation is greatly reduced. However, this method requires a non-constant amount of resources, as the storage and computational requirements grow with O(log n), so we do not consider it here.

Cascading accumulators: A general technique for reducing cancellation in floating-point adders is to make sure that only terms of a similar magnitude are added together. In the pairwise summation approach this is achieved implicitly, as it is assumed that the magnitude of sums at different nodes in the tree will grow with depth, so two nodes at the same level will have approximately the same magnitude. Cascading accumulators explicitly group together terms of similar magnitude, by taking advantage of the floating-point exponent. Each floating-point number can be viewed as s × f × 2^e, where s ∈ {−1,1}, f ∈ [1,2), and e ∈ Z (ignoring denormals and zero).

If we want to group together numbers with similar magnitudes, then any two numbers with the same exponent will be within one binary order of magnitude of each other. The idea behind the cascading accumulator is to maintain a table of accumulators, with one accumulator for each exponent value. In practice the set of possible exponents is finite, so the table size is a constant determined by the number of exponent bits in the floating-point format.

4. Hardware Architecture for Textbook and Updating Algorithms

Both the textbook and updating methods maintain a three-element state, with each element of the state updated for every new value that is accumulated. The state update functions require one or more floating-point operations, which in software can be executed sequentially, allowing a single accumulation state to be updated in place. However, in hardware these floating-point operations must be pipelined for efficiency, and the loop-carried dependency cannot be removed.

The solution to this problem is to use a C-Slow approach, which uses multiple independent accumulation states, all cycling through the same pipelined loop [5]. For example, the textbook algorithm contains a dependency between Ti and Ti−1 via an addition. If the pipelined addition operator requires p cycles, then we introduce p completely independent accumulation states, which rotate through the pipeline.

A minor disadvantage of using this approach is that the aggregate mean and variance must be calculated in software, using the values from the p separate accumulators. For the textbook method this is very simple: if (Tn, Qn) and (Tm, Qm) are the states of two accumulators, then the combined accumulator state is simply:

    Tn+m = Tn + Tm    Qn+m = Qn + Qm    (8)

By repeatedly combining the individual states, eventually one combined state is produced, which estimates the mean and variance over all the samples.

In the case of the updating method more work is required to update the running estimates of the sum of squared differences (S), as each accumulator will have taken squared differences around a slightly different sample mean (M):

    Mn+m = (nMn + mMm)/(n + m)    (9)
    Sn+m = Sn + Sm + (nm/(n + m))·(Mn − Mm)²    (10)

An important difference between the textbook and updating methods is that in the updating method it is necessary to know the number of samples associated with each accumulator state, as the values of n and m are needed to combine individual states. However, in the textbook method it is only necessary to know the total number of samples across all the states, without needing to track the number of samples contributed to each value of T and Q. Tracking only the total number of samples means that the textbook method can save a small amount of hardware, as it can use a single counter shared amongst all the accumulator states, rather than requiring storage for p separate counters (where p is the C-Slow factor).
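The software-side reduction over the p C-Slow states is then a pairwise fold. The sketch below (ours) implements Equation 8 for textbook states and Equations 9 and 10 for updating states:

```c
typedef struct { double T, Q; } tb_state_t;
typedef struct { double M, S; unsigned long n; } up_state_t;

/* Equation 8: textbook states combine by plain addition. */
static tb_state_t tb_combine(tb_state_t a, tb_state_t b)
{
    return (tb_state_t){ a.T + b.T, a.Q + b.Q };
}

/* Equations 9 and 10: updating states must reconcile their different
 * means; assumes at least one of the two states is non-empty.        */
static up_state_t up_combine(up_state_t a, up_state_t b)
{
    up_state_t r;
    double d = a.M - b.M;
    r.n = a.n + b.n;
    r.M = (a.n * a.M + b.n * b.M) / r.n;                  /* Equation 9  */
    r.S = a.S + b.S + d * d * ((double)a.n * b.n) / r.n;  /* Equation 10 */
    return r;
}
```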
Figure 1 gives an example of a pipelined textbook accumulator, using floating-point operators with three cycles of latency. Floating-point values are streamed into the accumulator via iX, with the iValid input indicating on each cycle whether iX is valid or not. The input value is then squared, with the associated value and valid flag buffered to keep them synchronised. The valid flag is used to control whether a counter is incremented, which tracks the total number of samples across all accumulator states. The sample value and squared sample value are added to whichever accumulator state happens to be exiting the bottom of the adders when they arrive.

On successive cycles the outputs of the design will cycle through the different accumulator states, which could in principle be used for online monitoring purposes. However, the more likely scenario is that in the client-server application model the client (software) will occasionally request an update to see whether the results have reached the desired accuracy. We can take advantage of these occasional updates to clear the accumulator states, as the fewer samples that are combined in each accumulator state, the more accurate the estimate will be. In the design shown, when iRead is asserted for three cycles the whole state is streamed out, while at the same time all the accumulators are cleared.

[Figure 1. Example of textbook accumulator using floating-point adders with a latency of 3. Inputs: iX, iValid, iRead; outputs: oSum, oSumSqr, oCount, oValid.]

5. Hardware Architecture for Cascading Accumulators

In software the cascading accumulators method is usually implemented using floating-point operations for each accumulator, but again we run into the problem of high-latency floating-point operators combined with loop-carried dependencies. In principle the same C-Slow approach could be used, but this leads to excessive consumption of memory. To support the 8-bit exponents of single-precision floating-point numbers, we must use a RAM with 256 entries. However, to support the 12-cycle delay of a Virtex-5 floating-point adder, we must replicate each entry 12 times, requiring 3072 entries. Each entry must contain both a running sum and sum of squares, so the total number of 36x1024 Virtex-5 RAM primitives required is 6.

As well as the RAM usage, another problem with single-precision accumulators is that they do not necessarily lead to a more accurate calculation of mean and variance. For example, if all the input samples happen to have the same exponent (e.g. they were all between one and two), then the cascading accumulator would effectively become the textbook accumulator. So such an arrangement would require all the resources of the textbook accumulator, plus a number of block RAMs, without improving accuracy.

Our solution to this problem is to observe that much of the logic associated with floating-point adders is due to the shifting of values, both when aligning the input arguments, and when normalising the result. However, the whole point of the cascading accumulators method is that values added to a given accumulator will have the same exponent, so we can add the mantissas together without any shifting. This has a number of advantages:

• Logic resource utilisation is greatly reduced, as the floating-point adders reduce to fixed-point adders.

• The fixed-point adders have lower latency, reducing the RAM resources needed to support C-Slow.

• In the absence of overflow, each accumulator will be exact, so the aggregated sum is also exact.

This last point is particularly interesting, as it eliminates one of the key worries when attempting to extract accurate results from a huge number of simulations. This is particularly important for the calculation of the mean, as it is the mean that is the "answer" that the simulation provides, while the variance only estimates the accuracy of that answer.

In order to achieve exact accumulation, we make sure that the fixed-point accumulators cannot overflow. A floating-point number with mw mantissa bits, including the implicit bit, can be converted into a fixed-point twos-complement number with 1 + mw bits. If we wish to accumulate up to n samples without overflowing, then the fixed-point accumulators must have at least 1 + log2(n) + mw bits. However, even with the reduced latency, the accumulator update must still be C-Slowed by some factor p, which means that each accumulator (i.e. each RAM entry) must have 1 + log2(n/p) + mw bits.

Figure 2 shows how the fixed-point cascading accumulator is constructed, showing just the summation part (no sum of squares or counter). A C-Slow factor of p = 4 is used, as this provides enough stages to pipeline, while mapping well into the block RAMs of contemporary architectures. The different stages of the pipeline are shown as different shades of grey on each register, from white through to dark grey, then back to white.

[Figure 2. Cascading accumulators architecture for calculating the sum of floating-point values. Inputs: iSign, iExponent, iMantissa; outputs: oSum, oIndex.]

On the right of the figure, the dashed data-path shows the conversion of the sign-magnitude mantissa into twos-complement. On the left, the solid data-path shows the fixed-point accumulation pipeline, where values are read from the accumulator table, added to the incoming value, then stored again. In the centre, the dotted data-path shows the flow of addresses used to select the accumulators. Each address is formed from a combination of the value's exponent and a 2-bit slot counter that is incremented every cycle; the slot counter ensures that over a four-cycle period no RAM entry can be read or written more than once.
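The behaviour of this datapath can be captured in a short bit-level software model. The C sketch below is our own illustration (not the paper's VHDL) of the single-precision sum path only: the exponent selects a table entry, the sign-magnitude mantissa is converted to twos-complement, and the per-entry update is a plain integer addition with no aligning or normalising shifts. A 64-bit entry comfortably exceeds the 1 + log2(n) + mw = 55 bits needed for n = 2^30:

```c
#include <math.h>
#include <stdint.h>
#include <string.h>

typedef struct { int64_t sum[256]; } cascade_t;   /* one bin per exponent */

static void cascade_add(cascade_t *c, float x)
{
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);            /* view x as s * f * 2^e   */
    uint32_t e = (bits >> 23) & 0xFFu;         /* 8-bit biased exponent   */
    int64_t  m = bits & 0x7FFFFFu;             /* 23 fraction bits        */
    if (e != 0) m |= 1u << 23;                 /* implicit leading one    */
    if (bits >> 31) m = -m;                    /* sign-magnitude -> 2's C */
    c->sum[e] += m;                            /* exact, shift-free add   */
}

static double cascade_total(const cascade_t *c)
{
    /* Software aggregation: rescale each bin by 2^(e - 127 - 23).
       Denormals (e = 0) use the e = 1 scale, matching IEEE-754.    */
    double t = 0.0;
    for (int e = 0; e < 256; e++)
        if (c->sum[e] != 0)
            t += (double)c->sum[e] * ldexp(1.0, (e ? e : 1) - 127 - 23);
    return t;
}
```

In the hardware of Figure 2 the table becomes a block RAM, and the 2-bit slot counter is appended to the exponent to form the address, giving the p = 4 independent copies of each bin.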
An important practical consideration in modern FPGAs is the presence of pipeline registers in the block RAMs. If these registers are not used, then the performance of the RAMs is severely limited, so it is essential to support them. However, this consumes a valuable pipeline stage, meaning that all the entry-update logic must be completed in a single stage. To recover this stage we added collision detection, which checks whether the address currently being written is also being read. When a collision happens, the value just calculated is forwarded, rather than using the stale value from memory. This emulates Write-Before-Read behaviour between the two ports, a feature that is usually only possible when using just one port.

As before, the current accumulator value is read by streaming all the individual accumulator values out over a number of successive cycles, performing a reset at the same time. During the read/reset period a small control circuit takes over the address generator for the RAM, so that all the addresses of the RAM are examined in turn (not shown in the figure). This means that during the reset period any new values coming into the accumulator will be discarded. However, the reset period only takes as many cycles as there are entries in the RAM, and when viewed as a fraction of the total accumulation cycles, it is negligible. The ignored values will also not bias the results in any way, as the chance of missing any data item due to a reset is independent of both the data generation process and the data values.

Figure 2 shows only the sum pipeline, and does not include the circuitry to calculate the sum of squares, or to track the total number of samples; as with the textbook method, it is not necessary to maintain a separate count for each accumulator state. The sum of squares is calculated by squaring the mantissa of the incoming value, and accumulating it in the same way as is shown for the basic sum. It is possible to make the sum of squares exact, but this requires much wider accumulators: the exact squared mantissa requires 2mw bits (with no sign bit), so a width of log2(n/p) + 2mw bits would be needed. To put this in context, for typical values of n = 2^30, p = 4, and mw = 24, this would require the RAM entries and adders to be 77 bits wide, and would need a total of 130 bits per RAM entry to store both the sum and sum of squares.

Another choice is to round the squared mantissa to 1 + mw bits before adding it to the accumulator, which is roughly equivalent to performing the square using the input floating-point precision. Although this means that the sum of squares is no longer exact, the only source of error is the per-value rounding error, with no additional error introduced when multiple values are combined. Using this scheme both the sum and sum-of-squares accumulators are 53 bits wide, requiring a total of 106 bits per RAM entry (i.e. 3×36-bit parallel block RAMs).

6. Evaluation

The cascading accumulator was described using FPGA-family independent VHDL, relying on HDL inference for the instantiation of block RAMs and DSPs. However, the arrangement of logic and registers is targeted specifically at the RAMB36 and DSP48E primitives, so performance and resource utilisation on other FPGA families would be less optimal.

The textbook and updating accumulators are described using Handel-C, with all floating-point components generated using Xilinx CoreGen 9.2. All cores are generated using the "maximum pipelining" option, with "full usage" of DSP48E resources for multipliers, and "no usage" for adders. The Handel-C code is compiled to VHDL using DK5 with default optimisations.¹

¹ Note that the performance bottleneck of all Handel-C designs was the floating-point cores, so there is no performance penalty associated with using Handel-C as opposed to VHDL.

All designs are synthesised using XST 9.2, with all optimisations for speed enabled. The designs are then placed and routed for the Virtex-5 xc5vlx110 using ISE 9.2, using all default settings. No timing constraints are used for any of the designs.

Table 1 shows the results for each of the accumulators, ordered from top to bottom by increasing resource utilisation, with figures in brackets indicating performance relative to the cascading accumulator. By far the smallest (in terms of logic resources) is the cascading accumulator, which is a third the size of even the basic single-precision textbook accumulator. The only drawback is that it needs 3 block RAMs, while none of the others need RAM resources.

The cascading accumulator is also the joint fastest method, achieving a clock rate of 434MHz, the same as the single-precision textbook accumulator. The performance limit is the 53-bit fixed-point adder used to update the accumulators, which is implemented as a non-pipelined ripple-carry adder. In principle the performance could be increased by pipelining the adder, but in practice it can already operate at the same speed as the floating-point cores that would be used to calculate the values it accumulates.

The final two columns of Table 1 provide an empirical estimate of the accuracy of each method. This was measured by generating 2^30 normally distributed random samples, with a mean and variance of 1. The same sequence was generated for each test, with the accuracy measured against a reference accumulator which uses 384-bit floating-point. The accuracy is measured as accurately calculated bits of precision for the mean and standard deviation: given a value x calculated by the accumulator, and x*, the reference value, the bits of precision are calculated as −log2((x − x*)/x*).
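A one-line restatement of this metric (our C helper; x_star is the 384-bit reference value) shows why the exact cascading mean is reported as ∞ in Table 1:

```c
#include <math.h>

/* Bits of precision: -log2(|x - x*| / |x*|). When x == x*, the relative
 * error is 0 and the result is +infinity, as for the cascading mean.   */
static double bits_of_precision(double x, double x_star)
{
    return -log2(fabs((x - x_star) / x_star));
}
```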

Table 1. Resource utilisation and performance for accumulators in the Virtex-5 xc5vlx110. Bracketed figures are relative to the cascading accumulator; accuracy is in bits of precision after 2^30 samples.

Method             Slices         LUT-FF Pairs    LUT    FF      DSP   RAM   MHz           Mean   Std
Cascading           223 (1.0)       601 (1.0)      473    575     2     3    434 (1.00)    ∞      36.4
Textbook[Single]    768 (3.4)      2127 (3.5)     1732   1971     2     0    434 (1.00)    0.1     5.4
Updating[Single]   1441 (6.5)      4976 (8.3)     3458   4435     2     0    327 (0.75)   19.8     6.4
Textbook[Double]   1597 (7.2)      5094 (8.5)     4191   4728    12     0    297 (0.68)   46.4    38.8
Updating[Double]   3937 (17.7)    13985 (23.3)    9013  12428    12     0    288 (0.66)   48.5    48.5

The most striking figure is just how inaccurate the single-precision textbook method becomes, with complete loss of precision for the estimate of the mean. The problem is that once the running sums in the accumulator exceed 2^25, many of the values being accumulated are too small in comparison to have any effect, with the problem becoming more serious the larger the accumulators become. Figure 3 charts the change in accuracy for the methods as the sample size is increased (missing data-points indicate no error), showing that the single-precision methods gradually decline from about 26 bits down to about 18 bits, with the single-precision textbook method failing catastrophically at n = 2^29. By comparison, all the double-precision methods maintain over 45 bits of precision throughout. The cascading method is exact at all data points, so it is not plotted.

Figure 4 shows the change in accuracy for the standard deviation with increasing sample sizes. As before, the double-precision versions are both accurate, but in this case the updating method has a clear advantage of 10 or more bits. The single-precision methods maintain approximately 25 bits of precision until a sample size of around 2^18, then the accuracy of both degrades at around 1 bit for every doubling of sample size, with the updating method maintaining an edge of about 5 bits.

The most interesting part of Figure 4 is the behaviour of the cascading method: unlike the other methods, its accuracy actually increases with the sample size. Our hypothesis for explaining this behaviour is that the only error source is the rounding when each sample is squared, while the accumulation of the squared values is exact. Because the input samples are random, the rounding error will also be random, so the law of large numbers comes into effect, meaning that as the sample size increases, the accuracy also increases.

If we combine the requirements of accuracy and resource efficiency, then the only two reasonable choices are the cascading accumulator and the double-precision textbook accumulator. None of the single-precision methods are able to offer sufficient accuracy, while the increase in standard-deviation accuracy provided by the double-precision updating method does not justify the huge number of resources it requires. Overall, the cascading accumulator provides by far the best combination of speed, resource efficiency, and accuracy.

7. Related Work

Previous work on accumulation has focused on the floating-point summation problem, as this is used in the dot-product operations which form the core of matrix-matrix and matrix-vector operations. Recent work has examined the use of off-the-shelf floating-point operators, by constructing reduction circuits that use buffers to hold intermediate sums [10]. Another approach that attempts to deal with the inherent accuracy problems of floating-point summation splits the input data into groups, then aligns the values according to the largest value in each group [3]. These architectures are all driven by the need to reduce each summation to a single total, as the sum must be consumed elsewhere within the FPGA. However, in Monte-Carlo applications it is not necessary to reduce to a single value, as that step can be performed in software.

The most directly comparable method has been developed within the FloPoCo project [2], where the same idea of using a large fixed-point accumulator is used to provide increased accuracy without sacrificing speed.
However, their approach is to choose a fixed range for the floating-point accumulator, then to shift each floating-point mantissa to the correct position before adding it to the accumulator. This means that the range of the fixed-point accumulator must be chosen before accumulation begins, so a priori knowledge of the number range is needed. The accumulation will also not be exact, and is limited to the precision chosen for the single accumulator. One benefit of the single-accumulator approach is that it can quickly be converted back into a floating-point number for consumption on chip, but this is not important for Monte-Carlo applications.²

² This work appears to have been developed in parallel with ours, and only came to light when preparing the camera-ready version of this paper. Quantitative comparisons will be made in future work.

[Figure 3. Accurately calculated bits of sample mean for increasing sample size: accuracy (bits) against sample count (log2) for Textbook[Single], Textbook[Double], Updating[Single], and Updating[Double].]

[Figure 4. Accurately calculated bits of sample standard deviation for increasing sample size: accuracy (bits) against sample count (log2) for all five methods, including Cascading.]

8. Conclusion

Gathering statistics is a key part of any Monte-Carlo simulation, and it is critical that these statistics are gathered accurately, without sacrificing performance or requiring large amounts of resources. In this paper we have examined three different methods for calculating mean and variance, showing how existing software methods can be adapted to provide high-performance modules that are able to process large streams of floating-point data in one pass, without requiring any knowledge about the range of the input data.

By modifying the existing cascading accumulators method we are able to create an FPGA-specific design, which is able to accept floating-point input data while using fixed-point internally, without requiring expensive and slow normalisation steps. As well as providing good performance and low resource utilisation, this method has the advantage that it is able to calculate the exact mean of large streams of data, eliminating concerns that poor statistical accumulators may bias the results of simulations. Numerical experiments also show that as well as accumulating the mean exactly, the method is able to calculate the standard deviation to a relative error of 2^−36 on sample sizes of 2^30.

Future work will further investigate the numerical behaviour of the fixed-point cascading accumulator, both through empirical studies, and by determining theoretical worst-case bounds for the error. In addition we will investigate methods for reducing the number of RAMs required, particularly for floating-point formats with larger exponent ranges. Another possibility is the online accumulation of higher-order statistics, such as skew and kurtosis.

References

[1] T. F. Chan, G. H. Golub, and R. J. LeVeque. Algorithms for computing the sample variance. The American Statistician, 37(3):242–247, 1983.
[2] F. de Dinechin, B. Pasca, O. Cret, and R. Tudoran. An FPGA-specific approach to floating-point accumulation and sum-of-products. Technical Report ensl-00268348, École normale supérieure de Lyon, 2008.
[3] C. He, G. Qin, M. Lu, and W. Zhao. Group-alignment based accurate floating-point summation on FPGAs. In Proc. ERSA, pages 136–142, 2006.
[4] G. W. Morris and M. Aubury. Design space exploration of the European option benchmark using Hyperstreams. In Proc. FPL, pages 5–10, 2007.
[5] H. Styles, D. B. Thomas, and W. Luk. Pipelining designs with loop-carried dependencies. In Proc. FPT, pages 255–263, 2004.
[6] D. B. Thomas, J. A. Bower, and W. Luk. Automatic generation and optimisation of reconfigurable financial Monte-Carlo simulations. In Proc. ASAP, pages 168–173, 2007.
[7] D. B. Thomas and W. Luk. A domain specific language for reconfigurable path-based Monte Carlo simulations. In Proc. FPT, pages 97–104, 2007.
[8] D. B. Thomas and W. Luk. Credit risk modelling using hardware accelerated Monte-Carlo simulation. In Proc. FCCM, 2008.
[9] G. L. Zhang, P. H. W. Leong, C. H. Ho, K. H. Tsoi, D.-U. Lee, R. C. C. Cheung, and W. Luk. Reconfigurable acceleration for Monte Carlo based financial simulation. In Proc. FPT, pages 215–224, 2005.
[10] L. Zhuo, G. R. Morris, and V. K. Prasanna. Designing scalable FPGA-based reduction circuits using pipelined floating-point cores. In Proc. IPDPS, 2005.