Monte Carlo Automatic Integration with Dynamic Parallelism in CUDA
Elise de Doncker, John Kapenga and Rida Assaf
Western Michigan University, 1903 W. Michigan Avenue, Kalamazoo MI 49008
e-mail: {elise.dedoncker, john.kapenga, rida.assaf}@wmich.edu

Abstract The rapidly evolving CUDA environment is well suited for numerical integration of high-dimensional integrals, in particular by Monte Carlo or quasi-Monte Carlo methods. With some care, near-peak performance can be obtained on important applications. A basis for efficient numerical integration using kernels on CUDA GPUs is presented, showing several ways to use CUDA features, provide automatic error control, and prevent or detect roundoff errors. This framework allows easy extension to multiple GPUs, clusters and clouds for addressing problems that were impractical to attack in the past, and is the basis for an update to the PARINT numerical integration package.

1 Introduction

Problems in computational geometry, computational physics, computational finance, and other fields give rise to computationally expensive integrals. This chapter addresses applications of CUDA programming to Monte Carlo integration. A Monte Carlo (MC) method approximates the expected value of a stochastic process by sampling, i.e., by performing function evaluations over a large set of random points, and returns the sample average of the evaluations as the end result. The functions found in the application areas mentioned above are usually not smooth and may have singularities within the integration domain, which necessitates the generation of large sets of random numbers.

The algorithms described here will be incorporated in the PARINT multivariate integration package, where we are concerned with the automatic computation of an integral approximation

    Q[D]f \approx I[D]f = \int_D f(x)\,dx,

and an estimate or bound $\mathcal{E}[D]f$ for the error

    E[D]f = |I[D]f - Q[D]f|.

Thus PARINT is set up as a black box, to which the user specifies the integration problem, including a requested accuracy and a limit on the number of integrand evaluations. In its current form, PARINT supports the computation of multivariate integrals over hyper-rectangular and simplex regions. The user specifies the region $D$ by the upper and lower integration limits in the case of a hyper-rectangular region, or by the vertex coordinates of a simplex. The integrand $f : D \to \mathbb{R}^k$ is supplied as a function written in C (or consisting of C-callable code). The desired accuracy is determined by an absolute and a relative error tolerance, $\varepsilon_a$ and $\varepsilon_r$, respectively; the objective is to either satisfy the prescribed accuracy,

    |Q[D]f - I[D]f| \le \mathcal{E}[D]f \le \max\{\varepsilon_a, \varepsilon_r\,|I[D]f|\},    (1)

within the allowed number of function evaluations, or flag an error condition if the limit has been reached. In the remainder of this chapter we assume that the integrand $f$ is a single-valued function ($k = 1$), and that $D$ is the $d$-dimensional unit hypercube, $D = [0,1]^d$ (which will generally be omitted from the notation).

The available approaches in PARINT include adaptive, quasi-Monte Carlo and Monte Carlo integration. For $N$ uniform random points $x_i$, the Monte Carlo approximation

    Q f = \bar{f} = \frac{1}{N} \sum_{i=1}^{N} f_i,    (2)

with $f_i = f(x_i)$, is suitable for moderate to high dimensions $d$, or for an irregular integrand behavior or integration domain. In the latter case, the domain can be enclosed in a cube, with the integrand set to zero outside of $D$.
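The structure of a GPU implementation of (2) can be outlined with a small, self-contained CUDA sketch. This is our illustration rather than PARINT code: the integrand, the dimension DIM, the sample count and the kernel/function names (f_eval, mc_partial_sums) are placeholder assumptions. Each thread accumulates partial sums of $f_i$ and $f_i^2$ in a grid-stride loop; the host combines the per-thread partials into the estimate (2), and the sums of squares feed the variance estimate derived in Section 2. For simplicity the uniform points are generated on the host here; Section 3 discusses device-side parallel PRNGs.

// mc_demo.cu -- minimal sketch, not PARINT code. Build: nvcc mc_demo.cu
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

#define DIM 6   /* dimension d; an assumption for this example */

/* Example integrand: f(x) = prod_j x_j, exact integral 1/2^DIM on [0,1]^d. */
__device__ double f_eval(const double *x) {
    double p = 1.0;
    for (int j = 0; j < DIM; ++j) p *= x[j];
    return p;
}

/* Grid-stride loop: each thread records its partial sums of f_i and f_i^2;
 * the host combines them into Q f (eq. (2)) and the variance estimate. */
__global__ void mc_partial_sums(const double *pts, long n,
                                double *sum_f, double *sum_f2) {
    long tid = blockIdx.x * (long)blockDim.x + threadIdx.x;
    long stride = (long)gridDim.x * blockDim.x;
    double s = 0.0, s2 = 0.0;
    for (long i = tid; i < n; i += stride) {
        double fi = f_eval(&pts[i * DIM]);
        s  += fi;
        s2 += fi * fi;
    }
    sum_f[tid]  = s;
    sum_f2[tid] = s2;
}

int main(void) {
    const long N = 1L << 20;              /* number of sample points */
    const int threads = 256, blocks = 256;
    const int T = threads * blocks;

    /* Host-side uniform points for simplicity (see Section 3 for
     * device-side parallel PRNGs). */
    double *h_pts = (double *)malloc(N * DIM * sizeof(double));
    for (long i = 0; i < N * DIM; ++i) h_pts[i] = rand() / (RAND_MAX + 1.0);

    double *d_pts, *d_sf, *d_sf2;
    cudaMalloc(&d_pts, N * DIM * sizeof(double));
    cudaMalloc(&d_sf,  T * sizeof(double));
    cudaMalloc(&d_sf2, T * sizeof(double));
    cudaMemcpy(d_pts, h_pts, N * DIM * sizeof(double), cudaMemcpyHostToDevice);

    mc_partial_sums<<<blocks, threads>>>(d_pts, N, d_sf, d_sf2);

    double *h_sf  = (double *)malloc(T * sizeof(double));
    double *h_sf2 = (double *)malloc(T * sizeof(double));
    cudaMemcpy(h_sf,  d_sf,  T * sizeof(double), cudaMemcpyDeviceToHost);
    cudaMemcpy(h_sf2, d_sf2, T * sizeof(double), cudaMemcpyDeviceToHost);

    double sf = 0.0, sf2 = 0.0;
    for (int t = 0; t < T; ++t) { sf += h_sf[t]; sf2 += h_sf2[t]; }
    printf("Q f = %.10f (exact %.10f)\n", sf / N, 1.0 / (1 << DIM));

    free(h_pts); free(h_sf); free(h_sf2);
    cudaFree(d_pts); cudaFree(d_sf); cudaFree(d_sf2);
    return 0;
}

Keeping the accumulation in per-thread registers and reducing on the host avoids atomic operations; the accuracy and stability of such summations is taken up in Section 6.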
The main elements of MC methods are the error bounding and the underlying Pseudo-Random Number Generators (PRNGs), which are discussed in Sections 2 and 3, respectively. In particular, Section 3 covers the implementation and use of parallel PRNGs in CUDA. Section 5 describes dynamic parallelism, which became available with CUDA 5.0 for GPU cards supporting compute capability 3.0 and up. The recursive or "vertical" version in Section 5.1 and the iterative or "horizontal" version in Section 5.2 are utilized for an efficient GPU implementation of automatic integration in Section 5.3, which alleviates problems with memory restrictions and kernel launch overhead. Section 6 further addresses issues with the accuracy and stability of the summations involved in the MC approximation of the result and the sample variance. In Appendix A we include results showing speedups and accuracy for a Feynman loop integral arising in high energy physics [17, 7]. Previously we covered problems from computational geometry in [5].

2 MC Convergence

In the following we give a derivation of the error estimate for the one-dimensional case [4]; a similar formulation can be given for $d > 1$.

Let $x_1, x_2, \ldots$ be random variables drawn from a probability distribution with density function $\mu(x)$, $\int_{-\infty}^{\infty} \mu(x)\,dx = 1$, and assume that the expected value of $f$, $I = \int_{-\infty}^{\infty} f(x)\,\mu(x)\,dx$, exists. For example, $x_1, x_2, \ldots$ may be selected at random from the uniform distribution on $[0,1]$: $\mu(x) = 1$ for $0 \le x \le 1$, and $0$ otherwise. According to the Central Limit Theorem, the variance

    \sigma^2 = \int_{-\infty}^{\infty} (f(x) - I)^2\,\mu(x)\,dx = \int_{-\infty}^{\infty} f^2(x)\,\mu(x)\,dx - I^2    (3)

allows for an integration error bound in terms of the probability

    P\Big( E f \le \frac{\lambda\sigma}{\sqrt{N}} \Big) = P_I + O\Big(\frac{1}{\sqrt{N}}\Big),    (4)

where $P_I$ represents the confidence level $P_I = \frac{1}{\sqrt{2\pi}} \int_{-\lambda}^{\lambda} e^{-x^2/2}\,dx$ as a function of $\lambda$. For example, $P_I = 50\%$ for $\lambda = 0.6745$; $P_I = 90\%$ for $\lambda = 1.645$; $P_I = 95\%$ for $\lambda = 1.96$; $P_I = 99\%$ for $\lambda = 2.576$. In other words we can state that, e.g., with probability or confidence level of 99\%, the error satisfies $E \le 2.576\,\sigma/\sqrt{N}$. The behavior of the error bound in (4) represents the $1/\sqrt{N}$ law of MC integration. Note that (4) assumes that $\sigma$ exists, which relies on the integrand being square-integrable, even though square-integrability is not required for the convergence of the MC method [4].

To calculate the error estimate we employ the sample variance estimate

    \bar{s}^2 = \frac{1}{N-1} \sum_{i=1}^{N} (f_i - \bar{f})^2
              = \frac{1}{N-1} \Big( \sum_{i=1}^{N} f_i^2 - 2 \sum_{i=1}^{N} f_i \bar{f} + N \bar{f}^2 \Big)
              = \frac{1}{N-1} \Big( \sum_{i=1}^{N} f_i^2 - 2N \bar{f}^2 + N \bar{f}^2 \Big)
              = \frac{1}{N-1} \Big( \sum_{i=1}^{N} f_i^2 - N \bar{f}^2 \Big).    (5)

We also make use of simple antithetic variates, a well-known variance reduction method, where for each sample point $x$ in (2) an evaluation is also performed at the point $x^* = 1 - x$ (symmetric with respect to the centroid of the cube) [9]. Thus (2) becomes

    Q f = \frac{1}{N} \sum_{i=1}^{N} \frac{f_i + f_i^*}{2},    (6)

and we update the variance estimate of (5) to

    \bar{s}^2 = \frac{1}{N-1} \Big( \sum_{i=1}^{N} \Big(\frac{f_i + f_i^*}{2}\Big)^2 - N \Big(\frac{\sum_{i=1}^{N} (f_i + f_i^*)}{2N}\Big)^2 \Big).

In view of (4) we can then use

    E \le \lambda \Bigg( \frac{1}{N(N-1)} \Big( \sum_{i=1}^{N} \Big(\frac{f_i + f_i^*}{2}\Big)^2 - N \Big(\frac{\sum_{i=1}^{N} (f_i + f_i^*)}{2N}\Big)^2 \Big) \Bigg)^{1/2}    (7)

as an error estimate or bound (under certain conditions).
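In an implementation, (6) and (7) reduce to simple bookkeeping: during sampling it suffices to accumulate the sums of $g_i = (f_i + f_i^*)/2$ and of $g_i^2$. The following host-side helper is a sketch with names of our choosing, not PARINT's interface; it turns those two sums into the estimate and the error bound, with $\lambda = 2.576$ giving the 99\% confidence factor quoted above.

#include <math.h>

/* Inputs:  sum_g  = sum of g_i = (f_i + f_i^*)/2 over N antithetic pairs,
 *          sum_g2 = sum of g_i^2,
 *          lambda = confidence factor (e.g. 2.576 for 99%).
 * Outputs: *Q = estimate (6), *E = error bound (7).
 */
static void mc_estimate(double sum_g, double sum_g2, long N,
                        double lambda, double *Q, double *E)
{
    *Q = sum_g / N;                                      /* eq. (6) */
    double var = (sum_g2 - N * (*Q) * (*Q)) / (N - 1);   /* eq. (5) with g_i */
    if (var < 0.0) var = 0.0;                            /* guard vs. roundoff */
    *E = lambda * sqrt(var / N);                         /* eq. (7) */
}

Note that the subtraction in the one-pass variance formula is itself prone to cancellation when the integrand is nearly constant; this is among the summation stability issues revisited in Section 6.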
A number of other variance reduction techniques besides antithetic variates can be applied to Monte Carlo integration, such as control variates, importance sampling and stratified sampling. These can be included in a numerical integration package such as PARINT using the framework presented here, but they are not the focus of this chapter and will not be discussed further.

3 Pseudo-Random Number Generators (PRNGs)

A pseudo-random number generator (PRNG) is a deterministic procedure that generates a sequence of numbers which behaves like a random sequence and can be tested for adherence to relevant statistical properties. Several PRNG testing packages are available [20, 2, 18]. Marsaglia's DIEHARD [20, 2] has by now been superseded by the more stringent SmallCrush, Crush and BigCrush test batteries of L'Ecuyer's TestU01 framework [18].

Whereas function sampling for Monte Carlo integration is considered embarrassingly parallel, some applications involve very large sets of evaluations, thus requiring efficient, high-quality random number streams on multilevel architectures containing distributed nodes, multicore processors and many-core accelerators. In particular, good pseudo-random sequences are required for multiple threads or processes, and a much larger period of the sequence may be needed in view of the intensified sampling and larger problem sizes. Choosing new PRNGs may also necessitate validating streams of random numbers generated by different methods that are used together. The brief overview of parallel PRNGs in this section is based on our AIAA SciTech 2014 paper [6].

Solutions for associating good quality streams of random numbers with many threads can be sought by [19]: (i) employing a PRNG with a long period and generating subsequences started with different seeds; (ii) dividing a random number sequence with a long period into non-overlapping subsequences; and (iii) using different PRNGs of the same class with different parameters.

Continuous progress has been made in this area, and several open source PRNG packages are in wide use, such as the Scalable Parallel Random Number Generators Library (SPRNG) [32]. To assure statistically independent streams of random numbers, the SPRNG library incorporates particular parametrizations of additive lagged-Fibonacci, prime modulus linear congruential, and maximal-period shift-register generators.
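On the GPU, NVIDIA's cuRAND device API supports this kind of stream assignment directly: curand_init(seed, subsequence, offset, &state) gives each thread its own non-overlapping subsequence of the default XORWOW generator, in the spirit of approach (ii) above. The following is a minimal sketch of our own (generator choice, kernel names and sizes are arbitrary assumptions), in which each thread's running sum of uniform draws stands in for integrand sampling.

// curand_streams.cu -- illustrative sketch. Build: nvcc curand_streams.cu
#include <stdio.h>
#include <curand_kernel.h>

/* One PRNG state per thread: same seed, distinct subsequence index, so
 * threads draw from non-overlapping subsequences (approach (ii) above). */
__global__ void setup_states(curandState *states, unsigned long long seed) {
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    curand_init(seed, /* subsequence */ id, /* offset */ 0, &states[id]);
}

/* Each thread draws n uniforms; the sum stands in for integrand sampling. */
__global__ void sample(curandState *states, int n, float *out) {
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    curandState s = states[id];        /* copy state into registers */
    float acc = 0.0f;
    for (int i = 0; i < n; ++i)
        acc += curand_uniform(&s);     /* uniform on (0, 1] */
    states[id] = s;                    /* save state for later kernels */
    out[id] = acc;
}

int main(void) {
    const int threads = 128, blocks = 8, T = threads * blocks;
    curandState *d_states; float *d_out;
    cudaMalloc(&d_states, T * sizeof(curandState));
    cudaMalloc(&d_out, T * sizeof(float));
    setup_states<<<blocks, threads>>>(d_states, 1234ULL);
    sample<<<blocks, threads>>>(d_states, 1000, d_out);
    float h;
    cudaMemcpy(&h, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    printf("thread 0: mean draw = %f (expect about 0.5)\n", h / 1000.0f);
    cudaFree(d_states); cudaFree(d_out);
    return 0;
}

Copying the state into a local variable for the sampling loop and writing it back afterwards keeps the generator state in fast registers, and preserving the state across kernel launches ensures that successive kernels continue the same streams rather than restarting them.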