PAGANI: a Parallel Adaptive GPU Algorithm for Numerical Integration

Ioannis Sakiotis, Old Dominion University, Norfolk, Virginia, USA
Kamesh Arumugam, NVIDIA, Santa Clara, California, USA
Marc Paterno, Fermi National Accelerator Laboratory, Batavia, Illinois, USA
Desh Ranjan, Old Dominion University, Norfolk, Virginia, USA
Balša Terzić, Old Dominion University, Norfolk, Virginia, USA
Mohammad Zubair, Old Dominion University, Norfolk, Virginia, USA

arXiv:2104.06494v2 [cs.DC] 23 Jun 2021

ABSTRACT

We present a new adaptive parallel algorithm for the challenging problem of multi-dimensional numerical integration on massively parallel architectures. Adaptive algorithms have demonstrated the best performance, but efficient many-core utilization is difficult to achieve because the adaptive work-load can vary greatly across the integration space and is impossible to predict a priori. Existing parallel algorithms utilize sequential computations on independent processors, which results in bottlenecks due to the need for data redistribution and processor synchronization. Our algorithm employs a high-throughput approach in which all existing sub-regions are processed and sub-divided in parallel. Repeated sub-region classification and filtering improves upon a brute-force approach and allows the algorithm to make efficient use of computation and memory resources. A CUDA implementation shows orders of magnitude speedup over the fastest open-source CPU method and extends the achievable accuracy for difficult integrands. Our algorithm typically outperforms other existing deterministic parallel methods.

CCS CONCEPTS

• Computing methodologies;

KEYWORDS

adaptive, multi-dimensional, integration, deterministic

ACM Reference Format:
Ioannis Sakiotis, Kamesh Arumugam, Marc Paterno, Desh Ranjan, Balša Terzić, and Mohammad Zubair. 2021. PAGANI: A Parallel Adaptive GPU Algorithm for Numerical Integration. In Proceedings of St. Louis ’21: The International Conference for High Performance Computing, Networking, Storage, and Analysis (St. Louis ’21). ACM, New York, NY, USA, 12 pages. https://doi.org/10.1145/nnnnnnn.nnnnnnn

Work supported by the Fermi National Accelerator Laboratory, managed and operated by Fermi Research Alliance, LLC under Contract No. DE-AC02-07CH11359 with the U.S. Department of Energy. The U.S. Government retains and the publisher, by accepting the article for publication, acknowledges that the U.S. Government retains a non-exclusive, paid-up, irrevocable, world-wide license to publish or reproduce the published form of this manuscript, or allow others to do so, for U.S. Government purposes. FERMILAB-CONF-21-081-SCD.
This work is supported by Jefferson Science Associates, LLC, under U.S. Department of Energy (DOE) Contract No. DE-AC05-06OR23177.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
St. Louis ’21, November 15–19, 2021, St. Louis, MO
© 2021 Association for Computing Machinery.
ACM ISBN 978-x-xxxx-xxxx-x/YY/MM...$15.00

1 INTRODUCTION

The capability to perform multi-dimensional numerical integration is a recurrent need in applications across various fields such as finance, physics and computer graphics. Examples of such applications include risk management analysis and ray-tracing computations [1][2]. Applications of particular interest to the authors are parameter estimation in cosmological models of galaxy clusters and simulation of beam dynamics [3][4]. Two types of computational methods are typically used for numerical integration: deterministic and probabilistic. The two most desired features of a good computational method are accuracy and speed. Additionally, since none of the methods will produce accurate results all the time, it is equally important that the methods provide a reasonable error-estimate on the integral estimates they compute. Most deterministic integration methods are quadrature based. Usually, the quadrature rules used to estimate the integral in a region require evaluation of the integrand on a number of points that grows exponentially with the number of dimensions. While this fact renders the deterministic algorithms unsuitable for very high dimensions, where probabilistic algorithms can still be employed to some extent, our experiments have shown that on a CPU-platform, probabilistic algorithms such as Vegas, Suave, and Divonne are consistently outperformed by a deterministic algorithm like Cuhre on a wide variety of integrals of moderate dimensions [5][6][7]. Furthermore, deterministic quadrature methods are of particular importance to the experimental sciences, due to their computation of error-estimates that provide some transparency and confidence in the quality of integration results.

This paper presents a deterministic parallel algorithm for multi-dimensional numerical integration suitable for GPUs that outperforms existing deterministic methods.

In principle, numerical integration can be carried out very simply: divide the integration region into “many” ($m$) smaller sub-regions, “accurately” estimate the integral in each sub-region individually ($I_i$) and simply add these estimates to get an estimate for the integral over the entire region ($\sum_{i=1}^{m} I_i$). If we use a simple way of creating the sub-regions, e.g. via dividing each dimension into $d$ equal parts, the boundaries of the sub-regions are easy to calculate and, if $d$ is large enough, one might expect the integrand to not vary too much within each of the $d^n$ individual smaller regions, so it is easy to estimate the integral. Not only is this method simple, it is “embarrassingly parallel” as integral estimates for each region can be computed completely independently. Unfortunately, this approach is infeasible for higher dimensions as $d^n$ grows exponentially with $n$. For example, if $n = 10$ and we need to split each dimension into $d = 20$ parts, the number of sub-regions created would be $20^{10}$, which is roughly $10^{13}$. Moreover, uniform division of the integration region is not the best way to estimate the integral. The intuition is that the regions where the integrand is “well-behaved” do not need to be sub-divided finely to get a good estimate of the integral. Regions where it is “ill-behaved” (e.g. sharp peaks, many oscillations) require finer sub-division for a reliable, accurate estimate. However, when devising a general numerical integration method, we cannot assume knowledge of the integrand’s behavior. Hence, we cannot split up the integration region in advance with fewer (but perhaps larger in volume) sub-regions where the integrand is “well-behaved” and a greater number of smaller sub-regions in the region where it is “ill-behaved”. These two reasons provide motivation for designing adaptive numerical-integration methods where the determination of the behavior of the integrand in a region and creation of sub-regions is done adaptively. A sub-region is split into smaller regions only if it is deemed that the integral estimate calculated for the sub-region is not accurate enough.

Several adaptive integration methods have been designed and implemented in the past and are available for use (Cuba, QUADPACK, NAG, and MSL) [8][7][9][10][11]. Most of these, however, can require prohibitively long computation times for high-dimensional integrands, especially when requiring high levels of accuracy [12]. As before, any adaptive integration method can itself be paral- [...]

Figure 1: Parallelization of adaptive integration routines requires the assignment of sub-regions to parallel processors (P0 to P15 in the figure). The impossible-to-predict work-load imbalance is evident on processors 0 and 1, which reach a greater depth on the sub-region tree and perform far more sub-divisions.

[...] processors cannot decide when to terminate their local execution or predict how many resources they require. This leads to either costly synchronization/data redistribution or devoting significant computational resources (both execution time and memory use) to regions that may not contribute significantly to either the estimate of the integral or the error estimate.

We propose a new deterministic, parallel adaptive algorithm for multi-dimensional integration for massively parallel architectures. It is inspired by the Cuhre method of the Cuba library first introduced in [7] and its parallel GPU-adaptation [12][15]. Unlike other parallel methods such as [12][15], the proposed PAGANI algorithm does not utilize the common sequential scheme seen in adaptive integration. This avoids sequential computations in favor of greater parallelization. PAGANI further differs from [12][15] in some key aspects, specifically those pertaining to handling the termination condition and load-balancing. Our tests on a standard test suite of integrals show that PAGANI is at least as accurate as Cuhre
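
The introduction's contrast between uniform and adaptive sub-division can be illustrated with a small sketch. It is not PAGANI and not the Cuhre rule. It is a generic one-dimensional, sequential adaptive loop that repeatedly splits the sub-region with the largest error estimate. The Simpson-versus-trapezoid error estimate, the function names, the tolerance, and the peaked test integrand are all assumptions chosen for illustration.

```python
import heapq
import math


def uniform_region_count(d, n):
    """Sub-regions produced by cutting each of n dimensions into d equal parts."""
    return d ** n


def simpson_with_error(f, lo, hi):
    """Estimate the integral of f on [lo, hi] plus a crude error estimate.

    Assumption for illustration: Simpson's rule is the estimate and its
    distance from the trapezoid rule serves as the error estimate.
    """
    mid = 0.5 * (lo + hi)
    f_lo, f_mid, f_hi = f(lo), f(mid), f(hi)
    trapezoid = 0.5 * (hi - lo) * (f_lo + f_hi)
    simpson = (hi - lo) / 6.0 * (f_lo + 4.0 * f_mid + f_hi)
    return simpson, abs(simpson - trapezoid)


def adaptive_integrate(f, lo, hi, rel_tol=1e-4, max_regions=100_000):
    """Bisect the sub-region with the largest error estimate until converged."""
    est, err = simpson_with_error(f, lo, hi)
    heap = [(-err, lo, hi, est)]  # heapq is a min-heap, so store -error
    total_est, total_err = est, err
    while total_err > rel_tol * abs(total_est) and len(heap) < max_regions:
        neg_err, a, b, region_est = heapq.heappop(heap)
        total_est -= region_est
        total_err += neg_err  # neg_err is -error, so this subtracts it
        mid = 0.5 * (a + b)
        for left, right in ((a, mid), (mid, b)):
            child_est, child_err = simpson_with_error(f, left, right)
            heapq.heappush(heap, (-child_err, left, right, child_est))
            total_est += child_est
            total_err += child_err
    # Re-sum from the surviving regions to avoid accumulated round-off.
    estimate = math.fsum(item[3] for item in heap)
    error = math.fsum(-item[0] for item in heap)
    regions = sorted((item[1], item[2]) for item in heap)
    return estimate, error, regions


PEAK_WIDTH = 0.01  # hypothetical "ill-behaved" integrand: sharp peak at x = 0.3


def peak(x):
    return 1.0 / (PEAK_WIDTH ** 2 + (x - 0.3) ** 2)


if __name__ == "__main__":
    print(uniform_region_count(20, 10))  # the 20^10 count from the text
    estimate, error, regions = adaptive_integrate(peak, 0.0, 1.0)
    widths = [b - a for a, b in regions]
    print(estimate, error, len(regions), min(widths), max(widths))
```

In this sketch the surviving sub-regions are narrow near the peak and wide in the tails, which is the behavior the text attributes to adaptive methods. The loop handles one region per iteration, so it is an instance of the sequential pattern that the paper contrasts with its parallel approach.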
