2019 28th International Conference on Parallel Architectures and Compilation Techniques (PACT)

Accelerating DCA++ (Dynamical Cluster Approximation) Scientific Application on the Summit supercomputer

Giovanni Balduzzi¶ (ETH Zurich, 8093 Zurich, Switzerland), [email protected]
Arghya Chatterjee¶ (Oak Ridge National Laboratory, Oak Ridge 37831, U.S.), [email protected]
Ying Wai Li¶† (Los Alamos National Laboratory, Los Alamos 87545, U.S.), [email protected]
Peter W. Doak (Oak Ridge National Laboratory, Oak Ridge 37831, U.S.), [email protected]
Urs Haehner (ETH Zurich, 8093 Zurich, Switzerland), [email protected]
Ed F. D'Azevedo (Oak Ridge National Laboratory, Oak Ridge 37831, U.S.), [email protected]
Thomas A. Maier (Oak Ridge National Laboratory, Oak Ridge 37831, U.S.), [email protected]
Thomas Schulthess‡‡ (ETH Zurich, 8093 Zurich, Switzerland), [email protected]

¶These three authors contributed equally to the manuscript.
†Ying Wai Li contributed to this work mostly during her previous appointment at Oak Ridge National Laboratory, Oak Ridge 37831, U.S.
‡‡Thomas Schulthess is also affiliated with the Swiss National Supercomputing Center (CSCS), 6900 Lugano, Switzerland.
¹World's fastest supercomputer, with a theoretical performance of 200 peta-FLOPS (PFLOPS), as of June 2019 [1].

This manuscript has been co-authored by UT-Battelle, LLC under Contract No. DE-AC05-00OR22725 with the U.S. Department of Energy. The United States Government retains and the publisher, by accepting the article for publication, acknowledges that the United States Government retains a non-exclusive, paid-up, irrevocable, world-wide license to publish or reproduce the published form of this manuscript, or allow others to do so, for United States Government purposes. The Department of Energy will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (http://energy.gov/downloads/doe-public-access-plan).

978-1-7281-3613-4/19/$31.00 ©2019 IEEE, DOI 10.1109/PACT.2019.00041

Abstract—Optimizing scientific applications on today's accelerator-based high performance computing systems can be challenging, especially when multiple GPUs and CPUs with heterogeneous memories and persistent non-volatile memories are present. An example is Summit, an accelerator-based system at the Oak Ridge Leadership Computing Facility (OLCF) that is rated as the world's fastest supercomputer to date. New strategies are thus needed to expose the parallelism in legacy applications, while being amenable to efficient mapping to the underlying architecture.

In this paper we discuss our experiences and strategies to port a scientific application, DCA++, to Summit. DCA++ is a high-performance research application that solves quantum many-body problems with a cutting-edge quantum cluster algorithm, the dynamical cluster approximation. Our strategies aim to synergize the strengths of the different programming models in the code. These include: (a) streamlining the interactions between the CPU threads and the GPUs, (b) implementing computing kernels on the GPUs and decreasing the CPU-GPU memory transfers, (c) allowing asynchronous GPU communications, and (d) increasing compute intensity by combining linear algebraic operations.

Full-scale production runs using all 4600 Summit nodes attained a peak performance of 73.5 PFLOPS with a mixed precision implementation. We observed perfect strong and weak scaling for the quantum Monte Carlo solver in DCA++, while encountering about 2× input/output (I/O) and MPI communication overhead on the time-to-solution for the full machine run. Our hardware-agnostic optimizations are designed to alleviate the communication and I/O challenges observed, while improving the compute intensity and obtaining optimal performance on a complex, hybrid architecture like Summit.

Index Terms—DCA, Quantum Monte Carlo, QMC, CUDA, CUDA-aware MPI, Summit@OLCF, Spectrum MPI

I. INTRODUCTION

With rapidly changing microprocessor designs, the next generation of high performance computers will have massive amounts of hierarchical memory available on the node, from user-managed caches and DRAM to high bandwidth memory and non-volatile memories (NVMs). At the same time, exploring the best ways to integrate multiple programming models for collective optimization of performance remains one of the biggest challenges. Systems like OLCF's Summit¹ generate the majority of their floating point operations per second (FLOPS) from GPUs (as high as 97%), which are throughput-optimized with a large number of threads, while CPUs are latency-optimized. Thus, finding the right balance between memory access and computation work distribution, according to the underlying heterogeneous hardware and without sacrificing efficiency and scalability, is still an outstanding research problem.
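To make the latency-hiding side of this balance concrete, the following minimal CUDA sketch (ours, for illustration only; it is not taken from DCA++) shows the common pattern of queueing chunks of work on separate streams so that host-device copies of one chunk overlap with kernel execution on another:

// Minimal CUDA sketch (illustrative; not DCA++ code): chunks of work are
// queued on separate streams so that host-device copies of one chunk
// overlap with kernel execution on another, hiding transfer latency.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(float* x, int n, float a) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) x[i] *= a;
}

int main() {
  const int n = 1 << 20, n_chunks = 4, chunk = n / n_chunks;

  float *h, *d;
  cudaMallocHost(&h, n * sizeof(float));  // pinned host memory: required for
  cudaMalloc(&d, n * sizeof(float));      // truly asynchronous copies
  for (int i = 0; i < n; ++i) h[i] = 1.0f;

  cudaStream_t streams[n_chunks];
  for (int c = 0; c < n_chunks; ++c) cudaStreamCreate(&streams[c]);

  // Copy-in, compute, and copy-out for each chunk go on that chunk's stream.
  for (int c = 0; c < n_chunks; ++c) {
    float *hp = h + c * chunk, *dp = d + c * chunk;
    cudaMemcpyAsync(dp, hp, chunk * sizeof(float), cudaMemcpyHostToDevice, streams[c]);
    scale<<<(chunk + 255) / 256, 256, 0, streams[c]>>>(dp, chunk, 2.0f);
    cudaMemcpyAsync(hp, dp, chunk * sizeof(float), cudaMemcpyDeviceToHost, streams[c]);
  }
  cudaDeviceSynchronize();                // single sync point after all work is queued

  printf("h[0] = %.1f (expect 2.0)\n", h[0]);
  for (int c = 0; c < n_chunks; ++c) cudaStreamDestroy(streams[c]);
  cudaFree(d);
  cudaFreeHost(h);
  return 0;
}

The design point is that the only synchronization happens once all independent units of copy and compute work have been queued; this is the spirit of the asynchronous execution discussed next.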
The first step to addressing this issue is to understand how "asynchronous" execution models can be used to: (a) describe units of work; (b) let the runtime system efficiently schedule work to hide the latency in accessing various memory hierarchies and NUMA domains; and (c) determine if heterogeneous threads are a possible solution. Current programming models that provide "rigid" synchronization constructs (e.g., full system barriers at the end of loops) may not be able to scale due to the high overhead of the massive parallelism with threads (a minimal task-pool sketch contrasting the two styles follows the next paragraph).

In this work, we discuss the programming styles and strategies that were used in the development of DCA++. DCA++ implements a quantum cluster method [2] known as the dynamical cluster approximation (DCA) [2]–[4], with a quantum Monte Carlo (QMC) kernel for modeling strongly correlated electron systems. The DCA++ code currently uses three different programming models (MPI, CUDA, and C++ Standard threads), together with numerical libraries (BLAS, LAPACK, and MAGMA), to expose the parallelism in computations. Optimizing a single programming model in isolation does not necessarily help us achieve our efficiency and scalability goals. We discuss our vision of how a general programming style, or set of strategies, may be used to exploit the underlying memory hierarchy.
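As promised above, here is a minimal C++ sketch of a task pool in which callers enqueue units of work and synchronize on individual futures rather than on a global barrier. The names are hypothetical and this is not DCA++'s actual thread-pool implementation (contribution (a) below); it only illustrates the style:

// Minimal task-pool sketch (illustrative, hypothetical names; not the
// DCA++ thread pool): units of work are enqueued and waited on
// individually via futures, instead of a full-system barrier per loop.
#include <condition_variable>
#include <functional>
#include <future>
#include <memory>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

class ThreadPool {
 public:
  explicit ThreadPool(unsigned n_threads) {
    for (unsigned i = 0; i < n_threads; ++i)
      workers_.emplace_back([this] {
        while (true) {
          std::function<void()> task;
          {
            std::unique_lock<std::mutex> lock(mutex_);
            cv_.wait(lock, [this] { return stop_ || !tasks_.empty(); });
            if (stop_ && tasks_.empty()) return;   // drain queue, then exit
            task = std::move(tasks_.front());
            tasks_.pop();
          }
          task();                                  // run outside the lock
        }
      });
  }

  ~ThreadPool() {
    { std::lock_guard<std::mutex> lock(mutex_); stop_ = true; }
    cv_.notify_all();
    for (auto& w : workers_) w.join();
  }

  // Enqueue one unit of work; the returned future waits on exactly this task.
  template <class F>
  std::future<void> enqueue(F f) {
    auto task = std::make_shared<std::packaged_task<void()>>(std::move(f));
    std::future<void> result = task->get_future();
    {
      std::lock_guard<std::mutex> lock(mutex_);
      tasks_.emplace([task] { (*task)(); });
    }
    cv_.notify_one();
    return result;
  }

 private:
  std::vector<std::thread> workers_;
  std::queue<std::function<void()>> tasks_;
  std::mutex mutex_;
  std::condition_variable cv_;
  bool stop_ = false;
};

Per-task futures let a caller overlap its own work with the pool's and wait only on the results it actually depends on; a full-system barrier at the end of each loop would instead serialize every thread on the slowest task.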
A. Contribution

The primary contributions of this work are outlined below:
(a) A custom thread pool for on-node parallelization and for managing GPU driver threads.
(b) Improved delayed sub-matrix updates to increase GPU utilization.
(c) Asynchronous communication among multiple GPUs on Summit.
(d) Accumulation of complex measurements ported from the CPU to the GPU, accounting for a significant performance improvement in runtime and GPU utilization.
(e) A delayed non-uniform fast Fourier transform in the single-particle accumulation.
(f) Batched matrix-matrix multiplication for small 2D Fourier transforms in the two-particle accumulation.
(g) Mixed-precision operations on GPUs.

II. BACKGROUND

The study and understanding of strongly correlated electronic systems is one of the greatest challenges in condensed matter physics today. Strongly correlated electron systems are characterized by strong electron-electron interactions, which give rise to exotic states and fascinating properties such as multiferroicity and high-temperature superconductivity. These properties open the door for technological advances in applications such as data storage and MRI machines. However, current methodologies based on density functional theory cannot adequately capture these strong correlations, so model Hamiltonians are studied instead. In the two-dimensional (2D) Hubbard model, electrons hop between lattice sites and interact through an on-site Coulomb repulsion. Formally, the Hamiltonian is given by

H = H_0 + H_{\mathrm{int}} = -t \sum_{\langle i,j \rangle, \sigma} c_{i\sigma}^{\dagger} c_{j\sigma} + U \sum_{i} n_{i\uparrow} n_{i\downarrow} .   (1)

The first term of (1), where ⟨i, j⟩ indicates that the sum runs over nearest-neighbor sites i and j, represents the electron hopping with amplitude t. The second term, where the sum runs over all lattice sites i, captures the on-site Coulomb interaction of strength U. The index σ ∈ {↑, ↓} represents the electron spin. Systems with multiple electron bands per lattice site are also supported.

Mathematically, the Hamiltonian is represented by a matrix. Solving for the possible energy levels and states of the system is equivalent to solving for the eigenvalues and eigenstates of the Hamiltonian matrix. However, exact diagonalization studies of the 2D Hubbard model are restricted to very small lattices, as the problem size scales exponentially with the number of lattice sites (see the estimate at the end of this section). Quantum Monte Carlo simulations, in turn, are plagued by the infamous fermion sign problem, which again limits the accessible lattice sizes and prevents calculations at low temperatures [7]. To study Hubbard-model-type problems, dynamical mean field theory (DMFT) [8] has become one method of choice. In DMFT, the complexity of the infinite lattice problem is reduced by mapping it to a self-consistent solution of an effective impurity model, thereby treating only spatially local correlations exactly and including non-local correlations on a mean-field level. In order to treat additional non-local correlations that are important, for example, to understand the mechanism of high-temperature superconductivity, DMFT has been extended by quantum cluster approaches such as the dynamical cluster approximation (DCA) [2]–[4].

DCA is a numerical simulation tool to predict physical behaviors (such as superconductivity, magnetism, etc.) of correlated quantum materials [9]. The DCA++ code computes the many-body (single-particle and two-particle) Green's functions for a Hubbard-type material of interest. Properties of the materials, such as the superconducting transition temperature, can be calculated from these many-body Green's functions. DCA++ has an iterative, self-consistent algorithm with two
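The exponential scaling that limits exact diagonalization (noted above) can be quantified with a standard counting argument; this estimate is ours, not the paper's. Each site of the single-band Hubbard model has four states (empty, spin up, spin down, doubly occupied), so for a cluster of $N_s$ sites

\dim \mathcal{H} = 4^{N_s},
\qquad \text{e.g. } N_s = 16 \;\; (4 \times 4 \text{ cluster}): \quad
4^{16} = 2^{32} \approx 4.3 \times 10^{9},

and the Hamiltonian matrix has this dimension before any symmetries (particle number, total spin) are exploited, which is why exact diagonalization is confined to very small lattices.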