Density Matrix Quantum Circuit Simulation Via the BSP Machine on Modern GPU Clusters

Industrial and Systems Engineering Density Matrix Quantum Circuit Simulation via the BSP Machine on Modern GPU Clusters ANG LI1,OMER SUBASI1,XIU YANG2, AND SRIRAM KRISHNAMOORTHY3 1Pacific Northwest National Laboratory (PNNL), Richland, WA, 99354, USA 2Department of Industrial and Systems Engineering, Lehigh University, Bethlehem, PA, USA 3Washington State University, Pullman, WA, 99164, USA ISE Technical Report 20T-029 Density Matrix Quantum Circuit Simulation via the BSP Machine on Modern GPU Clusters † ‡ Ang Li∗, Omer Subasi∗, Xiu Yang∗ and Sriram Krishnamoorthy∗ ∗Pacific Northwest National Laboratory (PNNL), Richland, WA, 99354, USA †Lehigh University, Bethlehem, PA, 18015, USA ‡Washington State University, Pullman, WA, 99164, USA Email: [email protected], [email protected], [email protected], [email protected] ________ L Abstract—As quantum computers evolve, simulations of quan- j0 = 0 H H tum programs on classical computers will be essential in vali- | i | i • • • _____ ___ ______L __ dating quantum algorithms, understanding the effect of system j1 = 0 H S H noise, and designing applications for future quantum computers. | i | i • • _ ____ ___ ______L __ In this paper, we first propose a new multi-GPU programming j2 = 0 H T S H methodology called MG-BSP which constructs a virtual BSP | i | i • _ ____ ___ machine on top of modern multi-GPU platforms, and apply s this methodology to build a multi-GPU density matrix quantum 0 | i U 4 U 2 U simulator called DM-Sim. We propose a new formulation that s1 can significantly reduce communication overhead, and show | i that this formula transformation can conserve the semantics Fig. 1: Quantum circuit. The horizontal lines represent qubits. The despite noise being introduced. We build the tool-chain for the blocks along the lines represent gates. Execution is from left to right. simulator to run open standard quantum assembly code, execute synthesized quantum circuits, and perform ultra-deep and large- for quantum-inspired algorithms where an algorithm derived scale simulations. We evaluated DM-Sim on several state-of-the- for QC is reversely deployed on a classical computer [59]. art multi-GPU platforms including NVIDIA’s Pascal/Volta DGX- 1, DGX-2, and ORNL’s Summit supercomputer. In particular, we A variety of quantum circuit simulators on classical com- have demonstrated the simulation of one million general gates puters have been proposed [1]. However, most of them target in 94 minutes on DGX-2, far deeper circuits than has been logical qubits in an ideal isolated environment where full demonstrated in prior works. Our simulator is more than 10x gate fidelity can be asserted, rather than physical qubits in faster with respect to the corresponding state-vector quantum simulators on GPUs and other platforms. The DM-Sim simulator a practical open environment with unavoidable noise. Since is released at: http://github.com/pnnl/DM-Sim. QC simulation demands huge amount of memory [11], these 2n 1 simulators often rely on state vector ψ = ∑ − αi i to | i i=0 | i I. INTRODUCTION conserve the pure quantum states. The transformation by Despite holding substantial promise, quantum computing applying a quantum gate described as an unitary operator U thus is ψ U ψ . However, when noise is introduced, (QC) based on today’s noisy-intermediate-scale-quantum de- | i → | i vices (NISQ) [72] is still distant from beating classical super- we have to deal with a mixed state comprising a statistical computers. One major limitation is the error, particularly the ensemble of multiple distinct quantum systems in a density decoherence of qubits which introduces significant uncertainty matrix. In an ensemble, if the M different quantum systems are in states ψ j with probability Pj (1 j M), the density to the quantum states and corrupts the functionality of the | i M ≤ ≤ matrix can be expressed as ρ = ∑ Pj ψ j ψ j . However, Pj quantum circuit. The stable coherence duration varies across j=1 | ih | different QC technologies, from microseconds to seconds with in a real quantum physical system is typically statistical rather different error rate [2] and readout fidelity [3]. Consequently, than determinable. Sometimes it is unknown on purpose [9]. identifying how the introduced error propagates among qubits There are two fundamental types of quantum gate errors: (i) and along the circuit becomes a critical issue for QC research. Coherent errors conserve the purity of the input state. Instead Directly inspecting the intermediate states of a physical of executing U, another unitary operation U is essentially quantum computer is, however, infeasible. Due to fundamen- applied. (ii) Incoherent errors do not conserve the purity of tal quantum rules, whenever a measurement is applied to the input state. As a result, the state transitione can only be certain qubits, it destroys the superposition state and alters described through a density matrix [65], which contains all the computation logic. Consequently, simulating the quantum information necessary to decide the probability of any out- circuit (see Figure 1) through classical computers becomes a come of the circuit in any future measurement. Therefore, to necessary approach to unfold the black-box, investigate the simulate quantum circuits for NISQ devices lacking quantum error, and validate the quantum algorithm and hardware in a error correction (QEC) support, being able to simulate using more tractable approach. This is particularly the case when density matrix can be crucial, sometimes inevitable [9], [35]. theoretical bounds are inherently imprecise (e.g., Trotter error The major difference between density matrix and state bounds for time evolution of a Hamiltonian [8]). Furthermore, vector simulation is that: to simulate the same amount of efficient classical simulations can also form the starting point qubits n, the memory access and occupation, the computation, SC20, November 9-19, 2020, Is Everywhere We Are U.S. Government work not protected by U.S. copyright the network data exchange all scale in O(4n) rather than while the processing of a single gate is in a streaming manner O(2n), leading to great pressure on the hardware resources with rare data reuse. Traditionally, GPUs lack whole-device consumption and dramatically expanded simulation time. In inter-thread-block synchronization mechanism and GPU-side- fact, existing work simulates n-qubits density-matrix through initiated communication mechanism. As a result, existing 2n-qubits state vector [33]. Due to the different simulation pur- designs tend to simulate one gate per GPU kernel followed by poses, in this work we do not seek to simulate more qubits than CPU-side communication or synchronization, introducing con- state-of-the-art state-vector simulators [13], [68], [84] through siderable overhead from repeated kernel invocation & release techniques such as using secondary storage [68], presuming [22] (GPU kernel start latency can be as long as 20µs [36]), particular gates [13], and lossy compression [84]. Rather, this and frequent CPU-GPU execution transition. This overhead work attempts to tackle three major challenges: (i) Circuit is further amplified with multi-GPUs and multi-nodes; (III) depth: As a well-known approach, quantum simulators tend Interconnect. Despite this being the first simulator using multi- to decompose each multi-qubits gate into a sequence of 1- GPUs for density matrix simulation, an existing work has qubit or 2-qubits gates [4], [6], [7], [15], [22]–[24], [26], already applied multi-GPUs for state vector simulation [86]. [27], [34], the actual circuit for useful quantum algorithms However, it did not leverage the recent advancement in GPU can be extremely deep, such as quantum chemistry simulation interconnect, partially due to the CPU-centric programming (e.g., 1014 gates [73], 1018 gates [65]), quantum approximate model where all GPUs are managed by a master CPU and data optimization algorithm (QAOA) [17], and quantum neural exchange is only between CPU and GPUs; (IV) Optimization. networks [18]. This is exacerbated when hardware constraints Traditional simulators tend to precisely reflect particular QC are taken into consideration [51]. However, existing simulators technology parameters [66], rather than formulating the pro- often simulate from a single gate to a few hundreds of cess from an efficient simulation angle, leading to redundant gates [15], [22], [24], [26], [34], [68]. (ii) Performance: computation and communication. Due to the large problem size, density matrix simulation In this paper, we propose a new programming methodology can be remarkably slow. This condition can be even worse called MG-BSP that constructs a virtual Bulk-Synchronous- with very deep circuits and the fact that simulations are Parallel machine on top of modern multi-GPU platforms. We often repeated many times to, for example, obtain converged demonstrate its programmability and efficiency by applying distribution sampling, study the influence of hyper-parameters it to the density matrix quantum circuit simulation problem, and noise, and train a quantum machine learning model, etc. and evaluate the performance on several multi-GPU platforms, (iii) Programming flexibility: Although the major target is to including NVIDIA DGX-1 [61], DGX-2 [63] and ORNL support arbitrary density matrices for error study, defining new Summit HPC. The results demonstrate the effectiveness, flex- gates with advanced optimization should be straightforward. In ibility and scalability of MG-BSP and the DM-Sim simulator. other words, the simulator should ensure the programmability Particularly, we show that a 15-qubit density matrix simulation of the quantum gates while conserving execution efficiency. (i.e., the same scale as a 30-qubit state vector simulation) with Due to massive fine-grained parallelism, intensive double- 1 million arbitrary gates can be accomplished in 94 minutes precision compute, scarce memory bandwidth demand and ex- ( 5.6 ms/gate) on the NVIDIA DGX-2 system with 16 Volta haustive inter-processor communication, state-of-the-art multi- V100≈ GPUs, far deeper and quicker than has been demon- GPU HPC systems become the ideal platforms for density ma- strated before.

Load more