
arXiv:2102.02957v1 [quant-ph] 5 Feb 2021

Cache Blocking Technique to Large Scale Quantum Computing Simulation on Supercomputers

Jun Doi, IBM Quantum, IBM Research - Tokyo, Tokyo, Japan (doichan@jp.ibm.com)
Hiroshi Horii, IBM Quantum, IBM Research - Tokyo, Tokyo, Japan ([email protected])

Abstract—Classical computers require large memory resources and computational power to simulate quantum circuits with large numbers of qubits. Even parallel computing with huge numbers of distributed compute nodes can face a scalability issue because of the data movements between distributed memory spaces. Here, we apply a cache blocking technique to quantum circuits to decrease data movements: by inserting swap gates, all gates are moved onto a small set of qubits whose amplitudes are held in fast local memory. We implemented this technique in the open source simulation framework Qiskit Aer. We evaluated our simulator on GPU clusters and observed good scalability.

Index Terms—Quantum Computing Simulation, Cache blocking, parallel computing, GPGPU, GPU, Thrust, MPI

I. INTRODUCTION

Quantum computers with hundreds of qubits are becoming realistic in the near future and are expected to serve existing and future applications [1]–[3]. Some of these quantum computers are available as cloud services, but access to them is limited or their resources are still not enough for the development and evaluation of quantum applications. As well, noise still exists in quantum computers and the number of operations will be limited. Therefore, simulations of quantum computing running on classical computers are necessary for developing quantum applications as well as for developing and debugging actual quantum hardware in the coming decades. However, quantum computing simulation on classical computers requires huge amounts of memory and computational resources. A state vector simulator requires an array of length 2^n to be stored on a classical computer; for example, a 50-qubit simulation needs 16 petabytes worth of storage for double precision floating point numbers. This means that simulating large numbers of qubits on a classical computer is a form of supercomputing.

Universal quantum computing simulators [4]–[6] are now available for developing quantum applications with smaller numbers of qubits (around 20 qubits) on classical computers, even on desktop or laptop personal computers. To simulate rather more qubits (around 50 qubits), parallel simulators [7]–[11] must store the quantum state in the distributed memory of parallel-processing computers. Parallel computing has the advantage of accelerating simulations that require a huge amount of computational power. Graphics processing units (GPUs) are particularly useful in this aspect: thousands of hardware threads update probability amplitudes in parallel, and high bandwidth memory (HBM) reduces the bottlenecks of loading and storing the amplitudes. Some simulators [11]–[14] support the use of GPUs to accelerate simple and stable simulations of noisy and noiseless quantum computing on classical parallel computers.

In this study, we focus on accelerating quantum computing simulation on classical parallel computers. There are some approaches to decrease the memory usage of quantum computing simulation on classical computers. For example, data compression specialized for the simulation of quantum computers was introduced in [15], [16] to simulate quantum computers with more qubits on supercomputers. However, lossy data compression lacks fidelity and limits the quantum algorithms that can be simulated. Use of tensor networks [17] is another approach to reduce memory while noiseless computers are simulated by storing tensor matrices. Tensor-network-based quantum simulators can simulate specific applications where qubits are weakly interacted, but not general applications where the qubits are strongly interacted. We believe that the state vector simulator will remain the most popular approach for developing and debugging quantum algorithms, even though it requires huge memory to directly store 2^n probability amplitudes. Users can simply increase the number of qubits with additional computing nodes and available data storage.

We focus on optimizing a state vector simulator. Though a distributed system can provide enough memory, inefficient memory usage in state vector simulators is a critical problem when simulating many qubits; on hybrid parallel computers, communication overheads become inhibitors of performance when parallelizing a simulation. To parallelize a state vector simulator across distributed memory spaces, probability amplitudes of the quantum state must be exchanged between different memory spaces. In general, network bandwidth across different memory spaces is much lower than memory bandwidth inside a memory space. Minimizing data exchanges across the memory spaces of CPUs and GPUs is a key to scaling simulation performance.

Here, we propose a new technique to decrease the number of data exchanges across distributed memory spaces by combining quantum circuit optimization and parallel computing optimization. Qubit reordering or remapping are circuit optimizations used to map circuits onto the topology of quantum devices. Qubit reordering can be used to decrease data exchanges between distributed memory spaces, but it cannot remove all of them when gates are applied to large numbers of qubits. The technique we developed decreases data exchanges by moving all the gates onto a smaller number of qubits by inserting noiseless swap gates. The operation of the optimized circuits resembles cache blocking on a classical computer, and the concept of the optimization is similar, because we block the gates on the qubits that can be accessed faster.

Section II overviews quantum computing simulation and state vector simulators. In Section III, we describe how to parallelize state vector simulators on distributed computers, and in Section IV, we describe the circuit optimization and cache blocking implementation of a parallel state vector simulator. In Section V, we discuss performance evaluations on a hybrid parallel computer accelerated by GPUs. We conclude the paper in Section VI with mention of future work.

II. PARALLEL STATE VECTOR SIMULATION

A. State Vector Simulation Overview

We focus on simulating universal quantum computers based on quantum circuits that consist of one-qubit rotation gates and two-qubit CNOT gates. These gates are known to be universal; i.e., any quantum circuit (for realizing some quantum algorithm) can be constructed from CNOT and one-qubit rotation gates.

Quantum circuits run quantum computing programs by applying quantum gates to qubits. A qubit has two basis states, |0⟩ = (1, 0)^T and |1⟩ = (0, 1)^T. While a classical bit stores 0 or 1 exclusively, a qubit allows a superposition of the two basis states as |ψ⟩ = a_0 |0⟩ + a_1 |1⟩, where the probability amplitudes a_0 and a_1 are complex numbers satisfying |a_0|^2 + |a_1|^2 = 1. When a qubit is measured, an outcome of 0 or 1 is obtained with probability |a_0|^2 or |a_1|^2, respectively. This means two complex numbers must be stored in the simulation to track the quantum state of a one-qubit register.

The quantum state of an n-qubit register is a linear superposition of 2^n basis states, each of the form of the tensor product of n qubits. For example, the basis state of the digit '2' (or 10 in binary) can be represented as |2⟩ = |1⟩ ⊗ |0⟩ = |10⟩ = (0, 0, 1, 0)^T, where T denotes the transpose of a vector. Thus, the quantum state of the n-qubit register can be written as |ψ⟩ = Σ_{i=0}^{2^n−1} a_i |i⟩. Note that each state |i⟩ has its own probability amplitude a_i as a complex number. Overall, simulating quantum circuits requires that these 2^n complex numbers be stored to enable tracking of the evolution of the quantum state of an n-qubit register. We denote these complex numbers as 'qreg.'

Quantum gates transform the quantum state of the n-qubit register by rotating its complex vector. We focus on the gate set defined by the OpenQASM specification [18]. The set consists of an arbitrary single-qubit (rotation) gate (a u3 gate) and a two-qubit gate (controlled-not, CNOT or CX). The u3 gates are rotations of the 1-qubit state and are mathematically defined as

    u3(θ, ψ, λ) = [ cos(θ/2)           −e^{iλ} sin(θ/2)     ]
                  [ e^{iψ} sin(θ/2)     e^{i(ψ+λ)} cos(θ/2) ].        (1)

When it is applied to the k-th qubit of {q_n ... q_1}, a u3 gate transforms the probability amplitudes of the basis states |x_n ... x_{k+1} 0 x_{k−1} ... x_1⟩ and |x_n ... x_{k+1} 1 x_{k−1} ... x_1⟩ for each x_i in {0, 1} by a linear transformation of their previous probability amplitudes. This implies that applying a single-qubit u3 gate can change all 2^n complex numbers stored for tracking the evolution of the quantum state. However, the changes can be computed in parallel, as each of the new probability amplitudes depends only on its own value and on the value of the probability amplitude whose basis state differs at the k-th location.

The CNOT gate is applied to two qubits: a control and a target qubit. If the control qubit is in state |1⟩, the CNOT gate flips the target qubit. If the control qubit is in state |0⟩, the CNOT gate does not change the target qubit. If we have two qubits and take the higher qubit as the control qubit and the other as the target qubit, the CNOT gate is mathematically defined as

    CNOT = [ 1 0 0 0 ]
           [ 0 1 0 0 ]
           [ 0 0 0 1 ]
           [ 0 0 1 0 ].                                               (2)

When applied to the c-th and t-th qubits (control and target, respectively), the CNOT gate swaps the probability amplitude of |x_n ... (x_c = 1) ... (x_t = 0) ... x_1⟩ with that of |x_n ... (x_c = 1) ... (x_t = 1) ... x_1⟩. The swaps, which affect half of the probability amplitudes of the quantum state, can also be performed in parallel.

On a classical computer, the state vector is stored as an array of probability amplitudes, which are expressed as floating point complex numbers. Fig. 1 shows an example of the state vector array of a 4-qubit quantum circuit stored in the memory of a classical computer. If we store a probability amplitude as a double precision floating point complex number, an n-qubit state vector requires 16 × 2^n bytes of memory on a classical computer.

Quantum gate operations can be simulated on a classical computer by multiplying a matrix with pairs of probability amplitudes stored in the state vector array. Fig. 2 shows an example of a u3 gate, which rotates the 2-nd qubit (zero-based index) of a 4-qubit state vector stored in the memory of a classical computer. A 2 × 2 matrix is multiplied with pairs of probability amplitudes, which are selected by applying the XOR operation to their addresses; e.g. binary indices 0011 and 0111 become a pair through 0011 ⊕ 2^(k=2) = 0111. The u3 gate constructs pairs from all of the probability amplitudes in the state vector, meaning that all the elements in the state vector are accessed and updated to simulate a gate operation. This requires a byte/flop ratio of 2.29 in double precision, so simulating a gate operation on a classical computer is a memory-bandwidth bound problem.
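To make the pair selection concrete, the following is a minimal C++ sketch of how a single-qubit gate can be applied to a state vector array with the XOR-based pair indexing described above. It is an illustration only (the function and variable names are ours, not Qiskit Aer's), assuming a row-major 2 × 2 gate matrix.

#include <complex>
#include <cstddef>
#include <vector>

// Apply a 2x2 gate matrix m (row-major: m[0] m[1]; m[2] m[3]) to qubit k of an
// n-qubit state vector. Amplitudes whose indices differ only in bit k form a pair
// and are rotated together, as in Fig. 2.
void apply_single_qubit_gate(std::vector<std::complex<double>>& state,
                             const std::complex<double> m[4], int k) {
  const std::size_t add = std::size_t(1) << k;   // distance between paired amplitudes
  const std::size_t mask = add - 1;              // bits below qubit k
  const std::size_t pairs = state.size() / 2;    // one update per amplitude pair
  for (std::size_t i = 0; i < pairs; ++i) {
    // Spread index i so that bit k of i0 is 0; i1 is its partner with bit k set.
    const std::size_t i0 = ((i & ~mask) << 1) | (i & mask);
    const std::size_t i1 = i0 | add;
    const std::complex<double> q0 = state[i0];
    const std::complex<double> q1 = state[i1];
    state[i0] = m[0] * q0 + m[1] * q1;
    state[i1] = m[2] * q0 + m[3] * q1;
  }
}

Every amplitude is loaded and stored once per gate, which is exactly the memory-bandwidth bound access pattern discussed above; Section III shows how the same loop is expressed with Thrust so that it can also run on GPUs.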

Fig. 1. Example of a 4-qubit state vector array on a classical computer. The array has 2^4 = 16 complex numbers to store the probability amplitudes.

Fig. 2. A u3 gate applied to the 2-nd qubit (zero-based index). Pairs of probability amplitudes whose distance is 2^2 are selected and rotated by multiplying them with a 2 × 2 matrix.

Fig. 3. Distributing an n-qubit state vector to 4 distributed memory spaces. The number in each box shows the beginning index of the probability amplitudes in each sub-array.

Fig. 4. Naive implementation of probability amplitude exchange between 2 distributed memory spaces. In this example, we apply a gate to qubit k = n − 1. This implementation requires twice the memory space.

B. Chunk-based Parallelization

In a distributed computing environment, distributed processes on nodes and GPUs have their own memory spaces. Only when data is allocated to a memory space can its process perform calculations on it. To maximize parallelization, data should in general not be exchanged frequently across spaces. On the other hand, saving memory is very important to store the 2^n probability amplitudes: once their size exceeds the capacity of a memory space, some amplitudes are spilled out to slow disk storage.

We divide the state vector array into sub-arrays called chunks and allocate them to distributed memory spaces without duplication. Fig. 3 shows an example where an n-qubit state vector is divided over four distributed memory spaces with four chunks of 2^(n−2) probability amplitudes. Gate operations on qubits 0 to n − 3 are performed within a process, while data exchange is necessary for operations on the (n − 2)-th and (n − 1)-th qubits.

Fig. 4 shows a naive way to do this; it has a buffer to receive a copy of the probability amplitudes from remote memory. This implementation is very simple; however, it requires double the memory space to hold a copy of the remote memory and hence is not memory efficient. This hinders us from increasing the number of qubits to be simulated.

We instead use a chunk as the unit of data exchange. Double memory space is necessary for buffering if we divide the state vector into large chunks as shown in Fig. 4. Therefore, we divide the state vector into small chunks of 2^nc amplitudes and exchange them in a pipeline to minimize the space needed for buffering. Fig. 5 shows an example that uses only one buffer to exchange chunks.

Fig. 5. Dividing a state vector into small chunks and performing the probability amplitude exchange by chunks. Only one additional chunk per memory space is needed to perform the probability amplitude exchange.

In summary, a gate operation on chunks distributed over m memory spaces falls into three types based on its target qubit k: if k is lower than nc, the gate can be processed independently inside each chunk; if k is lower than n − log2(m), other chunks in the same memory space must be referred to; otherwise, chunks must be exchanged across memory spaces. By minimizing gate operations on qubits greater than or equal to n − log2(m), data exchanges across memory spaces are reduced.
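This classification can be written down directly; the following is a small illustrative helper (our own naming, not part of the simulator) that maps a gate's target qubit to one of the three cases, with m assumed to be a power of two:

#include <cmath>

enum class GateScope {
  InsideChunk,        // k < nc: processed independently in every chunk
  InsideMemorySpace,  // nc <= k < n - log2(m): needs other chunks of the same memory space
  AcrossMemorySpaces  // k >= n - log2(m): chunks must be exchanged between memory spaces
};

// n: total number of qubits, nc: chunk qubits, m: number of distributed memory spaces
GateScope classify_gate(int target_qubit, int n, int nc, int m) {
  const int boundary = n - (int)std::lround(std::log2((double)m));
  if (target_qubit < nc) return GateScope::InsideChunk;
  if (target_qubit < boundary) return GateScope::InsideMemorySpace;
  return GateScope::AcrossMemorySpaces;
}

The cache blocking transpiler described in Section IV tries to move as many gates as possible into the first category.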
III. IMPLEMENTATION OF THE STATE VECTOR SIMULATOR

A. Aer Overview

Qiskit Aer [14] is an open source framework for quantum circuit simulations, and we implemented a parallel quantum computing simulator for high-performance computers as a backend of Qiskit Aer. Qiskit Aer is one of the components of Qiskit [4], which provides seamless access to quantum hardware and simulators with libraries to help the development of quantum software. In addition to ideal (noiseless) simulation, Qiskit Aer supports a noise model for noisy simulation, which allows users to develop quantum algorithms for realistic noisy environments. The frontend of Qiskit Aer translates Qiskit APIs to the basic matrix operations described in Section II for the backends. Depending on the characteristics and usage of the quantum circuits, users select one of the backends from the state vector, unitary matrix, density matrix, stabilizer state, superoperator, and matrix product state (MPS) simulators. Our backend extends the state vector simulator with GPU acceleration and support for distributed environments.

B. Accelerating Simulation by GPUs

Qiskit Aer provides a state vector simulator accelerated by GPUs, which is the basis of our backend for GPUs and MPI. Since Qiskit Aer is written in C++14 and the standard template library (STL), its GPU code uses the Thrust library [19], a template class library optimized for GPU acceleration. Thrust provides readability and productivity in addition to performance; its templates are more readable than CUDA [20] and can be compiled to GPU, OpenMP, and Intel Threading Building Blocks (TBB) [21] binaries for various host computers.

The Thrust library is based on a vector class that resembles the std::vector class of the STL. The library includes GPU-ready optimized operations on vectors such as copying, transforming, sorting, and reduction. The API supports simple lambda expressions that work in CPU and GPU environments. Listing 1 is an example of a customized kernel in Thrust that calculates y = a × x + y in double precision. Inside thrust::transform, the Thrust runtime automatically loads and stores the elements of the two vectors x_dev and y_dev while evaluating the user-defined binary operator daxpy(a) to update the corresponding elements of y_dev.

struct daxpy {
  double a;
  daxpy(double a_in) {a = a_in;}
  //user defined binary operation
  __host__ __device__
  double operator()(const double& x, const double& y) const
  {return a*x + y;}
};

thrust::host_vector<double> x(n), y(n);
thrust::device_vector<double> x_dev(n), y_dev(n);
//copy vectors from host to GPU
x_dev = x;
y_dev = y;
thrust::transform(thrust::device,
    x_dev.begin(), x_dev.end(), y_dev.begin(), y_dev.begin(), daxpy(a));
//copy back result from GPU to host
y = y_dev;

Listing 1. An example of multiply-accumulation written in Thrust

Implementing quantum gate operations is more complex than Listing 1. Unlike thrust::transform with the binary operator daxpy(a), two elements of a vector are accessed and updated non-sequentially. Thus, to load and store a pair of probability amplitudes, we use thrust::for_each and a unary operator. thrust::for_each simply calls the operator with an index number in a specified range, and the unary operator does not return a value. Listing 2 shows an implementation of a u3 gate using Thrust. In the operator() function, two elements of the vector pVec that stores the probability amplitudes are identified as pVec[i0] and pVec[i1] and updated by multiplying the unitary matrix (m00, m01, m10 and m11). We iterate this function over half the size of the vector.

struct u3_gate {
  thrust::complex<double>* pVec;
  thrust::complex<double> m00, m01, m10, m11;
  uint_t add;
  uint_t mask;
  u3_gate(thrust::complex<double>* pV,
          thrust::complex<double>* pM, int k) {
    pVec = pV;
    m00 = pM[0]; m10 = pM[1];
    m01 = pM[2]; m11 = pM[3];
    add = 1ull << k;
    mask = add - 1;
  }
  __host__ __device__
  void operator()(const uint_t &i) const {
    uint_t i0, i1;
    thrust::complex<double> q0, q1;
    //calculate addresses of the amplitude pair from the index
    i1 = i & mask;
    i0 = (i - i1) << 1;
    i0 += i1;
    i1 = i0 + add;
    q0 = pVec[i0];
    q1 = pVec[i1];
    pVec[i0] = m00 * q0 + m01 * q1;
    pVec[i1] = m10 * q0 + m11 * q1;
  }
};

thrust::device_vector<thrust::complex<double>> chunk(size);
thrust::complex<double>* pV;
auto ci = thrust::counting_iterator<uint_t>(0);
pV = thrust::raw_pointer_cast(chunk.data());
thrust::for_each(thrust::device, ci, ci + size/2,
                 u3_gate(pV, mat, k));

Listing 2. A kernel of a u3 gate written in Thrust
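Because Thrust algorithms dispatch on an execution policy, the same functors can also run on the host, which is the basis of the portability mentioned above. The snippet below is an illustrative sketch (not taken from Qiskit Aer) that runs the daxpy functor of Listing 1 through the host backend; selecting the OpenMP or TBB device backend at build time (for example via the THRUST_DEVICE_SYSTEM configuration macro) follows the same pattern.

#include <thrust/execution_policy.h>
#include <thrust/host_vector.h>
#include <thrust/transform.h>

// Reuses the daxpy functor defined in Listing 1.
void daxpy_on_host(double a,
                   const thrust::host_vector<double>& x,
                   thrust::host_vector<double>& y) {
  // Same algorithm call as in Listing 1, but with the host execution policy,
  // so no GPU and no host-device copies are involved.
  thrust::transform(thrust::host,
                    x.begin(), x.end(), y.begin(), y.begin(), daxpy(a));
}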

IV. CACHE BLOCKING TECHNIQUE FOR THE PARALLEL STATE VECTOR SIMULATOR

A. Transpiling Circuits for Cache Blocking

Since the performance of the state vector simulator is bounded by memory bandwidth, reducing data movements is essential to shorten simulation time. In hybrid parallel computing, two types of memory spaces exist: CPU and GPU. Therefore, we list five types of data movements to be minimized for simulator performance.
• Loading data from the memory of a CPU
• Copying data from the memory of a CPU to the memory of a GPU
• Loading data from the memory of a GPU
• Copying data from one GPU to another GPU
• Copying data to other nodes

Cache blocking is a well-known technique in high-performance computing for avoiding repeated fetches of data from main memory into CPU caches. We extend this technique to avoid the frequent data movements listed above by reusing data stored in local fast memory as long as possible. If consecutive gates operate on qubits smaller than the number of chunk qubits nc, no data movements occur while simulating them. We therefore transform (transpile) a circuit to have many long sequences of such consecutive gates.

If some qubits are not updated by any gates, we can use qubit reordering, a transpilation technique normally used to map gates onto actual quantum computing hardware. We map such qubits to qubits larger than or equal to nc and the other qubits to qubits smaller than nc in the simulation. Though this technique can reduce data movements, we cannot remove all of them if a circuit updates qubits nc or larger.

To further reduce data movements, we use another hardware mapping technique that inserts swap gates to move some gates to other qubits. Fig. 6 shows an example of a circuit in which all the qubits are updated. We assume that gates on qubits larger than or equal to nc require data movements between chunks. Similar to the cache memory on a classical computer, a chunk of nc qubits on a quantum computer can be cached as shown in Fig. 7. The qubits in excess of nc can be assigned to slower memory, and by applying a swap gate to a pair of qubits, one qubit is fetched into the cache and the other is swapped out at the same time (Fig. 8). Then, gate operations on the fetched qubit can be performed in the cache.

Fig. 6. Example of an input quantum circuit consisting of u1, u3 and CNOT gates. nc denotes the number of qubits of a chunk. The gates on qubits ≥ nc must refer to probability amplitudes over multiple chunks to be simulated.

Fig. 7. Memory hierarchy on a classical computer. Cache memory is placed near the CPU core; it is very fast, but its size is limited. In this example, only one chunk can be stored in the cache, so if we fetch a chunk from main memory, the older chunk has to be swapped out.

Fig. 8. Cache memory analogy for a quantum circuit on the state vector simulator. Gates on the smaller qubits (smaller than nc) can be calculated without moving data, so this memory acts like the fast cache memory of a classical computer. A swap gate can be used to fetch and swap out a pair of qubits.

Before simulating a given quantum circuit, we apply a transpiler to move all the gates inside a chunk by inserting swap gates. Fig. 9 shows the circuit transpiled from the input circuit shown in Fig. 6.

Fig. 9. Example of a cache blocked circuit. Four swap gates are added to move all the gates to qubits lower than nc; now all the gates can be performed without referring to probability amplitudes across chunks.

First, we collect the gates that can be executed in a chunk by reordering them. The execution order of two gates can be swapped if each gate is applied to different qubits, because such gates can be executed independently on a real quantum computer. However, two gates cannot be reordered if they share qubits, which is always the case within a multi-qubit gate such as CNOT. As well, all the qubits used in a multi-qubit gate must be placed in a chunk at the same time. Thus, multi-qubit gates are used to choose which qubits are to be stored in a chunk.

Listing 3 describes the algorithm of the cache blocking transpiler. First, it searches for the multi-qubit gates in the sequence of gates and chooses nc − 1 qubits that are used in these multi-qubit gates. Then, noiseless swap gates are inserted to put the chosen qubits in a chunk. These are special swap gates, named chunk_swap, that are different from the usual swap gate. Then, the gates are reordered by emitting first those which can be executed in a chunk. If there is a multi-qubit gate laid across the chunk boundary, this gate prevents the remaining gates on the same qubits from being executed; so we mark these qubits as blocked, and gates with blocked qubits are stored in a queue. We repeat this procedure until no gates remain.

put sequence of gates in queue REMAINING
while REMAINING is not empty
  choose nc-1 qubits QB_CHUNK to store in chunk
  put noiseless chunk_swap gates in queue OUTPUT
  clear QB_BLOCKED
  for gates in REMAINING
    if gate's qubits are in QB_CHUNK and QB_BLOCKED is not set
      put gate in queue OUTPUT
    else
      if multi-qubit gate and some qubits are in QB_CHUNK
        for gate qubits
          set QB_BLOCKED
      put gate in queue REMAINING_NEXT
  copy REMAINING_NEXT to REMAINING

Listing 3. Transpilation for cache blocking

We also insert special gates to mark a section of gates to be blocked in a chunk. These begin_blocking and end_blocking gates delimit a section, and the gates in the section can be simulated independently in each chunk.

The algorithm we propose in this section is a heuristic and not an optimal solution, because we apply the cache blocking transpilation on the fly before the actual simulation, so it should not take a long time to find the best transpilation. If the circuit can be transpiled outside of the simulation, more optimal algorithms could be used to decrease chunk swaps.
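To make the flow of Listing 3 concrete, here is a simplified C++ sketch of a single blocking pass. It is an illustration under our own assumptions (a Gate is reduced to its qubit list, and the chunk_swap emission is left as a comment); it is not the Qiskit Aer implementation.

#include <algorithm>
#include <deque>
#include <set>
#include <vector>

struct Gate {
  std::vector<int> qubits;   // qubits the gate acts on (1 or 2 entries here)
};

// One pass of the blocking loop in Listing 3: choose up to nc-1 chunk qubits from
// multi-qubit gates, emit every gate that fits in the chunk, defer the rest.
void blocking_pass(std::deque<Gate>& remaining, std::vector<Gate>& output, int nc) {
  // 1. Choose the qubits to keep inside the chunk, taken from multi-qubit gates.
  std::set<int> chunk_qubits;
  for (const Gate& g : remaining) {
    if (g.qubits.size() < 2) continue;
    for (int q : g.qubits)
      if ((int)chunk_qubits.size() < nc - 1) chunk_qubits.insert(q);
  }
  // (Here the transpiler emits the noiseless chunk_swap gates that bring
  //  chunk_qubits into the qubit positions below nc.)

  // 2. Emit the gates whose qubits are all in the chunk and not blocked.
  std::set<int> blocked;
  std::deque<Gate> next;
  for (const Gate& g : remaining) {
    const bool in_chunk = std::all_of(g.qubits.begin(), g.qubits.end(),
        [&](int q) { return chunk_qubits.count(q) != 0; });
    const bool is_blocked = std::any_of(g.qubits.begin(), g.qubits.end(),
        [&](int q) { return blocked.count(q) != 0; });
    if (in_chunk && !is_blocked) {
      output.push_back(g);
    } else {
      // A deferred multi-qubit gate touching the chunk blocks its qubits so that
      // later gates on those qubits keep their relative order.
      const bool touches_chunk = std::any_of(g.qubits.begin(), g.qubits.end(),
          [&](int q) { return chunk_qubits.count(q) != 0; });
      if (g.qubits.size() >= 2 && touches_chunk)
        for (int q : g.qubits) blocked.insert(q);
      next.push_back(g);
    }
  }
  remaining.swap(next);
}

The full transpiler of Listing 3 simply repeats this pass, wrapping each emitted section in begin_blocking and end_blocking markers, until no gates remain.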

B. Data Structure of Parallel Qiskit Aer

As described in Section III, Qiskit Aer has several simulation methods, and these methods are defined in classes with a quantum register that stores the quantum state. One execution of a quantum circuit, called a "shot", is controlled in the State class, which has one qubit register class for the corresponding simulation method, as shown in Fig. 10.

Fig. 10. Data structure of the State class in serial Qiskit Aer. A State has one qubit register class (state_t qreg_) that holds the quantum state of the corresponding simulation method (Statevector, Unitary, Density_matrix, Stabilizer, Superoperator, or Matrix_product_state).

To parallelize a simulation, we need to hold multiple chunks in a state class; hence, we extend the State class to the State_chunk class, which has multiple chunks in an array, as shown in Fig. 11. Each chunk can be implemented on the basis of the existing qubit register classes with small changes, because we do not have to communicate between chunks when simulating gates other than the swap gates inserted by the cache blocking transpiler. Each qubit register class does not have to know where it is in the whole n-qubit state; it only has to be simulated as if the state had nc qubits.

Fig. 11. Extension of the state class to multiple chunks for parallel Qiskit Aer (State_chunk with std::vector<state_t> qregs_). A chunk is defined on the basis of the usual qubit register classes.
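The structural change from Fig. 10 to Fig. 11 can be summarized with the simplified sketch below; the class and member names mirror the figures, but this is only an illustration of the data layout, not the Qiskit Aer source.

#include <vector>

// Serial design (Fig. 10): one State owns a single qubit register of the
// chosen simulation method (state vector, density matrix, ...).
template <class state_t>
class State {
 protected:
  state_t qreg_;                 // e.g. a state vector of 2^n amplitudes
};

// Parallel design (Fig. 11): State_chunk owns an array of chunks; each chunk is an
// ordinary qubit register that behaves as if it simulated only nc qubits.
template <class state_t>
class State_chunk {
 protected:
  std::vector<state_t> qregs_;   // chunks, distributed over GPUs and CPU memory
  int chunk_qubits_;             // nc: number of qubits per chunk
};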
C. Parallel Implementation

The State_chunk class manages multiple chunks and controls the simulation by handling a sequence of gates. There are 2^(n−nc) chunks in total to simulate an n-qubit circuit, and if we have p processes, we distribute the chunks to all of these processes. We assign continuous chunk indices to the chunks, and the beginning address of the probability amplitudes of each chunk is obtained from chunk_index × 2^nc. To initialize a state to |0⟩, we only initialize the chunk with index 0 to |0⟩ and set all other chunks to zero.

Listing 4 describes how we simulate circuits with multiple chunks in the State_chunk class. We scan the three special gates, begin_blocking, end_blocking and chunk_swap, inserted by the cache blocking transpiler. The sequence of gates placed between begin_blocking and end_blocking can be simulated independently in each chunk, so we can parallelize the loop over chunks by mapping it to multiple GPUs or to multiple threads on a CPU.

for gate in sequence of gates
  if gate is begin_blocking
    for all chunks
      fetch chunk on GPU if stored on CPU
      for gate in blocking section
        apply gate
      swap out chunk to CPU
  else if gate is chunk_swap
    for all pairs of chunks
      apply swap
  else
    for all chunks
      apply gate

Listing 4. Update of multiple chunks

We use the memory of the GPUs as a cache to simulate a sequence of gates in the blocking section when some of the chunks are stored in the memory of a CPU. Since the performance of gate simulation is bounded by memory bandwidth, the high memory bandwidth of GPUs is an advantage. However, the bandwidth between the CPU and the GPUs is much narrower than the memory bandwidth of the CPU memory, so the performance is much worse than that of a CPU if a chunk is transferred to simulate only one gate on a GPU. By using this cache blocking technique, we can take advantage of the GPU's high memory bandwidth if there are enough gates in the blocking section.
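The blocking-section branch of Listing 4 is a parallel loop over chunks. The following OpenMP sketch illustrates it under simplified assumptions (the Chunk and Gate types and all helper names are ours; the fetch/evict helpers are placeholders standing in for host-device transfers):

#include <cstdint>
#include <vector>

struct Gate { std::vector<int> qubits; /* gate matrix omitted */ };

// Minimal view of a chunk: 2^nc amplitudes identified by a continuous chunk index.
struct Chunk {
  std::uint64_t index;
  void fetch_to_gpu() { /* placeholder: copy amplitudes to GPU memory if on CPU */ }
  void apply_gate(const Gate&) { /* placeholder: local update of 2^nc amplitudes */ }
  void evict_to_cpu() { /* placeholder: copy back if the chunk lives on the CPU */ }
};

// Beginning offset of a chunk's amplitudes in the whole n-qubit state vector.
inline std::uint64_t chunk_base(std::uint64_t chunk_index, int nc) {
  return chunk_index << nc;   // chunk_index * 2^nc
}

// Simulate the gates between begin_blocking and end_blocking: every chunk is
// independent, so the loop over chunks can be spread over GPUs or CPU threads.
void run_blocking_section(std::vector<Chunk>& chunks, const std::vector<Gate>& section) {
  #pragma omp parallel for schedule(dynamic)
  for (long c = 0; c < (long)chunks.size(); ++c) {
    chunks[c].fetch_to_gpu();     // use GPU memory as a cache for CPU-resident chunks
    for (const Gate& g : section)
      chunks[c].apply_gate(g);    // gates in the section act only on qubits < nc
    chunks[c].evict_to_cpu();
  }
}

Because every gate in the section touches only qubits below nc, each chunk is transferred at most once per blocking section, which is what makes the GPU's high memory bandwidth pay off.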

A chunk_swap gate is applied to a pair of chunks. As in the case of the usual swap gate, the chunk_swap gate takes two qubits sq0 and sq1 (where sq0 < sq1) to be swapped, and one or both of the target qubits are larger than or equal to the chunk qubit count nc. Fig. 12 shows the two cases of chunk pairing: (a) is when sq0 < nc; here, a pair of chunks is selected and half of the amplitudes are swapped between the chunks. (b) is when sq0 ≥ nc; here, four chunks are selected and all the amplitudes of two of them, chunk 1 and chunk 2, are swapped. In both cases, we swap amplitudes between two chunks, so we select two chunks by referring to sq0 and sq1 and apply chunk_swap. Also, if the selected chunks are on different processes, we have to exchange the chunks by using MPI communication. We send a chunk to the other process that holds the pairing chunk and receive a chunk from the same process. We replace the amplitudes of the local chunk with those of the received chunk, but we do not send the chunk back to the pairing process.

Fig. 12. Two cases of chunk_swap application: (a) sq0 < nc ≤ sq1, where half of the amplitudes in the two chunks are swapped; (b) nc ≤ sq0 < sq1, where two chunks out of four combined chunks are swapped. The latter case only occurs when we reorder qubits for the output.

We added a chunk_swap function to the qubit register classes; it takes a reference to the pairing qubit register class as a parameter to swap amplitudes between the chunks selected in the State_chunk class. The implementation of this function depends on the simulation method; the paragraphs below describe the implementation of the chunk_swap function for the state vector simulator.

One chunk of 2^nc amplitudes is stored in the qubit register class of the state vector simulator. If GPUs are installed on the computer, we try to store the chunks in the memory of one of the GPUs. If the chunks cannot be stored in the memory of one GPU, we try to store them in two or more GPUs. If the memories of all the GPUs become full, we store the remaining chunks in the main memory of the CPU, as shown in Fig. 13. To swap chunks stored in different memory spaces, i.e. chunks on different GPUs or chunks on a GPU and chunks on a CPU, we have to copy chunks into a buffer so that the amplitudes are swapped within the same memory space. As a result, we have to allocate buffer chunks before storing chunks in memory.

Fig. 13. Chunk allocation in a process with multiple GPUs. If we cannot store all the chunks in the memory of one GPU, we distribute the chunks to multiple GPUs. If some of the chunks cannot be stored in the GPUs, we store them in the memory of the CPU.

Since the raw data of a chunk depends on the simulation method, the qubit register classes prepare the chunk data to be sent to the pairing process by using MPI_Send. A buffer to receive a chunk from the pairing process by using MPI_Recv is also prepared by the qubit register classes. For the state vector simulator, the chunk itself is used as the send buffer and a buffer allocated in the same memory space is used as the receive buffer.
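When the two chunks selected for a chunk_swap live on different processes, the exchange described above is a symmetric send/receive with the pairing rank. The sketch below uses a combined MPI_Sendrecv instead of the separate MPI_Send/MPI_Recv mentioned above; it is illustrative only (buffer management and the pairing logic are simplified, and the function name is ours):

#include <mpi.h>
#include <complex>
#include <vector>

// Exchange one state vector chunk with the process that owns the pairing chunk.
// Afterwards `chunk` holds the remote amplitudes; nothing is sent back later because
// the pairing process performs the same exchange on its side.
void exchange_chunk(std::vector<std::complex<double>>& chunk,
                    std::vector<std::complex<double>>& recv_buffer,
                    int pair_rank, int tag, MPI_Comm comm) {
  // A std::complex<double> is transferred as two doubles.
  MPI_Sendrecv(chunk.data(), 2 * (int)chunk.size(), MPI_DOUBLE, pair_rank, tag,
               recv_buffer.data(), 2 * (int)recv_buffer.size(), MPI_DOUBLE, pair_rank, tag,
               comm, MPI_STATUS_IGNORE);
  chunk.swap(recv_buffer);   // replace the local amplitudes with the received chunk
}

Using a combined send/receive avoids the deadlock that two blocking sends issued from both sides of the pair could otherwise cause.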
V. PERFORMANCE EVALUATION

A. Computer Environment

We compared the simulation times of the state vector simulator implementing the cache blocking technique with those of the simulator not implementing the technique on a GPU cluster (see Table I) [22], [23]. Each node of the cluster had six GPUs, and each GPU had 16GB of memory, so we could simulate up to 29 qubits on a single GPU in double precision. The main memory was 512GB, on which we could simulate up to 34 qubits. Moreover, by storing chunks in the memories of both the GPUs and the CPU as shown in Fig. 13, we could simulate up to 35 qubits on a single node.

TABLE I
COMPUTER ENVIRONMENT USED IN THE EVALUATION OF THE STATE VECTOR SIMULATOR
Cluster node: IBM Power System AC922
CPU: POWER9
Number of sockets per node: 2
Number of cores per socket: 21
CPU memory size: 512GB
GPU: NVIDIA Tesla V100
Number of GPUs per node: 6
GPU memory size: 16GB
CPU - GPU interconnect: NVLink2
Interconnect: Infiniband EDR
OS: Red Hat Enterprise Linux Server 7.6
Compiler: GCC 8.3.0
CUDA Toolkit: CUDA 10.1
MPI: IBM Spectrum MPI 10.3.1

The base implementation without the cache blocking technique is described in [11]; there, chunks are exchanged whenever the target qubits of the gates are larger than or equal to the chunk qubit count. The base implementation was built as the set of State and qubit register classes shown in Fig. 10, so parallelization was managed inside the qubit vector class. We implemented the kernels of the gates by using Thrust in both implementations.

For the evaluation we prepared two quantum circuits. One of the circuits is Quantum Volume [24], a randomly generated quantum circuit that uses all the given qubits, with all the qubits entangled by CNOT gates. The Quantum Volume circuit itself does not have a mathematical meaning but is useful for evaluating quantum systems. The other circuit is the Quantum Fourier Transform (QFT), which is one of the most important quantum algorithms and occurs in many quantum applications. As in Quantum Volume, there are entanglements over all the qubits, but the other calculations are multiplications of diagonal unitary matrices. On the state vector simulator, diagonal matrix multiplications can be calculated independently on each probability amplitude, so the data exchange between chunks for QFT is much smaller than that of Quantum Volume.

B. Single Node Performance Comparison

Fig. 14 and Fig. 15 show the measured simulation times of the Quantum Volume and QFT circuits on a single node of an IBM Power System AC922 (Table I) equipped with six NVIDIA Tesla V100 GPUs. They also show the time on a CPU for comparison. While the simulation time on the CPU forms a smooth line, we can see steps in the simulation times of the GPU implementations (baseline and w/cache blocking).

Fig. 14. Single node simulation time of Quantum Volume (depth=10) measured on an IBM Power System AC922 with 6 GPUs. Only one GPU is used for ≤ 29 qubits, all data fits in the memory of six GPUs for ≤ 32 qubits, and both GPU and CPU memory are used for ≥ 33 qubits.

Fig. 15. Single node simulation time of QFT measured on an IBM Power System AC922 with 6 GPUs. Only one GPU is used for ≤ 29 qubits, all data fits in the memory of six GPUs for ≤ 32 qubits, and both GPU and CPU memory are used for ≥ 33 qubits.

The first step, from 25 to 29 qubits, shows the simulation time calculated by one GPU; there is no data exchange between GPUs, so the simulation time with cache blocking is not plotted because it is similar to the baseline. For 28 and 29 qubits in Fig. 15, we also plotted the simulation time using multiple GPUs with cache blocking. We got better performance than with one GPU even though chunks have to be exchanged between GPUs; because fewer data exchanges occur, the calculation time saved by parallelizing over multiple GPUs outweighs the overhead of data exchange.

The second step, from 30 to 32 qubits, shows the simulation times when using six GPUs. Here, the baseline performance deteriorates for Quantum Volume because of data exchange overheads; with cache blocking, on the other hand, the overheads of data exchange decrease and we measured better performance. The performance of QFT on the second step shows different behavior: because less data exchange occurs for QFT, the performance of the baseline did not deteriorate, so cache blocking gives only a small advantage here.

The third step, from 33 to 35 qubits, shows the simulation times when chunks are stored in the memories of the six GPUs and the CPU. Since the CPU has 512GB of memory, we can simulate up to 34 qubits on it alone, but by storing chunks on both the GPUs and the CPU we can simulate 35 qubits. The cache blocking technique reduces the data exchange overhead, offering a speedup of about 20% over the baseline for Quantum Volume and 80% for QFT. In the third step, the baseline implementation calculates chunks on the CPU when they are stored in the memory of the CPU; the cache blocking implementation, on the other hand, transfers chunks to the GPUs to calculate the sequence of gates, which also leads to a performance gain.
C. Scalability Comparison

Next, we evaluated our cache blocking implementation on multiple nodes of an IBM Power System AC922 cluster with up to sixteen nodes. Fig. 16 and Fig. 17 show the measured simulation times of two Quantum Volume circuits. Fig. 16 shows a comparison of 30-qubit Quantum Volume calculated on two to sixteen nodes with only one GPU per node. Because the required calculation is relatively small, the simulation time is largely affected by data communications between nodes. When we add nodes, the simulation time becomes worse in both implementations, but the cache blocking implementation shows a smaller performance degradation. Fig. 17 compares the times of a 34-qubit Quantum Volume, for which chunks are stored on both the GPUs and the CPU in the one- and two-node cases. Because this circuit involves enough calculation, the effect of cache blocking is relatively larger, and the simulation time with cache blocking is significantly shorter than that of the baseline.

Fig. 16. Simulation time of 30-qubit Quantum Volume (depth=10) on IBM Power System AC922 with 6 GPUs (strong scaling). For 1 node we used all 6 GPUs, but only 1 GPU per node for the larger node counts.

Fig. 17. Simulation time of 34-qubit Quantum Volume (depth=10) on IBM Power System AC922 with 6 GPUs (strong scaling). We used 6 GPUs per node in all cases, and for 1 node and 2 nodes we also stored chunks in the memory of the CPU.

Fig. 18. Simulation time of Quantum Volume (depth=10) on IBM Power System AC922 with 6 GPUs, where each node holds 2^30 amplitudes (weak scaling), i.e., 30 qubits for 1 node, 31 qubits for 2 nodes, and so on.
We also tested weak scalability by fixing the per-node data size. Fig. 18 and Fig. 19 show weak scaling on up to sixteen nodes of the IBM Power System AC922 cluster. In Fig. 18, the per-node array size is fixed to 2^30 amplitudes and six GPUs are used per node. This size is relatively small, meaning that the performance is significantly affected by the data transfer time between nodes; the baseline implementation shows a large scalability degradation, whereas the cache blocking implementation shows better scalability. In Fig. 19, there are 2^33 amplitudes per node, which requires both the GPUs and the CPU to store. In this case as well, the cache blocking implementation shows very good scalability.

Fig. 19. Simulation time of Quantum Volume (depth=10) on IBM Power System AC922 with 6 GPUs, where each node holds 2^33 amplitudes (weak scaling), i.e., 33 qubits for 1 node, 34 qubits for 2 nodes, and so on.

VI. SUMMARY

We proposed optimization techniques for parallel quantum computing simulations. We decreased data exchanges by moving all the gates of the circuits to lower qubits by inserting noiseless swap gates. This resembles cache blocking on classical computers. Also, by combining chunk-based parallelization with cache blocking, we found that we can parallelize simulators with small changes, because we can reuse the serial simulation code except for the swap gates between chunks. This contributes to the productivity of the simulation software. We implemented a parallel state vector simulator in the open source quantum computing simulation framework Qiskit Aer. We used the Thrust library instead of writing CUDA kernel code to accelerate simulation on the GPUs. Since Qiskit Aer is written in C++ and based on the STL, Thrust extends Qiskit Aer to support GPUs naturally, and it contributes to productivity and portability. We also parallelized the state vector simulator by using MPI for hybrid parallel computers.
By applying the parallelization techniques and the cache blocking technique described in this paper, we improved the scalability and simulation performance on a hybrid parallel computer. We improved both strong and weak scaling compared with previous implementations without cache blocking. The main contribution to the improvement is the decrease in data exchanges between chunks, especially those on different nodes of a distributed parallel computer. Moreover, by storing chunks in both the memory of the GPUs and the memory of the CPU, we can use the memory of the GPUs as a cache memory to accelerate simulations using the chunks stored on the CPU.

We are planning to make this implementation public in the Qiskit Aer package. We are also planning to parallelize the rest of the simulation methods in Qiskit Aer, and we will evaluate them on hybrid parallel computers. To increase the number of qubits that can be simulated, we can add slower storage such as NVMe to the nodes, or network storage connected to the parallel cluster. The cache blocking technique can also improve performance in that setting by fetching chunks from the slower storage to the faster local storage, performing operations there, and using blocking to decrease data movements from the slower storage.

REFERENCES

[1] A. Aspuru-Guzik, A. D. Dutoi, P. J. Love, and M. Head-Gordon, "Simulated quantum computation of molecular energies," Science, vol. 309, no. 5741, pp. 1704–1707, 2005.
[2] A. Kandala, A. Mezzacapo, K. Temme, M. Takita, M. Brink, J. M. Chow, and J. M. Gambetta, "Hardware-efficient variational quantum eigensolver for small molecules and quantum magnets," Nature, vol. 549, no. 7671, p. 242, 2017.
[3] P. O'Malley, R. Babbush, I. D. Kivlichan, J. Romero, J. R. McClean, R. Barends, J. Kelly, P. Roushan, A. Tranter, N. Ding et al., "Scalable quantum simulation of molecular energies," Physical Review X, vol. 6, no. 3, p. 031007, 2016.
[4] Qiskit. (2017, May) Qiskit: An open-source framework for quantum computing. [Online]. Available: https://qiskit.org/
[5] Microsoft Corporation. (2018, May) The Q# programming language. [Online]. Available: https://www.microsoft.com/en-us/quantum/development-kit
[6] D. S. Steiger, T. Häner, and M. Troyer, "ProjectQ: An open source software framework for quantum computing," 2016.
[7] Z. Li and J. Yuan, "Quantum computer simulation on GPU cluster incorporating data locality," in International Conference on Cloud Computing and Security. Springer, 2017, pp. 85–97.
[8] T. Häner and D. S. Steiger, "0.5 petabyte simulation of a 45-qubit quantum circuit," in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. ACM, 2017, p. 33.
[9] T. Jones, A. Brown, I. Bush, and S. Benjamin, "QuEST and high performance simulation of quantum computers," arXiv preprint arXiv:1802.08032, 2018.
[10] M. Smelyanskiy, N. P. Sawaya, and A. Aspuru-Guzik, "qHiPSTER: the quantum high performance software testing environment," arXiv preprint arXiv:1601.07195, 2016.
[11] J. Doi, H. Takahashi, R. Raymond, T. Imamichi, and H. Horii, "Quantum computing simulator on a heterogenous HPC system," in Proceedings of the 16th ACM International Conference on Computing Frontiers, ser. CF '19. New York, NY, USA: Association for Computing Machinery, 2019, pp. 85–93. [Online]. Available: https://doi.org/10.1145/3310273.3323053
[12] T. Yan. (2019) Development of fast CPU and GPU based quantum computer simulator for . [Online]. Available: https://t.co/biMOsSh6fk?amp=1
[13] T. Jones, A. Brown, I. Bush, and S. Benjamin, "QuEST and high performance simulation of quantum computers," Scientific Reports, vol. 9, 02 2018.
[14] Qiskit. (2019, Jan.) Qiskit Aer: A high performance simulator framework for quantum circuits. [Online]. Available: https://qiskit.org/aer
[15] H. De Raedt, D. Willsch, M. Nocon, N. Yoshioka, N. Ito, S. Yuan, and K. Michielsen, "Massively parallel quantum computer simulator, eleven years later," Computer Physics Communications, vol. 237, 11 2018.
[16] X.-C. Wu, S. Di, E. M. Dasgupta, F. Cappello, H. Finkel, Y. Alexeev, and F. T. Chong, "Full-state quantum circuit simulation by using data compression," in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, ser. SC '19. New York, NY, USA: Association for Computing Machinery, 2019. [Online]. Available: https://doi.org/10.1145/3295500.3356155
[17] B. Villalonga, D. Lyakh, S. Boixo, H. Neven, T. Humble, R. Biswas, E. Rieffel, A. Ho, and S. Mandrà, "Establishing the quantum supremacy frontier with a 281 Pflop/s simulation," Quantum Science and Technology, 03 2020.
[18] A. W. Cross, L. S. Bishop, J. A. Smolin, and J. M. Gambetta, "Open quantum assembly language," arXiv preprint arXiv:1707.03429, 2017.
[19] (2019, Jul.) Thrust - parallel algorithms library. [Online]. Available: https://thrust.github.io
[20] NVIDIA Corporation. (2007, Jul.) CUDA Toolkit. [Online]. Available: https://developer.nvidia.com/cuda-toolkit
[21] Intel Corporation. Intel Threading Building Blocks. [Online]. Available: https://software.intel.com/en-us/tbb
[22] S. Vetter, R. Nohria, and G. Santos, "IBM Power System AC922 technical overview and introduction," An IBM Redpaper publication, 7 2018.
[23] NVIDIA Corporation. (2018, Jun.) NVIDIA Tesla V100. [Online]. Available: https://www.nvidia.com/en-us/data-center/tesla-v100/
[24] A. W. Cross, L. S. Bishop, S. Sheldon, P. D. Nation, and J. M. Gambetta, "Validating quantum computers using randomized model circuits," Phys. Rev. A, vol. 100, p. 032328, Sep 2019. [Online]. Available: https://link.aps.org/doi/10.1103/PhysRevA.100.032328