Stim: a Fast Stabilizer Circuit Simulator Craig Gidney
Total Page:16
File Type:pdf, Size:1020Kb
Stim: a fast stabilizer circuit simulator Craig Gidney Google Inc., Santa Barbara, California 93117, USA June 21, 2021 This paper presents “Stim", a fast simulator for quantum stabilizer circuits. The paper explains how Stim works and compares it to existing tools. With no foreknowledge, Stim can analyze a distance 100 surface code circuit (20 thousand qubits, 8 million gates, 1 million measurements) in 15 seconds and then begin sampling full circuit shots at a rate of 1 kHz. Stim uses a stabilizer tableau representation, similar to Aaronson and Gottesman’s CHP simulator, but with three main improvements. First, Stim improves the asymptotic complexity of deterministic measurement from quadratic to linear by tracking the inverse of the circuit’s stabilizer tableau. Second, Stim improves the constant factors of the algorithm by using a cache-friendly data layout and 256 bit wide SIMD instructions. Third, Stim only uses expensive stabilizer tableau simulation to create an initial reference sample. Further samples are collected in bulk by using that sample as a reference for batches of Pauli frames propagating through the circuit. Readers who want to try Stim can find its source code in this paper’s ancillary files or on github at https: // github. com/ quantumlib/ stim . Stim is also available as a Python 3 pypi package installed via “pip install stim". There is also a pypi package “stimcirq" that exposes stim as a cirq sampler. 1 Introduction In the field of human computer interaction, an important metric is the delay between human action and computer reaction. Quoting Jakob Nielsen [25, 16]: • 0.1 seconds is about the limit for having the user feel that the system is reacting instantaneously [...] • 1.0 second is about the limit for the user’s flow of thought to stay uninterrupted [...] • 10 seconds is about the limit for keeping the user’s attention [...] arXiv:2103.02202v3 [quant-ph] 18 Jun 2021 I bring this up because the underlying motivation driving this paper is that I want to tinker with huge stabilizer circuits, and I want to stand in the 0.1s bucket while I do it. For example, consider the single level 15-to-1 surface code T state factory shown in figure 4 of [11]. It corresponds to a circuit with millions of gates spanning 15 thousand qubits. What would it take to simulate that in 0.1 seconds, with no built-in preconceptions about the surface code, while a user makes iterative changes that break and restore the functionality of the circuit? There are several obstacles that make this difficult. First, if a circuit has a million gates, and your computer’s clock speed is 2 gigahertz, then the simulator has a budget of 200 clock cycles per gate. That’s a tight budget. 1 2 Second, describing an arbitrary n-qubit stabilizer state takes at least 2 n bits [14, 22, 18]. This means that, in the worst case, a general stabilizer simulator that works in an online fashion will Craig Gidney: [email protected] Accepted in Quantum 2021-06-16, click title to verify 1 Figure 1: Delay until one thousandth sample on d × d × d surface code memory circuits [15]. The intention of this benchmark is to highlight simulators that can perform some sort of compilation or analysis that lowers the cost of taking multiple samples. Stim does particularly well on this benchmark, with timings almost identical to its time-to-first-sample. This is because it uses its first sample as a reference for batches of hundreds or thousands of Pauli frame simulations that are updated in parallel using SIMD instructions. The per-sample performance of Qiskit’s simulator also improves when collecting more than one sample. Accepted in Quantum 2021-06-16, click title to verify 2 need megabytes of information to track the state of a 15000 qubit system. Storing this amount of information is not a problem. The problem is repeatedly touching that information without blowing the budget on cache misses. Third, typically, simulating a Clifford gate has complexity Ω(n) where n is the number of qubits in the circuit. Given the aforementioned cycle budget, the constant factor hiding behind that Ω has to be roughly 0.01 cycles. Fourth, the surface code is chock full of measurements and, basically, historically, the cost of measurements in stabilizer simulators started at Θ(n3) [13], improved to Θ(n2) [1], and then stayed there. Aaronson and Gottesman’s [1] CHP simulator? Measurement takes Θ(n2) time in the worst case. Anders and Briegel’s [3] graph state simulator? Measurement takes Θ(n2) time in the worst case. Bravyi et al’s [6] CH form simulator? Measurement takes Θ(n2) time in the worst case. We can’t afford Θ(n2) time. Fifth, the circuit I’ve been talking about isn’t even technically a stabilizer circuit. It contains magic state injections. This obstacle has actually been the subject of a lot of recent work [6,7, 20, 19]. I’ll be completely ignoring it in this paper. Sixth, when simulating an error correcting circuit, it’s common to sample the circuit millions or even trillions of times. For example, going to large code distances can allow a lone half distance error mechanism to shine through the background haze of full distance errors. But logical errors become rare at large code distances, and so many samples have to be taken to see them. Based on discussions I’ve had with co-workers who have done these sorts of large scale simulations [12], the main bottleneck is not decoding the errors but rather (surprisingly) generating the samples. So it’s not just the latency until the first sample that matters; the sustained bulk sampling rate is also important. With all those obstacles listed I should note that, of course, by using hardcoded knowledge about magic state distillation and the surface code, it’s possible to make a specialized simulator that could solve my example case much faster than a general stabilizer circuit simulator. But my underlying motivation is one of exploration, and I would strongly prefer to create a tool whose performance and correctness doesn’t require understanding the behavior of places I haven’t been yet. So, in this paper, I present my failed attempt to reach my ridiculous goal. The result is a software tool I’ve named "Stim". Although Stim can’t simulate a 15 thousand qubit surface code circuit with millions of gates in under a tenth of a second, it can at least get that job done in roughly ten seconds. Then, with that initial sample acquired, it can begin producing batches of hundreds of samples in a tenth of the time. Stim makes three key improvements over previous stabilizer simulators. These improvements ad- dress obstacle 4 (the measurement obstacle), obstacle 3 (the constant factor obstacle), and obstacle 6 (the mega sampling obstacle). The measurement obstacle is addressed by making an algorithmic improvement that decreases the time complexity of deterministic measurements from quadratic to linear. This is beneficial in contexts, such as the surface code, where a circuit has many measurements but those measurements are almost entirely deterministic. The constant factor obstacle is addressed by using SIMD (Same Instruction Multiple Data) in- structions in tight loops iterating over contiguous memory. Using SIMD is not a novel concept (e.g. Aaronson et al’s CHP simulator [1] packs the bits from Pauli terms into bytes, which allows 8 to be operated on by each instruction) but it’s technically challenging to implement well and Stim does it at larger register sizes than previous work. Specifically, Stim uses 256 bit wide Advanced Vector Exten- sions (AVX) [30] and also I spent a lot of effort minimizing the number of operations in key loops. For example, a core operation for most gates is multiplying together strings of Pauli terms, which requires accumulating scalar factors from individual Pauli multiplications. I found a way to do this using only 11 bitwise operations per group of Paulis (see Figure 2). Modern Intel CPUs can apply 3 bitwise AVX operations per clock cycle [21], and 256 is much larger than 11, suggesting that a budget of 0.01 cycles per qubit during a Clifford gate isn’t quite as ridiculous as it seems. The mega sampling obstacle is addressed by including a Pauli frame simulator that uses SIMD instructions to operate on the state of hundreds of simulations in parallel. This amortizes the cost of cache misses and branch mispredictions and other performance killers. This is a simple enough idea, Accepted in Quantum 2021-06-16, click title to verify 3 but it gives Stim an enormous advantage over previous work when bulk sampling (see Figure 1). This paper is divided into sections. In Section 2, I explain key background concepts that are needed in order to understand the simulation strategies being described. Then in Section 3 and Section 4I describe the Pauli frame simulator and stabilizer tableau simulators implemented by Stim. Section 5 describes several of the software engineering details that went into Stim, such as why I decided to use a dense representation instead of a sparse representation and how I went about testing that operations were implemented correctly. Stim’s performance is compared to other stabilizer simulators in Section 6. Finally, Section 7 wraps everything up. There are also a few short usage examples in Appendix A. 2 Key Concepts 2.1 Pauli Products 1 0 0 1 I = X = Z = There are four Pauli gates: the identity 0 1 , the bit flip 1 0 , the phase flip 1 0 0 −i Y = 0 −1 , and the other one i 0 .