QDiff: Differential Testing of Quantum Software Stacks

Jiyuan Wang, Qian Zhang, Guoqing Harry Xu, Miryung Kim University of California, Los Angeles CA, United States {wangjiyuan, zhangqian, harryxu, miryung}@cs.ucla.edu

Abstract—Over the past few years, several quantum software A quantum software track (QSS) includes (1) APIs and stacks (QSS) have been developed in response to rapid hardware language constructs to express quantum algorithms, (2) a advances in . A QSS includes a quantum compiler that transforms and optimizes a given input language, an optimizing compiler that translates a written in a high-level language into quantum algorithm at the circuit level, and (3) a backend executor that gate instructions, a that emulates these either simulates the resulting gates on classical devices or instructions on a classical device, and a software controller that executes directly on quantum hardware. Consider Qiskit [4], sends analog signals to a very expensive quantum hardware based which consists of four components: QiskitTerra that com- on quantum circuits. In comparison to traditional compilers piles and optimizes programs, QiskitAer that supports high- and architecture simulators, QSSes are difficult to tests due to QiskitIgnis the probabilistic nature of results, the lack of clear hardware performance noisy simulations, for error cor- specifications, and quantum programming complexity. rection, noise characterization, and hardware verification, and This work devises a novel differential testing approach for QiskitAqua that helps a developer to express a quantum QSSes, named QDIFF with three major innovations: (1) We algorithm or an application. Other QSSes [2] such as Pyquil generate input programs to be tested via semantics-preserving, have a similar set of components—e.g., Pyquil uses quilc as source to source transformation to explore program variants. qvm (2) We speed up differential testing by filtering out quantum a compiler and as a quantum simulator. circuits that are not worthwhile to execute on quantum hardware As with any compiler framework, a QSS could be error- by analyzing static characteristics such as a circuit depth, 2- prone. Developers and users often report bugs on popular gate operations, gate error rates, and T1 relaxation time. (3) QSSes [3], [5], [6], and a simple search on StackOverflow We design an extensible equivalence checking mechanism via distribution comparison functions such as Kolmogorov–Smirnov with the keyword “quantum error” would bring up over 500 test and cross entropy. posts on various QSS components, ranging from compiler We evaluate QDIFF with three widely-used open source QSSes: settings, simulation, and the actual hardware [7]. These posts Qiskit from IBM, from Google, and Pyquil from Rigetti. By often reveal deeper confusion that developers face due to running QDIFF on both real hardware and quantum simulators, inherent probabilistic nature of quantum measurements—if a we found several critical bugs revealing potential instabilities program produces a result that looks different from what is in these platforms. QDIFF’s source transformation is effective in producing semantically equivalent yet not-identical circuits expected, is it due to a bug or the non-determinism inherent (i.e., 34% of trials), and its filtering mechanism can speed up in quantum programs? Is there divergence beyond expected differential testing by 66%. noise coming from an input program, a compiler, a simulator, and/or hardware? I.INTRODUCTION Quantum computing is an emerging computing paradigm Challenges. There are three technical challenges that make it that promises unique advantages over classical computing. difficult to test QSSes. However, quantum programming is challenging. 
Quantum The first challenge is how to generate semantically equiva- computation logic is expressed in and quantum gates, lent programs for testing quantum compilers [8], [9]. Though and the states of quantum registers are measured probabilis- existing compiler testing techniques generate equivalent pro- tically. Due to the physical properties of quantum mechanics, grams by manipulating code in dead regions or unexecuted qubits and quantum gates are fundamentally different from branches [10], a quantum program usually does not have many classical bits and gate logic. For example, quantum indeter- unexecuted branches. Therefore, we must design a technique minacy dictates that the same quantum program can produce that can produce a large number of semantically-equivalent different results in different executions. variant circuits that may induce divergence on hardware. To this end, many quantum software stacks such as Google’s The second challenge is that compilers are not the only Cirq [1], Rigetti’s Pyquil [2], and IBM’s Qiskit [3] have source of bugs in a QSS. For example, in a line of recent been developed to provide user-friendly high-level languages work [11], [12] on verified quantum compilers, compilers are for quantum programming, abstracting away the underlying guaranteed to be correct with respect to their optimization physical and mathematical intricacies. and transformation rules. However, the overall correctness of QSSes goes beyond compiler optimization rules. We must of logically equivalent circuits such as the circuit depth (i.e., reason about subsequent execution on quantum simulators and moments), the number of 2- gates, the known error hardware, which are often the major sources of instabilities. rates of each gate kind, and T1 relaxation time for quantum In fact, Rigetti or Cirq’s Github histories indicate that bugs hardware. QDIFF obviates the need of executing certain circuits may come from its various backends due to improper initial- on hardware or noisy simulators, if the logically equivalent ization [13], problematic APIs [14], or inappropriate synthesis circuits are incapable of revealing meaningful instabilities in options [15]. QSS. The third challenge is how to interpret measurements from Third, in order to compare executions of logically equivalent quantum simulators or hardware, since quantum measurement circuits, QDIFF first identifies how many measurements are operations by definition are probabilistic in nature due to needed for reliable comparison. Based on the literature of quantum indeterminacy [11]. Therefore, when we execute distribution sampling, QDIFF adapts closeness testing in L1 logically equivalent program variants, comparing their mea- norm [17] to estimate the required number of measurements surement results is not straightforward, as we must take into given confidence level p = 2/3. Using either a noisy simulator account inherent noise (e.g., hardware gate errors, readout or directly on hardware, it then performs the required number errors, and decoherence errors) and must determine whether of measurements. To compare two sets of measurements, the observed divergence is significant enough to be considered QDIFF uses distribution comparison methods. Two methods as a meaningful instability. based on Kolmogorov-Smirnov (K-S) test [18], [19] and Cross Entropy [20], [21] are currently supported. This com- This Work. QDIFF is a novel differential testing technique for parison method is easily extensible in QDIFF, as a user can QSSes. 
QDIFF takes as input a seed quantum program and define a new statistical comparison method [17] in QDIFF. reports potential bugs with a pair of witness programs (i.e., logically equivalent programs that are supposed to produce Results. We evaluated QDIFF with the latest versions of three identical behavior) but triggers divergent behavior on the target widely-used QSSes: Qiskit, Cirq, and Pyquil. With six seed QSS (i.e., meaningful instabilities beyond expected noise). quantum algorithms, QDIFF generates 730 variant algorithms QDIFF overcomes the aforementioned challenges with three through semantics-modifying mutations. Starting from the technical innovations below. generated algorithms, it generates a total of 14799 program variants using semantics-preserving source transformations. First, QDIFF generates logically equivalent quantum pro- This generation process took QDIFF 14 hours. With the filtering grams using a set of equivalent gate transformation rules. mechanism, QDIFF reduces its testing time by 66%. Using This is based on the insight that each quantum gate can be QDIFF, we determined total 6 sources of instabilities. These represented mathematically as a unitary matrix; a sequence include 4 software crash bugs in Pyquil and Cirq simulation, of gates essentially corresponds to the multiplication of their and 2 potential root causes that may explain 25 out of 29 cases unitary matrices, which also yields a unitary matrix. As such, of divergence beyond expected noise on IBM hardware. one sequence of gates is semantically equivalent to another if the two sequences yield the same unitary matrix. We The main contributions of this work include: leverage seven gate transformation rules to generate different • We present the first differential testing framework for gate sequences that are guaranteed to yield the same unitary QSSes. This framework is integrated with three widely matrix. These sequences essentially define logically equivalent used QSSes from Google, IBM, and Rigetti. program variants that are supposed to yield identical behavior. • We embody three technical innovations in QDIFF to make Such variants are then executed on a simulator or hardware differential testing fast and to increase the chance of and their measurements are compared. finding meaningful instabilities beyond the usual expected Second, with multiple logically equivalent variants gener- noise: (1) auto-generation of logically equivalent variants, (2) auto-filtering of circuits that are worthwhile to run ated after optimization, QDIFF selects a subset of those circuits that are worthwhile to run on a noisy simulator or quantum on hardware (or a noisy simulator), and (3) pluggable hardware. This selection and filtering process is crucial for methods of comparing measurements in terms of statis- both speedup and potential cost saving, because there are tical distribution. QDIFF obviates the need of invoking only a very few quantum hardware devices in the world, and quantum back-ends by 66%. Using QDIFF, we determined they are extremely expensive resources—D-wave’s quantum 6 sources of instabilities in three widely used QSSes: computer is reported to be $15 million in 2017 [16]. IBM 4 crash bugs and 2 root causes for divergences beyond Qiskit allows public hardware access up to 5 qubits, but expected noise on IBM hardware. a higher number of qubit usage is strictly restricted. 
Noisy We provide access to the artifacts of QDIFF at simulators are extremely compute-intensive to run on classical https://github.com/wjy99-c/QDiff. computers as well, taking on more than 4 minutes to do II.BACKGROUND 3000 measurements on a simple four-qubit circuit. This noisy simulation cost becomes extremely unmanageable with the Quantum Computing. Quantum computing emerges as a higher number of qubits. promising technology for many domains where quantum com- To achieve speed up, QDIFF analyzes the static characteristic puters have been demonstrated to outperform classical com- TABLE I: Different layers that bugs appear Layers Percentage Example Compiler (optimizations & settings) 53.9% MergeInteractions() returns different results for the same circuit [22]. Backend (simulators & hardware) 19.7% Simulators change a global random state on Cirq [23]. API and quantum gate 11.8% PhasedXPowGate raised to a symbol power fails the is parameterized protocol [24]. puters by a large margin. For example, Grover’s algorithm√ [25] and |1i, represented as: can find a given item in an unsorted database with a N-times |ψi = α|0i + β|1i, where |α|2 + |β|2 = 1 (1) speedup when running on a quantum computer, compared to a classical computer. Similar to programs executed on The state of a qubit is unknown until a measurement is heterogeneous devices such as FPGAs, a quantum application completed, resulting in |α|2 to be 0 and |β|2 to be 1. Naturally, is comprised of host code that runs on CPU and quantum code the sum of the probabilities (i.e., the modulus squared of that runs on quantum hardware. Generally, quantum computers amplitudes) is 1: |α|2 + |β|2 = 1. work as accelerators to execute the compute-intensive parts of Quantum Gate. Classical computers use logic gates to trans- the original application. For example, the Shor algorithm [26] form signals from input wires. For example, a NOT gate, also can achieve an exponential speedup by decomposing integer known as an inverter, takes a single signal as input and outputs factorization into a reduction to be executed on a classical the opposite value. The quantum gate analogous to NOT is an computer and an order finding problem to be executed on X gate, which transforms a qubit α|0i + β|1i to β|0i + α|1i. quantum hardware. An X gate has the following matrix-based representation: Quantum Software Stack. Currently, three most widely used 0 1 X = (2) QSSes are Qiskit, Cirq, and Pyquil [27]. 1 0

• Qiskit [3] is an open-source framework developed by Using a vector to represent the quantum state α|0i + β|1i, IBM. Qiskit provides a software stack that is easy applying an X gate has the following effect: to use quantum computers and facilitates quantum α β α0 computing research. Qiskit consists of Qiskit Terra X = = (3) β α β0 (compiler), Qiskit Aer (several quantum simulators), Qiskit Ignis (which supports error correction and Other commonly-used quantum gates include H, T , CNOT , noise characterization), and Qiskit Aqua (APIs to help Z, CZ , S, U1 , and U3 ; a full explanation of these gates is developers write applications). elsewhere [30]. • Cirq [28] is an open-source Python framework from Since all quantum gates can be represented as matrices, Google. It enables a developer to create and simulate we can transform a sequence of gates to another logically Noisy Intermediate-Scale Quantum (NISQ) circuits. Cirq equivalent sequence without altering its outcome, as long as consists of an optimization component to compile and the multiplication of the matrices for gates in each sequence transform circuits, a simulator component for emulat- produces the same result. As an example, a SWAP gate, a two- ing quantum computation on classical devices, and other qubit gate that swaps the two qubits’ states, is semantically development libraries for application development. equivalent to a sequence of three CNOT gates. • Pyquil [29] is an open-source Python framework devel- SWAP(q1, q2) = CNOT(q1, q2)CNOT(q2, q1)CNOT(q1, q2) (4) oped by Rigetti. It builds on Quil, an open quantum instruction language for near-term quantum computers, This observation forms the foundation of QDIFF’s program and uses a combined classical/ model. variant generation procedure, detailed in Section IV. PyQuil is the main component of Forest, the overarching . A quantum circuit consists of a set of platform for Rigetti’s quantum software. Pyquil consists connected quantum gates. Since quantum gates can be rep- of: quilc (compiler), qvm (quantum virtual machine with resented as unitary matrices, a quantum circuit is essentially several kinds of simulators), and pyquil (a library to the multiplication of the matrices. help users write and run quantum applications). Figure 1a shows a program in IBM’s Qiskit [3]. It first registers two qubits and initializes them to |0i. Next, an H All three QSSes are similar to one another in that each 1 1 gate sets the first qubit into state √ |0i+ √ |1i and an X gate includes a quantum programming language, an optimizing 2 2 flips the second qubit from |0i to |1i. Finally, a measurement is compiler that outputs quantum gate instructions, a quantum performed and stored in a classical array. The function returns simulator that emulates these instructions on a classical device, this circuit as a function-type value. and a software controller that executes gate instructions on Figure 1b shows host code that calls this quantum circuit. quantum hardware. Users can execute this circuit on different backends such as Quantum Bit. A quantum bit, or qubit for short, is the basic real quantum hardware or simulators. The circuit is executed unit of quantum computation. Unlike a classical bit that is on qasm simulator for a thousand times (shots=1000). 
Since either 0 or 1, a qubit’s state is a probabilistic function of |0i the first qubit state is √1 |0i+ √1 |1i and the second qubit state 2 2 1 def make_circuit() -> QuantumCircuit: 2 qubit = QuantumRegister(2,"qc") 3 bit = ClassicalRegister(2,"qm") 4 prog = QuantumCircuit(qubit, bit) 5 prog.h(qubit[0]) 6 prog.x(qubit[1]) 7 for i in range(2): 8 prog.measure(qubit[i], bit[i]) 9 return prog (a) A quantum circuit with an H gate and an X gate.

1 prog = make_circuit() 2 backend = BasicAer.get_backend(’qasm_simulator’) 3 info = execute(prog, backend=backend, shots=1000). result().get_counts()

(b) The host code using the circuit in (a). Fig. 1: An example Qiskit program. Fig. 2: QDIFF Overview is |1i, each run produces a result of either |01i or |11i with gamma should not have any effect on the measurement re- 1 D the equal probability of 0.5 = ( √ )2. sults.” actually believed the disagreement is due to quantum 2 indeterminacy, until C explicitly labeled it as a compiler bug. T1 relaxation time. In quantum computing, a qubit can retain data for only a limited amount of time, referred to as relaxation Key Takeway. This study of Github issues shows that QSS Time because a qubit in a high-energy state (state |1i) naturally bugs are not just traditional compiler bugs and may come decays to a low-energy state (state |0i). The time span for this from various sources at different layers: API implementation decay is referred to as T1 Relaxation Time. For a physical of high-level gates, backend simulators, or hardware execution. circuit, its measurement results are unreliable if its execution This problem of detecting QSS bugs is further complicated by time is longer than T1. the probabilistic and noisy nature of quantum indeterminacy.

III.MOTIVATION IV. QDIFF APPROACH To understand real-world QSS bugs, we collected 76 latest issues reported on Github Pyquil, Cirq, and Qiskit. After QDIFF contains three novel components to detect meaning- excluding 11 issues related to installation and other tools, we ful instabilities in QSSes. Figure 2 shows program variant cagegorized the remaining 65 issues by the layer where each generation using equivalent gate transformation and mutations issue appears: compilers, backends, and APIs. (Section IV-A); backend exploration leveraging selective in- Table I summarizes the percentage of the correspond- vocation of quantum simulators and hardware (Section IV-B); ing layer and a representative example. Most bugs appear and equivalence checking via distribution comparison (Sec- at the compiler level—compiler optimizations and settings. tion IV-C). Starting from a seed program, QDIFF iterates these For example, Cirq produces incorrect circuits when using steps until a time limit is reached. Our key insight is that MergeInteractions() [22]. The second most common (1) we can generate semantically equivalent but syntactically bugs appear at the backend level—simulators and hardware different circuits, and (2) we can speed-up the differential execution, e.g., the initial state is unexpectedly modified by execution process by filtering out certain circuits, because they the Cirq simulator in [23]. The other issues are with respect will definitely lead to unreliable divergence and thus are not to the API implementation of high-level gates. For example, worthwhile to run on hardware or noisy simulators. cirq.PhasedXPowGate in Cirq, when is invoked with an A. Program Variant Generation input argument exponent, invoking is_parameterized on the resulting gate should return true but returns false due Prior work [10] finds that testing compilers with equivalent to a bug in cirq.PhasedXPowGate. programs is highly effective. QDIFF adapts this idea to the The aforementioned bugs are hard to find due to quantum domain of quantum compiler testing by creating logically indeterminacy. The following excerpt illustrates a concrete equivalent gate sequences. It then checks whether the cor- example of how one google tutorial user is confused whether responding equivalent circuits produce the same results (in a problem is a bug or due to inherent non-determinism. this case, a similar statistical distribution with some noise) Cirq’s VQE (Variational-Quantum-Eigensolver) tutorial had a on quantum simulators or hardware. For this purpose, QDIFF mistake—the orders of mix layer and cost layer were described generates program variants by repeating the two-fold process in a wrong order. The tutorial user D (shortened) posted a of applying semantics-preserving gate transformation to each question on StackExchange [31] with the embedded code from generated program in each iteration and applying semantics- Cirq’s VQE tutorial. A Cirq developer C (shortened) noticed modifying mutations to diversify the pool of input programs that “The strange thing is the example output shows the output in the next iteration. probabilities varying with gamma (the CZ parameter)”, where Equivalent Gate Transformation (EGT). As discussed earlier TABLE II: Explored Compiler Configurations Framework Options Description BasicSwap, LookaheadSwap, Specify how swaps should be inserted to make the circuit compatible with the StochasticSwap coupling map. QDIFF checks if BasicSwap has the most SWAP gates. 
Qiskit Optimization_level=0,1,2,3 Specify the optimization level—the higher the level, the simpler the resulting circuit. QDIFF checks if a higher-level optimization generates a more complex circuit. DropEmptyMoment() Remove empty moments from a circuit. QDIFF checks empty moments in the circuit. Cirq MergeInteractions() Merge adjacent gates; QDIFF checks the applicability of G6 in Table III. PointOptimizationSummary() User-defined optimization. PRAGMA INITIAL_REWIRING {"NAIVE" Change the optimization/mapping style ,"GREEDY", "PARITIAL"} Pyquil PRAGMA {COMMUTING_BLOCKS Change the optimization style for certain part of the code. ,PERSERVE_BLOCKS}

TABLE III: QDIFF generates program variants based on EGT This is based on the insight that the program with the most rules G1-G7. deviating results has a higher chance to expose unseen behav- Rule ID Original Construct Equivalent Construct ior [10], [33]. Please note that these mutations do not preserve CNOT(q1,q2) semantics; instead, the goal of mutations is to resume the next G1 SWAP(q1,q2) CNOT(q2,q1) round of differential testing with a different algorithm. We start CNOT(q1,q2) with four mutation operators listed below, used in quantum G2 H(q1)H(q1) Merged to Identity Matrix mutant generation [34], [35]. G3 X(q1) H(q1)S(q1)S(q1)H(q1) G4 Z(q1) S(q1)S(q1) • Gate Insertion/ Deletion (M1) inserts/deletes random G5 CZ(q1,q2) H(q2)CNOT(q1.q2)H(q2) quantum gates: e.g., insert prog.x(qubit[1]); G6 CZ(q ,q )CZ(q ,q ) Merged to Identity Matrix 1 2 1 2 • Gate Change (M2) changes a quantum gate to another G7 CCNOT(q1,q2,q3) 6 CNOT gates with 9 one-qubit gates gate: e.g., from prog.x(qubit[1]) to prog.h(qubit[1]); • Gate Swap (M3) swaps two quantum gates; in Section II, one quantum gate sequence is semantically • Qubit Change (M4) changes the qubits: e.g., from equivalent to another sequence, if they both yield the same prog.x(qubit[1]) to prog.x(qubit[2]). unitary-matrix representation. In fact, complex quantum gates, B. Quantum Simulation and Hardware Execution without altering the outcome, can be described as a combi- nation of basic quantum gates. QDIFF leverages seven gate Compiler Configuration Exploration. QDIFF automatically transformation rules to map complex gates into sequences of explores different compiler configurations. In Qiskit, com- simple gates, as shown in Table III. In G7, a Toffoli gate piler settings can be specified by the arguments passed to (i.e. CCNOT ) can be replaced with 6 CNOT gates and 9 the backends e.g., users can apply optimization level=1 one-qubit gates. To explore the alternative representation of a to collapse adjacent gates via light-weight optimization, given gate sequence, QDIFF finds the applicable transformation while optimization level=3 does heavy-weight optimiza- rules by matching the gate names. For an example circuit tion to resynthesize two-qubit blocks in the circuit. In S(q2)Z(q1), QDIFF identifies the complex gate Z(q1), applies Cirq, compiler settings must be specified using API invo- rule G2 to construct gate Z(q1) as a sequence S(q1)S(q1), cations: e.g., users can write their own optimization with and generates a final variant S(q2)S(q1)S(q1). As opposed PointOptimizationSummary(). In Pyquil, compiler set- to an optimizing compiler that applies transformation rules to tings are specified using inlined pragmas similar to how reduce the number gates or the total gate depth [3], [11], QDIFF FPGA developers specify high level synthesis options using aims to diversify the pool of input programs through source pre-processor directives, e.g., a region denoted by PRAGMA to source transformation. PRESERVE_BLOCK will not be modified by a compiler. There Seed Diversification via Mutations. While generation of are in total 2, 3, and 2 configuration types for Qiskit, Cirq, logically equivalent programs can find bugs through observing and Pyquil respectively, as shown in Table II. When executing disagreements, they are unlikely to exercise various program- a variant with a specific compiler setting, QDIFF records both ming constructs. To diversify the seed input programs and thrown exceptions and program timeouts. 
to explore hard-to-reach corner cases in quantum compilers. Backend Exploration. Backend exploration runs the same QDIFF borrows the idea of mutation-based fuzz testing [32] input program on different backends, shown in Table V. In and designs a set of semantics-modifying mutations. terms of real hardware execution, QDIFF uses the free version After generating multiple logically equivalent programs of IBM hardware only, because other platforms are currently in each iteration, QDIFF calculates the average distributions proprietary. QDIFF is extensible by specifying a different among the equivalent programs as the reference distribution, backend configuration. Noisy simulators and state-vector sim- then picks the variant that leads to the largest comparison ulators are both included in QDIFF’s backend exploration. distance (e.g., K-S distance) from the reference distribution, Filtering and Selective Invocation on Hardware. QDIFF and randomly applies one of the following four mutation is focused on isolating software defects, not hardware de- operations to the variant in order to generate a new algorithm. fects. Because hardware imperfections such as decoherence TABLE IV: IBM quantum hardware’s gate-level error rates, gate time, and T1 relaxation (decoherence) time.

IBM quantum computer T1 time /µs Gate time /ns 2-qubit gate error rate tm δ2qubit santiago 155.19 408.89 1.79% 379 5 ibm yorktown 56.81 476.44 2.03% 119 5 ibm 16 melbourne 53.37 928.71 3.29% 57 3 ibm belem 74.45 552.89 1.07% 135 11 ibem quito 85.35 353.78 1.03% 241 11

TABLE V: Explored Backends (Simulators and Hardware) TABLE VI: Cumulative probability of KS test Framework Backends Description Measurement State Cumulative State statevector_simulator noiseless sim. Distribution ‘0’ ‘1’ Probability ‘0’ ‘1’ qasm_simulator noiseless sim. Qiskit FakeSantiago A1 464 546 EDFA1 0.464 1 FakeYorktown noisy sim. A2 500 500 EDFA2 0.500 1 FakeMelbourne ibmq_santiago computes tm by dividing T 1 by the average gate execution ibmq_yorktown quantum hardware ibmq_16_melbourne time. It then filters out those whose nm is greater than tm. Take Simulator noiseless/noisy sim. ibm_16_melbourne as an example: with T 1 = 53.57µs and Cirq DensityMatrixSimulator noiseless/noisy sim. the average gate execution time = 928ns, t is 53570/928=57. Aspen-x-yQ-noisy-qvm noisy sim. m Pyquil WavefunctionSimulator noiseless sim. A circuit whose total number of moments is above 57 is Aspen-x-yQ-qvm,Pyqvm noiseless sim. filtered out for ibm_16_melbourne. The threshold δ2qubit is determined in the following way. is present, it is important to filter out circuits that would Suppose a user is willing to tolerate an addition error rate of t invoke errors due to the inherent hardware limitations. With the for the entire quantum program’s final measurements (with 0.1 above observations, QDIFF filters out unnecessary circuits in as the default). Using t, we compute δ2qubit as the maximum two steps. First, QDIFF examines the final gate sequences after number of 2-qubit gates to be added or deleted from the all compiler optimizations and logical-to-physical mappings, number of 2-qubit gates in the original circuit. Suppose that filtering out exactly identical physical circuits by moment-by- CNOT’s error rate for this IBM computer is 1.07%. If CNOT moment comparison. Second, QDIFF analyzes the static char- is used d2qubit times in a row additionally, its updated error d acteristics of circuits to remove those that certainly produce rate would be by 1 − (1 − e)(1 − 0.0107) 2qubit , where e is d unreliable executions (i.e. results dominated by hardware-level the original error rate. Since both e and (1 − 0.0107) 2qubit noise such as gate errors and relaxation errors and hence are relatively small, we can regard the error rate change as d unreliable), while leaving those that may produce meaningful 1 − (1 − 0.0107) 2qubit . Therefore |d2qubit − δ2qubit| should be divergences using Definition IV.1. less than log(1−0.0107)(1 − t). δ2qubit is 11 when t is 0.1. The As discussed in Section II, for a physical circuit, its mea- above thresholds and filtering condition in Definition IV.1 are surement results are unreliable if its execution time is longer customizable according to hardware’s published error rates, supported gate types, and T1 relaxation time. than T1, implying that the number of circuit moments nm (the depth of the circuit) in any circuit should not exceed C. Equivalence Checking via Distribution Comparison a threshold tm. Moreover, different kinds of quantum gates have different inherent error rates. For publicly available IBM Nondeterministic nature of quantum programs makes it quantum computers, error rates of single-qubit operations are difficult for equivalence checking. Developers usually reason in the order of 10−3, while error rates of 2-qubit gate oper- about the output of a quantum circuit by executing it multiple ations are in the order of 10−2. A typical quantum program times to obtain a distribution. 
While numerous distribution contains a significant number of 2-qubit gates, whose errors comparison methods are well studied in statistics, one conse- contribute the most to the overall error rate because such quent yet over-looked question for quantum computing is that gates are more error-prone than 1-qubit gates [36]. Taking how many measurements do we need for a reliable evaluation this into consideration, d2qubit (the difference in the number to ensure the relative error between two distributions is of 2-qubit gates from the original circuit) should not exceed within a given threshold t with confidence p? We design a an application-specific threshold δ2qubit to avoid unreliable novel equivalence checking component, which consists of: (1) results. Leveraging the above observations, we define the a particular distribution comparison method C, and (2) an worthiness of invoking a quantum circuit. estimation of the required number of measurements for C, given a threshold t and confidence p. QDIFF is equipped with Definition IV.1. A circuit is worth invoking on quantum hard- K-S test and Cross Entropy, but is also extensible to other ware or noisy simulator, if it satisfies the following condition: comparison methods by providing a new comparison-specific nm < tm and d2qubit < δ2qubit. measurement estimation.

The threshold tm is determined empirically by two factors: K-S Test [18] has been used to check the equality of dis- (1) IBM computers’ average T 1 time, and (2) the average tributions by measuring the largest vertical distance between gate execution time for all gates, as listed in Table IV. QDIFF empirical distribution functions (EDFs) in two steps. First, 1 backend = BasicAer.get_backend(’qasm_simulator’) max 2 info = execute(prog, backend=backend, shots=1024).result X ().get_counts() H(q1, q2) = − q1(x) log q2(x) (6) x=0 Result: {‘0’: 475, ‘1’: 549} Prior work ensures that, with n = Ω(m2/3 · t−4/3) samples, (a) Quantum simulator. the expected cross entropy of two similar distributions over 1 backend = BasicAer.get_backend(’statevector_simulator’) m outcome states can be bounded with t [17], [39]. t is the 2 info = execute(prog, backend=backend).result(). get_statevector() difference from H(q1, q1). Similar to K-S test, QDIFF estimates the number of required samples to reliably satisfy this bound, Result: [0.70710678+0.j, 0.70710678+0.j] as shown in Equation 7. (b) State-Vector simulator. 1 n = A · √ · m2/3 · t−4/3 (7) Fig. 3: A quantum circuit in Qiskit with two backends: 1 − p quantum simulator and state-vector simulator. For Figure 3, we need 420 measurements with t = 0.1, p = 2/3, and A = 7 for Qiskit. This measurement trials it creates the EDF for a given distribution by calculating are different from K-S test, because we are using different the cumulative probability of different outcome states with distance metrics. respect to the total number of samples. In Table VI, A and 1 Comparison with Reference Distribution. After generating A2 are two state distributions for 1000 samples. A1 has 464 samples in state ‘0’ while 546 samples in state ‘1’. Thus, the a group of equivalent programs and filtering out worthless circuits, QDIFF executes the remaining circuits and calculates consequent EDFA1 indicates that the cumulative probability of A samples in state ‘0’ and state ‘1’ are 0.464 (464/1000) the average distribution from their results. QDIFF will regard 1 reference distribution and 1 ((464+546)/1000) respectively. Next, K-S test calculates this average distribution as the . With the the largest distance D of EDFs for each state, and uses such distribution comparison methods (eg. K-S Test, Cross-entropy, distance to quantify the difference of the original distributions etc), QDIFF compares this reference distribution with each under comparison. In Table VI, the largest distance is 0.06 result distribution and reports divergence when the distance t (|0.464 − 0.500|) for state ‘0’. is larger than the threshold . QDIFF evaluates this K-S distance with a user-defined Reporting Divergence Explanation. QDIFF reports the poten- threshold t. If the K-S distance of two results is less than tial source of the divergence in program P : t, QDIFF regards them as similar results. QDIFF provides a 1) If P finds divergence when using different backends statistical guarantee on this comparison by estimating the while keeping a frontend’s options unchanged, QDIFF required number of samples. 
For two distributions d1 and d2 reports this as a potential backend source; over m outcome states, prior work [17] theoretically ensures 2) If P finds divergence when using a specific backend 1/2 −2 that, with n = Ω(m · t ) samples, a sample-optimal tester while varying a frontend’s options, QDIFF reports this as can check if the relative error of L1 distance is within a a potential frontend source; threshold t when using a default confidence level of p=2/3. 3) Otherwise, QDIFF reports this as other sources, such as a QDIFF estimates n using Equation 5. This estimated number is potential bug in the API gate implementation. directly applicable to QDIFF because K-S distance is bounded by L1 distance [18]. In other words, Equation 5 calculates V. EVALUATION the number of measurements required, parameterized with We evaluate following research questions: respect to p. We empirically set p as 2/3, as it is a commonly RQ1 How many syntactically different programs can be used default in quantum volume measurement [37], [38], generated by QDIFF’s mutation and equivalent gate bioinformatics, and other statistics comparison. transformation? RQ2 How much speedup can we achieve via filtering and 1 n = A · √ · m1/2 · t−2 (5) obviating the need of invoking a quantum simulator or 1 − p hardware? where A is a platform-related constant and m is the number RQ3 What has QDIFF found via differential testing of the of qubit states. In Figure 3, 2828 measurement samples are widely-used QSSes? needed, when we empirically set t = 0.1, p = 2/3, and A = Benchmarks. QDIFF starts differential testing with five well 12 for Qiskit. In our evaluation, we empirically measure the known quantum algorithms as seed programs [25], [26], constant A for each platform by repeatedly running the same [40]–[42] (Deutsch-Jozsa, Berstein-Vazira, VQE–Variational- programs and compare the results. Quantum-Eigensolver, Grover, and QAOA–Quantum Approx- Cross Entropy measures the difference between distributions imate Optimization Algorithm) and one additional program X via the total entropy. It represents the average number of bits Gate listed in Table VII. We do not use the relatively large needed to encode data coming from an underlying distribution algorithms like Shor’s [26] because IBM’s public access can q1 when we use an estimated target distribution q2. support up to 16 qubits only. Large algorithms also require # Variants created via mutations # Total circuits 300 3,721 3,546 4,000 2,000 40 # Total number of circuits # Circuits producing divergences 1,489 2,487 1,377 200 2,103 2,021 171 168 121 129 2,000 1,000 20

# Circuits 696 # Variants 95 652 606 100 921 12 46 8 175 3 2 3 # Generated Circuits 1 0 0 0 0 # Divergences on HW P1 P2 P3 P4 P5 P6 P1 P2 P3 P4 P5 P6 (a) Total circuits and variants originating from each seed algorithm Fig. 5: Circuits producing divergences on IBM hardware % Circuits worthwhile to run 100% 300 Time saving (hours) 242 233 1 qc = get_qc(’Aspen-0-3Q- 200 A-qvm’) 1 p = Program(X(0), X(1). result_rm = qc. 50% 37% 42% 2 controlled(0)) 31%83 28% 30%94 run_and_measure(p) 19% 72 100 2 qvm = PyQVM(n_qubits=2) 21 3 qvm.execute(p) Time saving (hours) 0% 0 (a) run_and_measure mea- % circuits worthwhile to run P1 P2 P3 P4 P5 P6 sures all 16 qubits in device (b) Controlled X gate raises an (b) The percentage of circuits that are worth running Aspen-0 when a simulator al- error when 2 qubits are allo- Fig. 4: Statistics of QDIFF-generated circuits. locates 3 qubits, resulting in an cated for simulation. exception. many moments and 2-qubit gates, often producing unreliable Fig. 6: Bugs in Pyquil found by QDIFF. results on quantum hardware. wait in line on average (as launching a quantum job uses a Experimental Environment. We evaluate QDIFF with three shared web service for a few IBM computers in the world) widely-used QSSes: Pyquil 2.19.0 with Quilc 1.19.0, Qiksit and 10 seconds to execute on hardware. If users run all 3546 0.21.0, and Cirq 0.9.0. We run the circuits on five differ- circuits, the total clock time would be 17 days. With filtering, ent hardware versions based on their availability, including QDIFF removes 68% circuits that are not worthwhile to run ibmq_santiago, ibmq_yorktown, ibmq_16_melbourne, and finishes the execution in 7 days. ibmq_belem. and ibmq_quito. The details of hardware can C. RQ3: What has QDIFF found? be found on IBM’s quantum computing website [3]. We use K-S test with t = 0.1 as the distribution comparison method. From the 730 sets of semantically equivalent circuits, QDIFF found 33 differing outcomes out of 730. 4 out of 33 are A. RQ1 Variant Generation via S2S transformation crashes in simulators. The remaining 29 cases are divergence As shown in Figure 4a, for all six seed algorithms together, beyond expected noise on IBM hardware. By inspecting all 33 QDIFF generates 730 program variants through semantics- cases carefully, we determined total 6 sources of instabilities: 4 modifying mutations and generates the total of 14799 circuits simulator crashes and 2 potential root causes that may explain with equivalent gate transformation to each generated variant. 25 out of 29 cases of divergence on IBM hardware. For the This total circuit generation process takes around 14 hours. remaining 4 divergence cases, we could not easily determine Take P2 Deutsch-Jozsa algorithm as an example. 95 dif- their underlying root causes. ferent program variants are generated through semantics- 1) Crash bugs in simulators: QDIFF reports four crashes modifying mutations. For each variant program, QDIFF gen- during differential testing with both noiseless and noisy simu- erates 20 logically equivalent circuits. For P2, the total gen- lators. All divergences involve clear failure signals. All are due eration for 2103 circuits takes around 2 hours, while the rest to bugs in compiler or simulator implementations, summarized of differential execution via simulation or hardware execution in Table VIII. 2 out of 4 crashes were already reported by takes around 2 days. 
This implies that the bottleneck of testing developers [43], [44] and 2 out of 4 crashes were confirmed is not about input program generation but the execution of by developers, when we filed crash reports [45], [46]. the generated programs, which justifies our approach to select Compiler option error: Given an arbitrary program in Cirq, which circuits are worthwhile to run. when the compiler option clear_span is set to a negative B. RQ2 Speed Up number or clear_qubits is set to an unregistered qubit, As shown in Figure 4b, after filtering, only 19%-42% of the execution does not terminate. Cirq does not check the the generated circuits are retained for differential execution on boundary values of compiler options and attempts to reset non- quantum hardware, leading to a 66% reductions in quantum existing qubits. This bug was detected during explorations of hardware or noisy simulator invocations. QDIFF finishes the compiler settings. When these compiler options are set to the entire testing process within around 17 days by leveraging aforementioned values, QDIFF notices that the execution does its circuit selection process, which means QDIFF would have not finish in a reasonable time, indicating a potential infinite th saved additional 30 days of testing time by filtering. loop. QDIFF found this bug on the 74 iteration with P3. Take QAOA as an example, when running on IBM quantum Backend registers a wrong number of qubits: QDIFF found hardware, the experiment would take around 7 minutes to that Pyquil’s measurement crashed on Aspen-0-xQ-A-qvm. TABLE VII: Seed subject programs ID Program # of Qubits Moments 2-qubit gate Description Iteration Number Measurement trial with t = 0.1 P1 X gate 1 1 0 one-qubit X gate 46 2000∼2828 P2 Deutsch-Jozsa 4 39 33 check if a function is balanced 95 5293∼7998 P3 Bernstein-Vazira 4 41 32 find a and b for f(x) = ax+b 121 5293∼7998 P4 Grover 5 84 53 find a unique input in a database 129 8000∼11312 P5 VQE 4 36 28 approximate the lowest energy level 171 5293∼7998 P6 QAOA 5 29 19 QAOA algorithm 168 8000∼11312

TABLE VIII: Bugs found by QDIFF when executing generated √ l0–#qc0 Rz(3.54) X X programs with simulation only l1–#qc1 ... Rz(5pi/2) ... Platform Bugs Description Source of Bugs Cirq Program runs endlessly with some compiler settings Compiler Setting l2–#qc3 X X Simulator register wrong number of qubits Simulator Backends l3–#qc4 pyquil Crashes on control gates on certain backends Gate implementation Simulator stuck into a bad state Simulator Backends (a) ”Divergent” circuits √ l0–#qc0 Rz(3.54) X X

l1–#qc3 Rz(5pi/2) X X In Pyquil, users can specify the quantum simulator to have the ...... same topology as a real 16 qubit device Aspen-0-16Q-A by l2–#qc1 X setting the backend to be Aspen-0-16Q-A-qvm (16Q refers l3–#qc4 X to using 16 qubits in simulation). However, QDIFF found that when a user allocates to use 3 qubits in simulation (line 1 (b) ”Good” circuits in Figure 6a) but attempts to conduct measurements at line 2 Fig. 7: Divergence on IBM ibmq_yorktown: no operation on using run_and_measure, Pyquil simulator crashes due to a qc4 for a long time. wrong number of allocated qubits. This bug was found with QDIFF’s backend exploration. Another user reported the same √ issue on Pyquil’s Github [43]. l0–#qc3 X Rz(pi/2) X √ Simulator is stuck into a bad state: Pyquil’s simulator raises l1–#qc2 X Rz(3pi/2) X ... √ an exception when taking an empty circuit as input. Afterward, l2–#qc1 X Rz(2.68) all subsequent invocations to the simulator crash even when √ l3–#qc0 X Rz(pi/2) the input circuit is valid and not empty. QDIFF detects this bug by mutating an input program to an empty program (a) Divergent circuits (CX between physical qubits 2 and 3) and executing it on both pyqvm simulator and state-vector √ l0–#qc2 X Rz(3pi/2) simulator. While the state-vector simulator runs this empty √ l1–#qc1 X Rz(pi/2) X circuit normally, pyqvm throws an exception. Then, QDIFF ... √ generates an arbitrary non-empty program. In the next few l2–#qc0 X Rz(2.68) QDIFF pyqvm √ iterations, finds this bug because crashes, while l3–#qc3 X Rz(pi/2) the state-vector simulator does not. Wrong type of the controlled gate: Figure 6b shows a (b) ”Good” circuits (CX between physical qubits 1 and 2) scenario where Pyquil crashed when a control X gate was Fig. 8: Divergence detected on IBM hardware: bad connection used on pyqvm with 2 qubit allocation. The exception message between qc 2 and qc 3. shows “ValueError: cannot reshape array of size 4 into shape (2,2,2,2)”. A control X gate should be a 2-qubit gate (which is For 29 divergence cases on IBM hardware, we manually 4 × 4 matrix), but the gate was represented as 1 × 4 by pyqvm, inspected the corresponding circuits. We then determined 2 which is a bug. This bug was found in the 15th iteration with root causes that may explain 25 out of 29 cases. For the P1 as a seed. remaining 4 cases, we could not easily determine underlying 2) Divergences on real hardware: Figure 5 reports the root causes. We discuss each root cause with examples in this total numbers of circuits executed on quantum hardware and subsection. those that exhibit behavioral divergence. The rate of latter is Divergence due to 2 qubit gate errors We concluded that 1, 2, relatively low (0.3%-0.8%), which demonstrates the robustness 6, and 7 divergences in P1, P3, P5, and P6 respectively—55% of the quantum software and hardware we tested—this is of all divergences detected on hardware in total—can be not surprising since such software and hardware has been explained by placing many 2 qubit gates between so called widely used and continuously improved for a number of years. couplers qubits. IBM released the reliability of connection On the other hand, these results also highlight the need of between each qubit pair on their hardware [3]. 
For example, a systematic testing framework such as QDIFF for quantum in hardware santiago, mapping CX between physical qubits developers—since quantum bugs are rare and hard-to-detect, {2, 3} is less reliable than mapping CX between qubits {1, developers should test their programs/tools exhaustively with 2} [3]. We found 16 out of 29 divergences could have the same a QDIFF-like approach before releasing them. underlying cause of using CX between known, unreliable qubit connections. These errors could be reduced through publicly available hardware limits public access up to 16 improved mapping to 2-qubit gates or using other strategies qubits and the waiting time tends to increase significantly, as such as randomized compilation [47], [48]. you request more qubits. Running experiments on Google’s Consider the two equivalent circuits shown in Figure 8, sycamore processor with 53 qubits and 1000+ 2q gates may generated from P6 QAOA. These two circuits generated di- produce different results [51]. vergent measurements on santiago, although both circuits have nearly the same moments and the same number of 2- VI.RELATED WORK qubit gates. This is because the CX error rate of physical Compiler and Framework Testing. Random differential test- qubits 2 and 3 is much higher than qubits 1 and 2 (77% ing (RDT) [52], [53] is a widely-used technique that compiles higher according to published information from IBM [3]). the same input program with two or more compilers that It appears that Qiskit’s logical to physical qubit mapping implement the same specification. Equivalence modulo inputs procedure does not always avoid the use of CX on qubits (EMI) [10] is such an example that tests compilers by gener- 2 and 3 in their compilation and qubit allocation steps. Thus, ating equivalent variants. Many random program generators Figure 8a produces divergence from Figure 8b. are used for compiler testing [54]. Csmith [55] randomly generates C programs and checks for inconsistent behaviors Divergence due to qubit dephasing & decoherence: 9 of 29 via differential testing. Quest [56] focuses on argument passing divergences—2, 3, 2, and 2 divergences in P2, P4, P5, and and value returning, while testing with randomly generated P6—could have the same underlying cause of qubit dephasing programs. Different from Csimth-like tools, refactoring-based & decoherence. Qubits that remain idle for long periods tend testing systematically modifies input programs with refactor- to dephase and decohere [49]. Figure 7 shows a pair of ings, as opposed to random program generation. Orion [10] circuits with a similar depth and a similar number of 2- adapts EMI to test GCC and LLVM compilers. Christopher qubit gates. However, when run on hardware ibm_belem, et al. [33] combine random differential testing and EMI-based the pair produces divergences beyond expected noise. We testing to test OpenCL compilers. Orison [57] uses a guided speculate that, when no operation is applied to physical qubit mutation strategy for the same purpose. 4 for 18 moments, it may increase dephasing and decoherence Such classical compiler testing is not directly applicable to possibilities. This can by fixed by adding two successive Pauli quantum software stacks due to the three challenges: (1) how Y gates [30] on idle qubits during the compilation phase [49]. 
to generate variants, (2) how to test simulators and hardware Others: For other 4 divergence, we could not easily determine together with compilers, and (3) how to interpret quantum underlying root causes. Because the circuit moments and measurements for differential testing. the number of CX gate are roughly the same with their Quantum Testing and Verification. Zhao [58] introduces equivalent groups, stochastic errors in hardware can mostly be a quantum software life cycle and lists the challenges and ruled out. The problem could be low-level quantum control opportunities we face. Ying et al. [59] formally reason about software bugs that emerge from different combinations of quantum circuits by representing qubits and gates using gates, resulting in different control / coherent errors introduced matrix-valued Boolean expressions, and verify them using a at the pulse level. combination of classical logical reasoning and complex matrix D. Threats to Validity operations. Huang et al. [60] introduce quantum program assertions, allowing programmers to decide if a quantum Lack of Error Correction: The number of divergences on state matches its expected value. They define a logic to quantum hardware found by QDIFF would depend on the provides -robustness to characterize the possible distance reliability of hardware and its error correction capability. If between an ideal program and an erroneous one. Proq [61] it were to run on an error-corrected quantum hardware, which is a runtime assertion framework for testing and debug- does not exist yet, it may report fewer divergences and it would ging quantum programs. It transforms hardware constraints be easier to disambiguate whether divergences are caused by to executable versions for measurement-restricted quantum software-level defects as opposed to hardware-level defects. computers. QPMC [62] applies classical model checking on Similarly, 2-qubit gate errors depend on which qubit con- quantum programs based on Quantum Markov Chain. Ali et nections that the gates are applied to, Therefore, it may be al. [35] propose a new testing metric called quantum input necessary to adjust the divergence threshold t based on the output coverage, a test generation strategy, and two new test empirical 2 qubit error rates for each connection and how oracles for testing quantum programs. Two test oracles include many times the 2 qubit gates are used on that connection. Such wrong output oracle that checks whether a wrong output impact of 2 qubit error rates must be investigated further. has been returned, and output probability oracle that checks Time Out: In fuzz testing, longer experimentation periods tend whether the quantum program returns an expected output with to expose more errors or new program execution paths [50]. its corresponding expected probability. However, their work The total time taken for all our experiments was limited to targets at quantum program testing and the measurement they seven days. used might not be sufficient enough. While all these techniques Number of Qubits: The maximum number of qubits that find errors in quantum programs, QDIFF aims to find errors in we used for our experiment was 5 qubits, because the only quantum software stacks. Verified quantum compilers guarantee gate transformation VIII.ACKNOWLEDGMENT and circuit optimization is correct by construction. CertiQ [12] We would like to thank the anonymous reviewers for their is a verified Qiskit compiler by introducing a calculus of quan- valuable feedback. 
We thank Alan Ho for his kind help tum circuit equivalence to check the correctness of compiler in selective invocation and distribution comparison, and for transformation. VOQC [11] provides a verified optimizer for proofreading the draft. This work is supported in part by quantum circuits by adapting CompCert [63] to the quantum National Science Foundations via grants CNS-1703598, CNS- setting. Smith and Thornton [64] present a compiler with built- 1763172, CNS-1907352, CNS-2007737, CNS-2006437, CNS- in translation validation via QMDD equivalence checking. As 2106838, CNS-2128653, CHS-1956322 CCF-1764077, CCF- discussed in Section I, these tools ensure correctness in quan- 1723773, ONR grant N00014-18-1-2037, and Intel CAPA tum gate optimizations. However, QDIFF is a complimentary grant. technique based on testing and its scope includes both quantum backends and frontends. REFERENCES Differential Testing. Differential testing [52] has been used [1] Q. A. team and collaborators, “Cirq,” Oct. 2020. [Online]. Available: to test large software systems and to find bugs in various https://doi.org/10.5281/zenodo.4062499 [2] R. S. Smith, M. J. Curtis, and W. J. Zeng, “A practical quantum domains such as SSL/TLS [65], [66], machine learning appli- instruction set architecture,” arXiv preprint arXiv:1608.03355, 2016. cations [67], JVM [68], and clones [69], etc. Mucerts [70] ap- [3] G. Aleksandrowicz, T. Alexander, P. Barkoutsos, L. Bello, Y. Ben- plies differential testing to check the correctness of certificate Haim, D. Bucher, F. Cabrera-Hernandez,´ J. Carballo-Franquis, A. Chen, C. Chen et al., “Qiskit: An open-source framework for quantum com- validation in SSL/TLS. It uses a stochastic sampling algorithm puting,” Accessed on: Mar, vol. 16, 2019. to drive its input generation while tracking the program [4] ——, “Qiskit: An open-source framework for quantum computing,” coverage. DLFuzz [67] does fuzz testing of Deep Learning Accessed on: Mar, vol. 16, 2019. [5] https://github.com/rigetti/pyquil/issues, 2020. systems to expose incorrect behaviors. Chen et al. [68] perform [6] https://github.com/quantumlib/Cirq/issues, 2020. differential testing of JVM with input generated from Markov [7] https://stackoverflow.com/search?q=quantum++error, April,20 2021. Chain Monte Carlo sampling with domain-specific mutations [8] M. Amy, D. Maslov, and M. Mosca, “Polynomial-time t-depth optimiza- tion of clifford+ t circuits via matroid partitioning,” IEEE Transactions with the knowledge of Java class file formats. on Computer-Aided Design of Integrated Circuits and Systems, vol. 33, Mutation-Based Fuzz Testing. Fuzz testing mutates the seed no. 10, pp. 1476–1489, 2014. [9] Y. Nam, N. J. Ross, Y. Su, A. M. Childs, and D. Maslov, “Automated inputs through a fuzzer to maximize a specific guidance optimization of large quantum circuits with continuous parameters,” npj metric, such as branch coverage. It has been shown to be , vol. 4, no. 1, pp. 1–12, 2018. highly effective in revealing a diverse set of bugs, including [10] V. Le, M. Afshari, and Z. Su, “Compiler validation via equivalence modulo inputs,” ACM SIGPLAN Notices, vol. 49, no. 6, pp. 216–226, correctness bugs [71]–[73], security vulnerabilities [74]–[76], 2014. and performance bugs [77], [78]. For example, AFL [32] [11] K. Hietala, R. Rand, S.-H. Hung, X. Wu, and M. Hicks, “A verified op- mutates a seed input to discover previously unseen coverage timizer for quantum circuits,” Proceedings of the ACM on Programming Languages, vol. 5, no. POPL, pp. 
VII. CONCLUSION

Quantum computing has emerged as a promising computing paradigm with remarkable advantages over classical computing. QDIFF is the first to bring differential testing to quantum software stacks. It adapts the notion of equivalence checking to the quantum domain, redesigns the underlying program generation and mutation methods, and optimizes differential testing to reduce compute-intensive simulation and expensive hardware invocation. It is effective in generating variants, eliminates 66% of unnecessary quantum hardware or noisy simulator invocations, and uses divergence to isolate errors in both the higher and lower levels of the quantum software stack.

VIII. ACKNOWLEDGMENT

We would like to thank the anonymous reviewers for their valuable feedback. We thank Alan Ho for his kind help in selective invocation and distribution comparison, and for proofreading the draft. This work is supported in part by the National Science Foundation via grants CNS-1703598, CNS-1763172, CNS-1907352, CNS-2007737, CNS-2006437, CNS-2106838, CNS-2128653, CHS-1956322, CCF-1764077, and CCF-1723773, by ONR grant N00014-18-1-2037, and by an Intel CAPA grant.

REFERENCES

[1] Q. A. team and collaborators, “Cirq,” Oct. 2020. [Online]. Available: https://doi.org/10.5281/zenodo.4062499
[2] R. S. Smith, M. J. Curtis, and W. J. Zeng, “A practical quantum instruction set architecture,” arXiv preprint arXiv:1608.03355, 2016.
[3] G. Aleksandrowicz, T. Alexander, P. Barkoutsos, L. Bello, Y. Ben-Haim, D. Bucher, F. Cabrera-Hernández, J. Carballo-Franquis, A. Chen, C. Chen et al., “Qiskit: An open-source framework for quantum computing,” Accessed on: Mar. 16, 2019.
[4] ——, “Qiskit: An open-source framework for quantum computing,” Accessed on: Mar. 16, 2019.
[5] https://github.com/rigetti/pyquil/issues, 2020.
[6] https://github.com/quantumlib/Cirq/issues, 2020.
[7] https://stackoverflow.com/search?q=quantum++error, Apr. 20, 2021.
[8] M. Amy, D. Maslov, and M. Mosca, “Polynomial-time T-depth optimization of Clifford+T circuits via matroid partitioning,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 33, no. 10, pp. 1476–1489, 2014.
[9] Y. Nam, N. J. Ross, Y. Su, A. M. Childs, and D. Maslov, “Automated optimization of large quantum circuits with continuous parameters,” npj Quantum Information, vol. 4, no. 1, pp. 1–12, 2018.
[10] V. Le, M. Afshari, and Z. Su, “Compiler validation via equivalence modulo inputs,” ACM SIGPLAN Notices, vol. 49, no. 6, pp. 216–226, 2014.
[11] K. Hietala, R. Rand, S.-H. Hung, X. Wu, and M. Hicks, “A verified optimizer for quantum circuits,” Proceedings of the ACM on Programming Languages, vol. 5, no. POPL, pp. 1–29, 2021.
[12] Y. Shi, X. Li, R. Tao, A. Javadi-Abhari, A. W. Cross, F. T. Chong, and R. Gu, “Contract-based verification of a realistic quantum compiler,” arXiv preprint arXiv:1908.08963, 2019.
[13] https://github.com/rigetti/pyquil/issues/1034, 2020.
[14] https://github.com/rigetti/pyquil/issues/1002, 2020.
[15] https://github.com/quantumlib/Cirq/issues/2553, 2020.
[16] https://www.cio.com/article/3161031/d-waves-quantum-computer-runs-a-staggering-2000-qubits.html, 2017.
[17] S.-O. Chan, I. Diakonikolas, P. Valiant, and G. Valiant, “Optimal algorithms for testing closeness of discrete distributions,” in Proceedings of the Twenty-Fifth Annual ACM-SIAM Symposium on Discrete Algorithms. SIAM, 2014, pp. 1193–1203.
[18] A. Kolmogorov, “Sulla determinazione empirica di una legge di distribuzione,” Inst. Ital. Attuari, Giorn., vol. 4, pp. 83–91, 1933.
[19] N. V. Smirnov, “On the estimation of the discrepancy between empirical curves of distribution for two independent samples,” Bull. Math. Univ. Moscou, vol. 2, no. 2, pp. 3–14, 1939.
[20] I. Good, “Some terminology and notation in information theory,” Proceedings of the IEE - Part C: Monographs, vol. 103, no. 3, pp. 200–204, 1956.
[21] J. Shore and R. Johnson, “Axiomatic derivation of the principle of maximum entropy and the principle of minimum cross-entropy,” IEEE Transactions on Information Theory, vol. 26, no. 1, pp. 26–37, 1980.
[22] https://github.com/quantumlib/Cirq/issues/673, 2020.
[23] https://github.com/quantumlib/Cirq/issues/2240, 2019.
[24] https://github.com/quantumlib/Cirq/issues/1713, 2019.
[25] L. K. Grover, “A fast quantum mechanical algorithm for database search,” in Proceedings of the Twenty-Eighth Annual ACM Symposium on Theory of Computing, 1996, pp. 212–219.
[26] P. W. Shor, “Algorithms for quantum computation: discrete logarithms and factoring,” in Proceedings of the 35th Annual Symposium on Foundations of Computer Science. IEEE, 1994, pp. 124–134.
[27] R. LaRose, “Overview and comparison of gate level quantum software platforms,” Quantum, vol. 3, p. 130, 2019.
[28] (2019) Cirq: A Python framework for creating, editing, and invoking noisy intermediate scale quantum (NISQ) circuits. [Online]. Available: https://github.com/quantumlib/Cirq
[29] R. Computing, “PyQuil documentation,” URL http://pyquil.readthedocs.io/en/latest, 2019.
[30] M. A. Nielsen and I. Chuang, “Quantum computation and quantum information,” 2002.
[31] https://quantumcomputing.stackexchange.com/questions/8694/is-there-a-mistake-in-the-vqe-ansatz-in-cirqs-tutorial, 2019.
[32] “American fuzzy lop (AFL),” http://lcamtuf.coredump.cx/afl/, 2020.
[33] C. Lidbury, A. Lascu, N. Chong, and A. F. Donaldson, “Many-core compiler fuzzing,” ACM SIGPLAN Notices, vol. 50, no. 6, pp. 65–76, 2015.
[34] E. Mendiluze, S. Ali, P. Arcaini, and T. Yue, “Muskit: A mutation analysis tool for quantum software testing.”
[35] S. Ali, P. Arcaini, X. Wang, and T. Yue, “Assessing the effectiveness of input and output coverage criteria for testing quantum programs,” in 2021 14th IEEE Conference on Software Testing, Verification and Validation (ICST). IEEE, 2021, pp. 13–23.
[36] S. S. Tannu and M. K. Qureshi, “Not all qubits are created equal: A case for variability-aware policies for NISQ-era quantum computers,” in Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, 2019, pp. 987–999.
[37] S. Aaronson and L. Chen, “Complexity-theoretic foundations of quantum supremacy experiments,” arXiv preprint arXiv:1612.05903, 2016.
[38] A. W. Cross, L. S. Bishop, S. Sheldon, P. D. Nation, and J. M. Gambetta, “Validating quantum computers using randomized model circuits,” Physical Review A, vol. 100, no. 3, p. 032328, 2019.
[39] Y. Shi, M. Wang, W. Shi, J.-H. Lee, H. Kang, and H. Jiang, “Accurate and efficient estimation of small p-values with the cross-entropy method: Applications in genomic data analysis,” Bioinformatics, vol. 35, no. 14, pp. 2441–2448, 2019.
[40] E. Bernstein and U. Vazirani, “Quantum complexity theory,” SIAM Journal on Computing, vol. 26, no. 5, pp. 1411–1473, 1997.
[41] J. R. McClean, J. Romero, R. Babbush, and A. Aspuru-Guzik, “The theory of variational hybrid quantum-classical algorithms,” New Journal of Physics, vol. 18, no. 2, p. 023023, 2016.
[42] D. Deutsch and R. Jozsa, “Rapid solution of problems by quantum computation,” Proceedings of the Royal Society of London. Series A: Mathematical and Physical Sciences, vol. 439, no. 1907, pp. 553–558, 1992.
[43] https://github.com/rigetti/pyquil/issues/1050, 2019.
[44] https://github.com/rigetti/pyquil/issues/1034, Nov. 20, 2020.
[45] https://github.com/quantumlib/Cirq/issues/3907, Nov. 20, 2020.
[46] https://github.com/rigetti/pyquil/issues/1259, Nov. 20, 2020.
[47] J. J. Wallman and J. Emerson, “Noise tailoring for scalable quantum computation via randomized compiling,” Physical Review A, vol. 94, no. 5, p. 052325, 2016.
[48] A. Hashim, R. K. Naik, A. Morvan, J.-L. Ville, B. Mitchell, J. M. Kreikebaum, M. Davis, E. Smith, C. Iancu, K. P. O’Brien et al., “Randomized compiling for scalable quantum computing on a noisy superconducting quantum processor,” arXiv preprint arXiv:2010.00215, 2020.
[49] https://quantumai.google/cirq/google/best_practices#keep_qubits_busy, 2021.
[50] G. Klees, A. Ruef, B. Cooper, S. Wei, and M. Hicks, “Evaluating fuzz testing,” in Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security, ser. CCS ’18. New York, NY, USA: Association for Computing Machinery, 2018, pp. 2123–2138. [Online]. Available: https://doi.org/10.1145/3243734.3243804
[51] X. Mi, P. Roushan, C. Quintana, S. Mandra, J. Marshall, C. Neill, F. Arute, K. Arya, J. Atalaya, R. Babbush, J. C. Bardin, R. Barends, A. Bengtsson, S. Boixo, A. Bourassa, M. Broughton, B. B. Buckley, D. A. Buell, B. Burkett, N. Bushnell, Z. Chen, B. Chiaro, R. Collins, W. Courtney, S. Demura, A. R. Derk, A. Dunsworth, D. Eppens, C. Erickson, E. Farhi, A. G. Fowler, B. Foxen, C. Gidney, M. Giustina, J. A. Gross, M. P. Harrigan, S. D. Harrington, J. Hilton, A. Ho, S. Hong, T. Huang, W. J. Huggins, L. B. Ioffe, S. V. Isakov, E. Jeffrey, Z. Jiang, C. Jones, D. Kafri, J. Kelly, S. Kim, A. Kitaev, P. V. Klimov, A. N. Korotkov, F. Kostritsa, D. Landhuis, P. Laptev, E. Lucero, O. Martin, J. R. McClean, T. McCourt, M. McEwen, A. Megrant, K. C. Miao, M. Mohseni, W. Mruczkiewicz, J. Mutus, O. Naaman, M. Neeley, M. Newman, M. Y. Niu, T. E. O’Brien, A. Opremcak, E. Ostby, B. Pato, A. Petukhov, N. Redd, N. C. Rubin, D. Sank, K. J. Satzinger, V. Shvarts, D. Strain, M. Szalay, M. D. Trevithick, B. Villalonga, T. White, Z. J. Yao, P. Yeh, A. Zalcman, H. Neven, I. Aleiner, K. Kechedzhi, V. Smelyanskiy, and Y. Chen, “Information scrambling in computationally complex quantum circuits,” 2021.
[52] W. M. McKeeman, “Differential testing for software,” Digital Technical Journal, vol. 10, no. 1, pp. 100–107, 1998.
[53] Y. Tang, Z. Ren, W. Kong, and H. Jiang, “Compiler testing: A systematic literature analysis,” Frontiers of Computer Science, vol. 14, no. 1, pp. 1–20, 2020.
[54] J. Chen, W. Hu, D. Hao, Y. Xiong, H. Zhang, L. Zhang, and B. Xie, “An empirical comparison of compiler testing techniques,” in Proceedings of the 38th International Conference on Software Engineering, 2016, pp. 180–190.
[55] X. Yang, Y. Chen, E. Eide, and J. Regehr, “Finding and understanding bugs in C compilers,” in Proceedings of the 32nd ACM SIGPLAN Conference on Programming Language Design and Implementation, 2011, pp. 283–294.
[56] C. Lindig, “Random testing of C calling conventions,” in Proceedings of the Sixth International Symposium on Automated Analysis-Driven Debugging, 2005, pp. 3–12.
[57] V. Le, C. Sun, and Z. Su, “Finding deep compiler bugs via guided stochastic program mutation,” ACM SIGPLAN Notices, vol. 50, no. 10, pp. 386–399, 2015.
[58] J. Zhao, “Quantum software engineering: Landscapes and horizons,” arXiv preprint arXiv:2007.07047, 2020.
[59] M. Ying and Z. Ji, “Symbolic verification of quantum circuits,” arXiv preprint arXiv:2010.03032, 2020.
[60] Y. Huang and M. Martonosi, “Statistical assertions for validating patterns and finding bugs in quantum programs,” in Proceedings of the 46th International Symposium on Computer Architecture, 2019, pp. 541–553.
[61] G. Li, L. Zhou, N. Yu, Y. Ding, M. Ying, and Y. Xie, “Proq: Projection-based runtime assertions for debugging on a quantum computer,” 2020.
[62] Y. Feng, E. M. Hahn, A. Turrini, and L. Zhang, “QPMC: A model checker for quantum programs and protocols,” in International Symposium on Formal Methods. Springer, 2015, pp. 265–272.
[63] X. Leroy, S. Blazy, D. Kästner, B. Schommer, M. Pister, and C. Ferdinand, “CompCert - a formally verified optimizing compiler,” in ERTS 2016: Embedded Real Time Software and Systems, 8th European Congress, 2016.
[64] K. N. Smith and M. A. Thornton, “A quantum computational compiler and design tool for technology-specific targets,” in Proceedings of the 46th International Symposium on Computer Architecture, 2019, pp. 579–588.
[65] C. Brubaker, S. Jana, B. Ray, S. Khurshid, and V. Shmatikov, “Using frankencerts for automated adversarial testing of certificate validation in SSL/TLS implementations,” in 2014 IEEE Symposium on Security and Privacy. IEEE, 2014, pp. 114–129.
[66] T. Petsios, A. Tang, S. Stolfo, A. D. Keromytis, and S. Jana, “NEZHA: Efficient domain-independent differential testing,” in 2017 IEEE Symposium on Security and Privacy (SP). IEEE, 2017, pp. 615–632.
[67] J. Guo, Y. Jiang, Y. Zhao, Q. Chen, and J. Sun, “DLFuzz: Differential fuzzing testing of deep learning systems,” in Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2018, pp. 739–743.
[68] Y. Chen, T. Su, C. Sun, Z. Su, and J. Zhao, “Coverage-directed differential testing of JVM implementations,” in Proceedings of the 37th ACM SIGPLAN Conference on Programming Language Design and Implementation, 2016, pp. 85–99.
[69] T. Zhang and M. Kim, “Automated transplantation and differential testing for clones,” in 2017 IEEE/ACM 39th International Conference on Software Engineering (ICSE). IEEE, 2017, pp. 665–676.
[70] C. Chen, C. Tian, Z. Duan, and L. Zhao, “RFC-directed differential testing of certificate validation in SSL/TLS implementations,” in 2018 IEEE/ACM 40th International Conference on Software Engineering (ICSE). IEEE, 2018, pp. 859–870.
[71] D. Babić, S. Bucur, Y. Chen, F. Ivančić, T. King, M. Kusano, C. Lemieux, L. Szekeres, and W. Wang, “FUDGE: Fuzz driver generation at scale,” in Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2019, pp. 975–985.
[72] J. Park, M. Kim, B. Ray, and D.-H. Bae, “An empirical study of supplementary bug fixes,” in 2012 9th IEEE Working Conference on Mining Software Repositories (MSR). IEEE, 2012, pp. 40–49.
[73] Q. Zhang, J. Wang, and M. Kim, “HeteroFuzz: Fuzz testing to detect platform dependent divergence for heterogeneous applications,” in Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ser. ESEC/FSE 2021, 2021, pp. 242–254.
[74] J. De Ruiter and E. Poll, “Protocol state fuzzing of TLS implementations,” in 24th USENIX Security Symposium (USENIX Security 15), 2015, pp. 193–206.
[75] T. Brennan, S. Saha, and T. Bultan, “JVM fuzzing for JIT-induced side-channel detection,” in Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering, 2020, pp. 1011–1023.
[76] S. Gan, C. Zhang, X. Qin, X. Tu, K. Li, Z. Pei, and Z. Chen, “CollAFL: Path sensitive fuzzing,” in 2018 IEEE Symposium on Security and Privacy (SP). IEEE, 2018, pp. 679–696.
[77] C. Wen, H. Wang, Y. Li, S. Qin, Y. Liu, Z. Xu, H. Chen, X. Xie, G. Pu, and T. Liu, “MemLock: Memory usage guided fuzzing,” in 42nd International Conference on Software Engineering. ACM, 2020.
[78] C. Lemieux, R. Padhye, K. Sen, and D. Song, “PerfFuzz: Automatically generating pathological inputs,” in Proceedings of the 27th ACM SIGSOFT International Symposium on Software Testing and Analysis, 2018, pp. 254–265.
[79] Y.-D. Lin, F.-Z. Liao, S.-K. Huang, and Y.-C. Lai, “Browser fuzzing by scheduled mutation and generation of document object models,” in 2015 International Carnahan Conference on Security Technology (ICCST). IEEE, 2015, pp. 1–6.
[80] Q. Zhang, J. Wang, M. A. Gulzar, R. Padhye, and M. Kim, “BigFuzz: Efficient fuzz testing for data analytics using framework abstraction,” in 2020 35th IEEE/ACM International Conference on Automated Software Engineering (ASE), 2020, pp. 722–733.