
Advantages of a modular high-level quantum programming framework

Damian S. Steiger, Thomas Häner, and Matthias Troyer
Theoretische Physik, ETH Zurich, 8093 Zurich, Switzerland

We review some of the features of the ProjectQ software framework and quantify their impact on the resulting circuits. The concise high-level language facilitates implementing even complex algorithms in a very time-efficient manner while, at the same time, providing the compiler with additional information for optimization through code annotation – so-called meta-instructions. We investigate the impact of these annotations for the example of Shor’s algorithm in terms of logical gate counts. Furthermore, we analyze the effect of different intermediate gate sets for optimization and how the dimensions of the resulting circuit depend on a smart choice thereof. Finally, we demonstrate the benefits of a modular compilation framework by implementing mapping procedures for one- and two-dimensional nearest neighbor architectures which we then compare in terms of overhead for different problem sizes.

Keywords: Compilers, Quantum Programming Languages

I. INTRODUCTION

Quantum computers will be able to solve certain problems faster than any classical supercomputer and thus enable finding solutions to problems that are intractable on any future classical computer. Quantum computers are not intended to replace classical technology. Rather, they should be viewed as special-purpose accelerators, similar to today’s GPUs or FPGAs which are running in compute centers to speed up specific applications or subprocesses thereof.

There are many reasons to believe that quantum computers will not replace classical computers. One is that most of the currently pursued technologies to build quantum bits require a vacuum chamber or temperatures on the order of milliKelvin, which makes them bulky and unsuitable for mobile technology. In addition, there are fundamental constraints for the programs running on quantum hardware due to the laws of quantum mechanics. In particular, all operations must be made reversible which incurs a large polynomial overhead in both space and time when translating a classical computation consisting of, e.g., NAND gates to reversible Toffoli gates. Furthermore, to successfully run a quantum program on a physical device, quantum error correction has to be employed in order to reduce the effects of noise on the computation. This causes quantum computers to run at a much lower clock speed than classical ones.

Hence, the focus of the quantum computing research community has been on finding applications for which a quantum algorithm has a large scaling advantage in time-to-solution, also known as a quantum speedup. A handful of such algorithms has been discovered such as the famous algorithm by Peter Shor for factoring integers [1]. This algorithm scales super-polynomially better than the best known classical algorithm and has applications in breaking certain encryption schemes. So far only few examples of algorithms with quantum speedups are known and finding more is a very challenging task which is crucial to the development of the whole community.

In order to determine if quantum algorithms with a scaling advantage can be useful for real applications, it is important to investigate at which problem size the crossover point is reached after which the quantum algorithm has lower runtime, see Fig. 1 for an illustration. If the crossover point is too far out, it might not be practical to use a quantum computer, e.g., if observing any speed advantage requires a runtime of at least the age of the universe [2]. Cost estimation of a quantum program can be achieved time-efficiently using a full stack software framework with a quantum programming language and sophisticated compilers. With these optimizing compilers, such a framework also makes it possible to lower the crossover point even in the absence of large-scale quantum computers. This is crucial in order to leverage the economic potential of small-scale quantum computers as soon as possible.

Figure 1. Illustration of what is typically encountered when comparing a quantum algorithm which exhibits a quantum speed-up to the best classical algorithm in terms of run time. The crossover point, i.e., the problem size after which the quantum algorithm outperforms its classical counterpart, is shown as a red dashed line. (Plot: run time versus problem size for the classical and the quantum algorithm.)

In this paper, we are concerned with the software stack involved in running a quantum program. We provide a partial review of our software methodology [3] which was then implemented resulting in the ProjectQ software framework for quantum computing [4], but with a new focus on some important aspects of the high-level programming language and new mappers. In particular, we show how different intermediate representations can decrease the quantum resources for the example of Shor’s algorithm and quantify the resource improvements by using meta-instructions (code annotation) in the high-level language. We then introduce a new feature of ProjectQ, namely mapping to a linear chain of qubits with nearest neighbor gates or a two-dimensional square grid. Considering the overhead of mapping is important in determining the crossover points of quantum algorithms running on specific architectures. While our mappers scale optimally in terms of circuit depth, there is potential to reduce the constant factors by finding better heuristics. Providing these mappers and applications as open source software makes it possible to incrementally improve their performance. Mappers are not just important in the long run, but crucial in the current Noisy Intermediate-Scale Quantum (NISQ) technology era [5], where quantum resources are very limited. We conclude with an outline of future research with and development of the ProjectQ framework.

Related work

Besides ProjectQ, there have been numerous other contributions in this field of quantum programming languages and compilers. A few of them are available as open source such as Quipper [6], a quantum program compiler implemented in Haskell, the ScaffCC compiler based on the LLVM framework [7], IBM’s QISKit [8], and Rigetti’s pyQuil [9]. Moreover, there are closed source quantum programming languages such as Microsoft’s LIQUi|⟩ [10] or Microsoft’s Q# [11], the latter of which currently allows executing the resulting circuits on a local simulator employing simulation kernels from ETH Zurich [12].

The task of mapping a quantum circuit to a restricted interaction graph has been studied extensively. The aim of this paper is to provide model implementations of mappers in ProjectQ. Future work will be concerned with extending and improving their performance beyond the current state of the art. Mapping to a linear nearest neighbor architecture is discussed in detail for example in [13–15]. Our implementation is similar to Hirata et al. [14] for a linear nearest neighbor architecture. Our algorithm finds qubit placements using a greedy search while theirs also employs more compute-intensive optimizations in order to reduce the total number of swaps. We improve upon their method by using a standard odd-even transposition sort [16] instead of bubble sort for the routing which can reduce the circuit depth by a constant factor using the same number of swaps. Our implementation of a mapper for the two-dimensional square grid follows the description in [17, 18]. It has been known since 1986 that there are sorting networks for square grids which have a worst-case overhead of $3\sqrt{n}$ in circuit depth for a grid with n points, see [19].

arXiv:1806.01861v1 [quant-ph] 5 Jun 2018

II. COMPILATION TO QUANTUM HARDWARE

The goal of a quantum software stack is to compile quantum programs to run them on actual quantum hardware. We give a short overview of a quantum software stack, outlining the challenges involved from a high-level perspective before going into the details in the next sections. For a recent review on quantum programming, see Ref. [21] and more details on our methodology can be found in [3].

A high-level schematic of a large-scale quantum computer is shown in Fig. 2. A quantum computer functions as an accelerator for a classical host computer to solve specific subproblems. The software stack running on the host computer performs the static compilation of a quantum program. This process includes decomposing operations into a low-level logical gate set such as, e.g., the two-qubit CNOT gate and single-qubit rotations. After decomposing the quantum program into a low-level gate set, the compiler on the host computer has to map all operations to a restricted connectivity graph where, e.g., only nearest-neighbor qubits can perform a CNOT gate and hence qubits may have to be routed by swap operations.

Host: classical computation; static compilation of quantum program
Controller: runtime environment including low-level compilation and QEC; feedback with QPU
QPU: quantum instructions; measurement

Figure 2. Different levels of logic in a quantum hardware stack. For some architectures, the number of intermediate hardware levels may vary and lower parts of the stack may reside in a cryostat [20].

Closer to hardware, we imagine a powerful classical controller which provides the runtime software environment. This includes error correction, rotation synthesis,
and magic state distillation [22]. Note that while accelerators such as GPUs are at least partially independent of the host computer, i.e., they have their own operation fetch mechanisms, a quantum computer requires that a classical chip dictates each operation to be executed.

The software stack of today’s quantum hardware is significantly simpler than the above because existing devices feature only a few tens of qubits and there is not yet a distinction between physical and logical qubits. As a consequence, the current experimental setups do not yet require a powerful classical controller and runtime software environment. However, it is possible to start exploring optimization opportunities in some technologies where, e.g., fast measurement feedback is possible. A simple example is shown in Fig. 3, where fast feedback can be used to measure a qubit earlier and hence reduce the effects of decoherence if it is possible to apply quantum operations depending on the measurement outcome. In this paper we will not focus on the classical controller part and the runtime software environment. However, when designing a software framework such as ProjectQ, it is important to be aware of these upcoming changes in order to design the framework accordingly. We discuss what types of interfaces are required from the host computer to the classical controller and how they can be added to ProjectQ and its high-level language.

Figure 3. If fast feedback is available, quantum controls before measurement can be turned into classical controls after measurement. The opposite transformation can be applied if feedback is slow. The same idea carries over to other control instructions such as loops.

Compilation and resource estimation of large quantum algorithms are limited due to performance bottlenecks in some compilers. This can become a problem already today, e.g., when trying to determine crossover points. On the other hand, compilation for current hardware is still sufficiently fast because noise limits quantum programs to circuit depths of below 100 gate operations. Furthermore, since current technologies still support arbitrary single-qubit rotations due to the absence of a quantum error correction protocol, these rotations do not need to be synthesized yet. As a consequence, the first quantum programs contain very few operations and the compiler running on the host computer will not exhibit any performance bottlenecks. As an example, while the decision problem whether we can map a quantum circuit to the connectivity graph of the underlying hardware in less than a specific circuit depth increase is NP-complete [23], for general graphs, we can still find a close to minimal circuit-depth overhead solution using a brute-force approach for near-term hardware. It is important, however, to keep in mind that we should only apply compilation techniques which at least scale to quantum program sizes that we cannot classically simulate because only there, a quantum computer might show an advantage. All smaller programs are just proof of concept along the way toward larger quantum computers.

III. THE PROJECTQ FRAMEWORK

ProjectQ is a full stack, open source software framework which is implemented as an embedded domain-specific language (eDSL) in Python. For an introduction we refer the reader to our release paper [4] and the code examples and tutorials which are available online [24]. ProjectQ defines a high-level language and compiles quantum programs to various backends, including quantum hardware such as the IBM Quantum Experience chips.

To support research in quantum computing, we also bundle various software backends and analysis tools into our framework such as a resource counter, which provides performance information such as the number of gates and circuit depth of the compiled programs. Moreover, for some of the proposed quantum algorithms, for example the variational quantum eigensolver [25], the success probability and/or the scaling with problem size are known asymptotically at best. Therefore, we also require scalable quantum simulators in order to run small quantum programs and extrapolate the performance in the absence of noise. Besides this, quantum simulators allow to find bugs in quantum code in a very pragmatic way. While one may want to aim at proving a program to be correct, the complexity of quantum programs is even higher than of classical distributed programs for which we currently fail to verify even small subroutines such as, e.g., certain locks. As a consequence, we do not expect to be able to theoretically prove the correctness of every quantum program. Rather, we envision a combination of theoretic validation and pragmatic testing to be the approach of choice. Detailed information about our quantum simulator can be found in Ref. [4] and we discuss a highly scalable distributed quantum simulator in Ref. [26]. Because quantum programs in ProjectQ are written in a high-level language, we can furthermore use emulation techniques to significantly speed up the quantum simulator for some algorithms [27].

A. High-level language

We first consider the levels of abstractions in a quantum programming language. Currently there are two main levels of abstractions used in the quantum computing community. On the one hand, quantum algorithm researchers work at the highest level of abstraction, where algorithms and subroutines are often specified in terms of their complexity in big O notation.
This notation does not take into account constant factors which are important to determine crossover points with classical algorithms. On the other hand, researchers who are closer to experiments are implementing small quantum algorithms in the native gates of their hardware technologies. This is in line with recent open source programs introducing so-called quantum assembly languages [28]. Writing quantum programs on such a low level has the benefit of optimally using the available quantum hardware but also comes with the downsides which are encountered in classical computing: it makes writing useful programs a very time-consuming task and the resulting programs are not portable. A high-level programming language, on the other hand, has the advantages of shorter development time, less burden for the programmer to understand all details, and code portability. In our software framework ProjectQ, we provide both approaches, similar to what is done in classical high-performance computing today, where programs are written in high-level languages such as Python or C++ but certain performance bottlenecks are written as inline assembly code. Of course one then partially loses portability. However, we envision the quantum software stack as both application- and hardware-specific. Different hardware-specific functions together with a generic implementation, which works on all systems, can be packaged into a library. For the programmer, a high-level language combined with low-level instructions has the advantage that one does not need to learn a different language for writing application code or implementing a hand-optimized library function specific to a certain hardware technology. It is up to the good judgment of the programmer to choose the right level of abstraction for the task in question.

As an example, consider the following code written in the ProjectQ language:

    CNOT | (control_qubit, qubit)
    MultiplyByConstantModN(a, N) | quint

It is inspired by the bra-ket notation used in physics, i.e., $U|\psi\rangle$, where U is a unitary operation applied to a wavefunction ψ. We use the or operator (|) of Python to achieve a similar syntax. It helps distinguish our eDSL statements from normal Python code and separates the operation with classical parameters on the left from the qubits on the right side of the or operator. In our language, we call an operation applied to specific qubits a command. The first command is a low-level controlled not operation acting on two qubits, while the second command is a high-level multiplication by a constant a modulo N applied to a register of quantum bits which are interpreted as a quantum integer.

Meta-instructions

As a high-level language, ProjectQ contains many modules which are hierarchical combinations of lower-level subroutines. However, a simple concatenation of different circuit blocks would yield suboptimal constructs with a tremendous overhead as we will see in this and the next subsection. Fortunately, this overhead can be avoided using code annotation in the high-level language together with an optimizing compiler as seen in the next subsection. In combination, these two features drastically reduce the quantum resource requirements of a given implementation.

We now consider the combination of two very common design patterns in quantum programs: controlled execution of a subroutine conditioned on the state of a qubit being 1 and a pattern which we called compute/action/uncompute in [3]. The compute/action/uncompute pattern is just a sequence of three unitary operators $U^\dagger V U$, where the first unitary operator is the inverse of the third operator. This is very common when implementing classical functions reversibly on a quantum computer (using Bennett’s trick [29], where the first and third stage correspond to $U$ and $U^\dagger$, respectively) but also in quantum simulation. For examples see the ProjectQ quantum math library or the implementation of the TimeEvolution operator in ProjectQ. When combining these two patterns, there is a very simple optimization which can be done as shown in Fig. 4. For a compiler it would be extremely difficult to find these patterns as both U and V could feature several hundreds of operations. Therefore, we enable the programmer to annotate such design patterns for the compiler. In this particular case, we require that all gates are annotated with the information to which section of this design pattern (compute, action, or uncompute) they belong. In ProjectQ, language constructs to annotate code with additional information are called meta-instructions.

Figure 4. Submodule (module 1) which contains a Compute/Action/Uncompute pattern executed by a higher level module (module 2) conditional on a qubit being in state 1. In general, if a submodule is run controlled on a qubit, the compiler has to control each operation in module 1 on the control qubit of module 2 being in state 1. However, in this scenario the compiler only needs to control the operation V as the other two operations $U$ and $U^\dagger$ result in the identity if V is not applied.

The syntax for the compute/uncompute meta-instruction in ProjectQ is:

    with Compute(eng):
        U | qureg
    V | qureg
    Uncompute(eng)

Our eDSL uses Python’s context handler (with ...) to allow the programmer to specify U as an indented block of instructions. Additionally, the context handler automatically creates the inverse $U^\dagger$ which is applied when calling the Uncompute function and hence this makes the code more compact and less error-prone. If this subroutine now gets executed conditional on some qubit being in state 1, the compiler can easily apply the optimization in Fig. 4. We demonstrate this using an implementation of Shor’s algorithm, see Box 6. In order to maximize the potential for reuse of our implementation, we first implemented the required mathematical functions as gates in ProjectQ, thereby building a small math library for quantum computing. We calculated the resource overheads for small sizes of Shor’s algorithm once with our compute/action/uncompute meta-instruction enabled and once without. The results in Fig. 5 show a reduction of more than 40x in the number of CNOT gates.

Figure 5. Comparison of the number of CNOT gates which result from compilation with and without special handling of compute (C) / uncompute (UC) sections that allows for better optimization of subroutines which are executed controlled on other qubits. The data without C/UC for N ≥ 77 was extrapolated using the CNOT gate counts of the first iteration of Shor’s algorithm with a multiplier of $2\lceil \log_2 N \rceil$, which is the total number of iterations in the iterative quantum phase estimation as implemented in [30]. Since each iteration features a controlled modular multiplication, the difference in gate counts between iterations is minor and results only from the fact that the multiplication is carried out by a different constant. (Plot: number of CNOT gates versus N, with and without C/UC; inset: ratio of the two.)

Shor’s algorithm

Shor’s algorithm finds the prime factors of an n-bit number N. We use an implementation that is based on Beauregard’s 2n+3 qubit circuit for Shor’s algorithm [30] but instead of implementing the entire circuit following the paper, we build a math library which we then use to implement the modular exponentiation routine which achieves

$$|x\rangle |0\rangle^{\otimes n} \mapsto |x\rangle |a^x \bmod N\rangle ,$$

where $a \in [2, \ldots, N-1]$ is a randomly chosen integer, N is the number to factor, and $|x\rangle$ is the input register consisting of 2n qubits in the uniform superposition state $|x\rangle \propto \sum_i |i\rangle$ with $n = \lceil \log_2 N \rceil$. After executing this modular exponentiation subroutine, an inverse quantum Fourier transform is applied, followed by a measurement of all input qubits. The output of this measurement can then be used to determine the period of $a^x \bmod N$ and, in turn, the factors of N [1]. The code can be found in our example algorithms [24] or in the appendix of our ProjectQ paper [4].

There are several optimization opportunities and space/time tradeoffs which have been investigated in several works [31–35]. In the following, we briefly outline the most crucial optimizations performed by Beauregard. First, after decomposing the modular exponentiation of a into modular multiplications by constants $a^{2^i}$, which can be implemented using modular additions, one can use Draper’s addition in Fourier space [36] which requires no ancilla qubits to add a classical constant to a quantum register, thereby saving O(n) qubits. More recent work also eliminates the substantial overhead from the quantum Fourier transform required for this type of adder by constructing a purely Toffoli-based network of depth O(n) [32]. Second, using the circuit identity in Fig. 3 from left to right, the final measurement gates can be pulled through parts of the inverse quantum Fourier transform, which serializes the circuit for modular exponentiation such that only 1 of the 2n qubits of $|x\rangle$ needs to be alive at the same time [37]. One more qubit can be removed by improving the construction of the modular addition circuit [31] and recent work [35] reduces the circuit width by another qubit to a total of 2n + 1.

Box 6. Shor’s algorithm as used in this paper and discussion of algorithmic improvements which have been published.

Classical instructions for quantum computing

We now consider how ProjectQ is currently dealing with classical functions and how it can be extended in the future. A quantum programming language does not just contain quantum gates but also classical operations. There are two different kinds of classical instructions. First, there are classical functions which act on a superposition of inputs and thus must be executed on quantum hardware, an example being modular multiplication in Shor’s algorithm. Second, there are classical control instructions which need to be executed by the classical controller, see Fig. 2.

Many quantum algorithms require the execution of classical functions on a superposition of inputs and therefore, these classical functions have to be translated into reversible operations such as the Toffoli gate. ProjectQ already contains a small quantum math library which specifies such classical functions in our eDSL. For example,

    MultiplyByConstantModN(a, N) | quint

Another approach was taken by RevKit which has recently been integrated into ProjectQ [38]. Instead of extending our eDSL with all the classical math operations, it implements oracle functions in our eDSL which take a Python function as a parameter:

    def classical_function(a, b, c, d):
        return (a and b) ^ (c and d)
    PhaseOracle(classical_function) | qubits

When executed, our compiler traverses the AST of Python to get the definition of the classical function and then uses RevKit to synthesize a reversible version with Toffoli gates. For more information see the paper by Soeken et al. [38].

Let us now turn to the second example where classical functions need to be executed by a classical controller which is located close to hardware. Standard examples are classical control instructions such as loops or repeat-until-success constructs [39]. We have implemented such classical control flow instructions as meta-instructions which means they use the code annotation feature in our eDSL. For example, loops can be specified as:

    with Loop(eng, 10):
        U | qubits

This sends the command to apply U to the qubits to the backend, annotated with the classical control instruction to repeat it 10 times. If the backend does not support loops, the compiler will unroll the loop automatically. Similarly a repeat-until-success of a specific measurement outcome can be added to our eDSL. Because our programming language is embedded into Python, one can also use Python to perform loops or post-process measurement outcomes which then determine the next quantum operations. While this is currently possible when using a simulator as a backend, it is not possible when running the program on a real quantum device as these classical operations need to be executed on a controller because they require a lower latency to the quantum hardware. Our eDSL can be extended to handle such a scenario by introducing more elaborate classical meta-instructions:

    with ClassicalControl(eng, function, qubits):
        # Execute this quantum code

where the classical function can either be specified in Python and then translated to, e.g., C or directly added as a string containing C code. Alternatively, we can add a new syntax keyword for kernel functions and use a custom pre-processor to extract these functions before the Python code is interpreted.

B. Compiler design

In order to keep as much code portability as possible using our mixed approach of high- and low-level instructions, ProjectQ is implemented as a modular framework in which the compiler and intermediate representations can be adapted to an application in order to better optimize the quantum program, see Fig. 7. Individual compiler engines can be combined to make an application- and hardware-specific compiler in order to best use the limited quantum resources. In this section we will show how different intermediate representations can decrease the quantum resources for the example of Shor’s algorithm. In addition, this modularity of the compiler allows for organic growth of the framework and its capabilities. While this happens, standardized interfaces which support several quantum architectures emerge.

ProjectQ’s compiler engines receive the quantum program in linear order and can transform the code before sending it on to the next engine. Because quantum programs become very large when translating them to low-level gates, it quickly becomes impossible for the host computer to store the entire program in memory. As a remedy, the compiler engines in ProjectQ only work on small parts of the code before sending it to the next engine and never require storage of the entire circuit. Despite this locality, global optimizations are enabled through code annotations or automatic local optimizations at higher levels of abstraction. These optimizations can be made more efficient by a smart choice of intermediate gate sets. Fig. 8 and Fig. 9 show how this choice affects the resource requirements. Shor’s algorithm is compiled into a target gate set consisting of the two-qubit CNOT gate and single-qubit gates. To better differentiate the cost of single-qubit gates for a fault-tolerant quantum computer, we choose three categories: Clifford gates, T-gates, and Rz-gates. Using for example the standard surface code error correction scheme, T-gates are more expensive due to magic state distillation and Rz-gates are even more expensive as they will require gate synthesis involving several T-gates [22]. Our compiler did not perform any rotation synthesis so the T-gates originate only from the decomposition of Toffoli gates. In the first setting, we compiled Shor’s algorithm using no intermediate gate set and hence the compiler first decomposes each operation into the target gate set before optimizations take place.

Figure 7. (reprinted from [4]) ProjectQ’s full stack software framework. Users write their quantum programs in a high-level domain-specific language embedded in Python. The quantum program is then sent to the MainEngine, which is the front end of the modular compiler. The compiler consists of individual compiler engines which transform the code to the low-level instruction sets supported by the various back-ends, such as interfaces to quantum hardware, a high-performance quantum simulator and emulator, as well as a circuit drawer and a resource counter. (Diagram: quantum program (eDSL in Python); compiler (main engine, optimizer, translator, optimizer, ..., mapper, back-end interface); back-ends (simulator, emulator, hardware, circuit drawer, resource estimator).)

Figure 8. Comparison between the number of Rz gates which result from compilation with and without an intermediate gate set (IGS), which allows for better optimization of the circuit. See the text for the definition of the gate set and optimization procedure. (Plot: number of Rz gates versus N, with and without IGS; inset: ratio of the two.)

Figure 9. Comparison between the number of CNOT gates which result from compilation with and without an intermediate gate set (IGS), which allows for better optimization of the circuit. See the text for the definition of the gate set and optimization procedure. (Plot: number of CNOT gates versus N, with and without IGS; inset: ratio of the two.)

In the second setting the compiler first decomposes the algorithm into an intermediate gate set (IGS) consisting of n-qubit QFT gates and arbitrary one- and two-qubit gates. We then perform an optimization step before decomposing into the target gate set, at which point we perform another optimization round. With the IGS, the inverse QFT and QFT gates of two successive Draper addition circuits [36] can be canceled easily. This is the main reason for the advantage of using an IGS in this example.

C. Mapping quantum programs to hardware with limited connectivity

A usual abstraction of high-level quantum programming languages is that operations can be performed on any set of qubits. This abstraction is useful as the programmer can focus on logical operations without having to worry about the underlying hardware constraints. Needless to say, it is never a bad idea if a programmer has some of the hardware constraints in mind when writing code. This is true for quantum programming as it is in the classical case, where having a good idea of the memory hierarchy can improve performance by many orders of magnitude.

In this section, we consider mapping a quantum program to a machine model which only allows single- and two-qubit gates, and where the latter can be executed only if the two qubits are neighbors on a given connectivity graph. Disjoint pairs of neighboring qubits can execute operations in parallel and we assume that the connectivity graph has low degree. In quantum computers, the connectivity graph is determined by the hardware and, on future error-corrected machines, by the chosen error correction code. For example, it is possible to build superconducting qubits on a linear nearest neighbor chain because this allows the use of a two-dimensional chip design with control lines coming from the sides. A technologically more advanced design is to have the control lines coming from the third dimension, hence allowing qubits to be connected on a nearest neighbor square grid [40]. While for small experiments with tens of gates and only a few qubits the mapping process could be optimized manually, this will no longer be the case for hardware with more than 50 qubits, especially as early devices might have irregular graphs due to faulty qubits. From the beginning, ProjectQ has been able to map small algorithms to the IBM Quantum Experience chip with 5 qubits. We have extended the interface to facilitate the implementation and optimization of mappers. This will make it possible to benchmark such mappers in the future using the same algorithms, e.g., Shor’s algorithm which is already implemented in ProjectQ or quantum chemistry algorithms currently available in FermiLib [24] or its fork OpenFermion [41].

There are three different and competing performance metrics when optimizing mappers. First, a mapper might try to minimize the number of swap operations required to move qubits. This is a useful metric because swap operations can be implemented using, e.g., three CNOT gates which are noisy on today’s hardware. Second, a mapper can try to reduce the increase in circuit depth by applying as many swap operations in parallel as possible because a lower circuit depth means faster run time. In principle, this also means that qubits require less coherence time but, on the other hand, the increase in swap operations might cancel out this effect. Third, one can increase the number of qubits in order to, e.g., keep the circuit depth unchanged up to a constant factor [42].

[Plot: number of CNOT gates versus N for the Logical, Mapped (1D), and Mapped (2D) circuits; inset: ratios.]

Figure 10. Comparison between the number of CNOT gates before and after enforcing nearest neighbor connectivity for a 1D and 2D grid.
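The first metric above counts swaps in units of CNOT gates. As a minimal sketch in plain Python (no ProjectQ calls; the function names `cnot`, `swap_via_three_cnots`, and `is_allowed_on_chain` are hypothetical and chosen only for illustration), the following checks the three-CNOT decomposition of a swap on all two-qubit basis states and expresses the nearest-neighbor constraint for a linear chain. Because CNOT and swap are permutations of the computational basis, agreement on every basis state is enough to show that the two circuits are equal.

```python
from itertools import product


def cnot(bits, control, target):
    """Apply a CNOT to a computational basis state given as a tuple of bits."""
    out = list(bits)
    out[target] ^= out[control]
    return tuple(out)


def swap_via_three_cnots(bits, a, b):
    """Swap positions a and b using CNOT(a,b), CNOT(b,a), CNOT(a,b)."""
    for control, target in ((a, b), (b, a), (a, b)):
        bits = cnot(bits, control, target)
    return bits


def is_allowed_on_chain(gate_positions):
    """Assumed machine model for a linear chain: single-qubit gates are always
    allowed, two-qubit gates only between adjacent chain positions."""
    if len(gate_positions) == 1:
        return True
    first, second = gate_positions
    return abs(first - second) == 1


# The three-CNOT sequence acts as a swap on every basis state of two qubits.
for state in product((0, 1), repeat=2):
    assert swap_via_three_cnots(state, 0, 1) == (state[1], state[0])
```

Under this model, a gate on chain positions (0, 2) is rejected until a swap has moved one of the two qubits next to the other, which is where the CNOT overhead comes from.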
We have implemented a mapper for a linear nearest neighbor chain topology and a two-dimensional square grid. For both mappers, we focus on reducing the circuit depth without increasing the number of qubits. Our mappers perform the mapping in three distinct phases:

1. Find a qubit placement on the hardware graph which puts interacting qubits next to each other
2. Route the qubits from the old positions to the new positions using swap operations
3. Apply all the operations which act on single or neighboring pairs of qubits

This procedure is repeated until all commands have been executed.

We find the next qubit placement by a greedy search. Our heuristic tries to apply the first commands it encounters by building a linear chain of qubits. For the two-dimensional square grid, this linear chain of qubits is then embedded into the two-dimensional grid using a snake pattern. Our routing schemes are both asymptotically optimal. For a linear chain with n qubits, our scheme uses a standard odd-even transposition sort [16] which has a worst case circuit depth of n swap operations. For the two-dimensional square grid with n qubits we use the technique of [17, 18], which has a worst case depth of 3√n swaps if the grid has an equal number of rows and columns (it also works for rectangular grids), see Fig. 12 for a graphical explanation of the algorithm. It is easy to see why (up to constants) the circuit depth overheads of our schemes are optimal, as the furthest distance of two qubits on a linear chain is n and on a two-dimensional square grid it is 2√n.

We applied these two mappers to the circuit resulting from the first loop iteration of Shor’s algorithm in Box 6 (for an n bit number, there are 2n such iterations which are almost identical). See Fig. 10 and Fig. 11 for a comparison of the number of CNOT gates and the circuit depth, respectively, between all-to-all connectivity, a linear nearest neighbor chain, and a two-dimensional square grid. Both of these mappers are available in ProjectQ such that one can improve and test them on more algorithms.

[Plot: circuit depth versus N for the Logical, Mapped (1D), and Mapped (2D) circuits; inset: ratios.]

Figure 11. Comparison between the circuit depth before and after enforcing nearest neighbor connectivity for a 1D and 2D grid.
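The routing on a linear chain and the snake embedding described above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the ProjectQ implementation: the names `odd_even_transposition_route` and `snake_embedding` are hypothetical, and the routing input is assumed to be a list whose entry i is the destination position of the qubit currently at chain position i.

```python
def odd_even_transposition_route(destinations):
    """Return layers of nearest-neighbor swaps that sort the chain.

    Each layer is a list of disjoint pairs (i, i + 1) that can be swapped in
    parallel. At most len(destinations) layers are produced.
    """
    keys = list(destinations)
    n = len(keys)
    layers = []
    for round_index in range(n):
        layer = []
        # Even rounds compare pairs (0,1), (2,3), ...; odd rounds (1,2), (3,4), ...
        for i in range(round_index % 2, n - 1, 2):
            if keys[i] > keys[i + 1]:
                keys[i], keys[i + 1] = keys[i + 1], keys[i]
                layer.append((i, i + 1))
        if layer:  # rounds without any swap do not add circuit depth
            layers.append(layer)
    return layers


def snake_embedding(chain_index, num_columns):
    """Map a chain position to (row, column) so that consecutive chain
    positions are neighbors on the grid: even rows run left to right,
    odd rows run right to left."""
    row, offset = divmod(chain_index, num_columns)
    column = offset if row % 2 == 0 else num_columns - 1 - offset
    return row, column


# Reversing a chain of five qubits is a worst case for this scheme.
layers = odd_even_transposition_route([4, 3, 2, 1, 0])
assert len(layers) <= 5
```

Dropping empty rounds keeps the even/odd alternation intact while not counting rounds in which no swap is needed, so the number of returned layers is a direct estimate of the swap depth.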

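[Figure 12 graphic. Panels a) to d) show qubit placements on a 3 × 3 grid (rows listed top to bottom): a) 0 3 2 / 4 1 5 / 6 7 8; b) 6 7 2 / 4 1 8 / 0 3 5; c) 7 6 2 / 8 4 1 / 5 3 0; d) 7 6 2 / 8 3 1 / 5 4 0. Panel e) shows the bipartite graph between the initial columns u0, u1, u2 and the final columns v0, v1, v2, with its edges partitioned into matching 0, matching 1, and matching 2.]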

Figure 12. Routing procedure for 2D nearest-neighbor connectivity. The goal is to change the qubit placement depicted in a) to the new qubit placement depicted in d). The qubits are numbered 0,...,8 and colored according to the final column to which they need to be moved. The routing procedure has three steps. First, one permutes qubits within each column, a)→b), such that the qubits in each row have unique colors; we explain below how to achieve this. Second, one permutes qubits within each row, b)→c), so that every qubit is in its correct final column. Third, one again permutes qubits within each column, c)→d), in order to arrive at the final qubit placement. All permutations within rows and columns are performed in parallel using an odd-even transposition sort [16]. As rows and columns each have √n qubits, this routing has a worst-case circuit depth of 3√n steps. The only tricky part of this procedure is step 1, which has been described in [17]. One builds a bipartite graph (U,V,E) which has as many nodes ui and vj as the square grid has columns, in our case √n. An edge is added between nodes ui and vj for each qubit initially in column i which ends up in column j. This creates a √n-regular bipartite graph. Hall’s marriage theorem shows that a regular bipartite graph always contains a perfect matching [43]. One finds a perfect matching in this bipartite graph (U,V,E) and labels it 0, then removes the edges contained in this perfect matching, arriving at a (√n − 1)-regular bipartite graph, finds again a perfect matching and labels it 1, and so on until one has √n matchings and no more edges in the graph. Using these matchings, it is possible to find the desired qubit placement of b) such that qubits in the same row have unique colors. Matching 0 determines which qubits go into row 0 within each column: it contains for each column i one edge going from node ui to some node vj, which means that we move a qubit in column i with final column destination j (marked by the color) to row 0. We then continue by assigning elements within each column to row 1 and so on. Hence, by permuting only within each column, we arrive at a qubit placement b) in which each row contains differently colored qubits, and the rest of the routing is then trivial.
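Step 1 of the caption can be made concrete with a short plain-Python sketch. It is an illustration under assumptions, not the ProjectQ code: the function names are hypothetical, perfect matchings are found with a simple augmenting-path search (the caption does not prescribe a matching algorithm), and the grid is assumed to be square. The example data are the placements a) and d) as read off Fig. 12; the matchings found here may differ from those drawn in the figure, so the resulting placement need not coincide with panel b).

```python
def perfect_matching(edges, size):
    """edges[i] lists (j, qubit) pairs from initial column i to final column j.
    Returns chosen[i] = (j, qubit) forming a perfect matching."""
    owner = [None] * size   # owner[j]: initial column matched to final column j
    chosen = [None] * size  # chosen[i]: edge used by initial column i

    def try_assign(i, visited):
        for j, qubit in edges[i]:
            if j in visited:
                continue
            visited.add(j)
            # Take j if it is free, or if its current owner can be re-routed.
            if owner[j] is None or try_assign(owner[j], visited):
                owner[j] = i
                chosen[i] = (j, qubit)
                return True
        return False

    for i in range(size):
        if not try_assign(i, set()):
            raise ValueError("no perfect matching: graph is not regular")
    return chosen


def arrange_columns(placement, final_column):
    """Permute qubits within each column so that every row contains exactly
    one qubit per final column (step a) -> b) of the routing)."""
    size = len(placement)
    edges = [[(final_column[placement[r][c]], placement[r][c])
              for r in range(size)] for c in range(size)]
    arranged = [[None] * size for _ in range(size)]
    for row in range(size):  # matching number `row` fills row `row`
        chosen = perfect_matching(edges, size)
        for col, (dest, qubit) in enumerate(chosen):
            arranged[row][col] = qubit
            edges[col].remove((dest, qubit))  # leaves a regular graph of lower degree
    return arranged


initial = [[0, 3, 2], [4, 1, 5], [6, 7, 8]]  # placement a) of Fig. 12
final = [[7, 6, 2], [8, 3, 1], [5, 4, 0]]    # placement d) of Fig. 12
final_column = {q: c for row in final for c, q in enumerate(row)}
arranged = arrange_columns(initial, final_column)
for row in arranged:
    assert sorted(final_column[q] for q in row) == [0, 1, 2]
```

After this step, sorting every row by `final_column` puts each qubit into its final column, and a last sort within each column by final row completes the routing.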

IV. OUTLOOK

ProjectQ development continues to add and improve features, in particular more advanced heuristics for the mappers. From a performance point of view, ProjectQ is fast enough for the near term devices but we will continue to move performance heavy tasks to a high-performance C++ implementation. The only current performance bottleneck is estimating resources for large-scale quantum algorithms but this will be solved by introducing a hierarchical resource counter.

ACKNOWLEDGMENTS

We acknowledge support by the Swiss National Science Foundation and the Swiss National Competence Center for Research QSIT. D.S.S. acknowledges very useful discussions about mappers with , Harry Buhrman, Stephen Brierley, and Torsten Hoefler.

[1] Peter W Shor, “Algorithms for quantum computation: Discrete logarithms and factoring,” in Foundations of Computer Science, 1994 Proceedings., 35th Annual Symposium on (IEEE, 1994) pp. 124–134.
[2] D. Adams, The Hitchhiker’s Guide to the Galaxy, Hitchhiker’s Guide to the Galaxy (Guild Publishing, 1984).
[3] Thomas Häner, Damian S. Steiger, Krysta Svore, and Matthias Troyer, “A software methodology for compiling quantum programs,” Quantum Science and Technology 3, 020501 (2018).
[4] Damian S. Steiger, Thomas Häner, and Matthias Troyer, “ProjectQ: an open source software framework for quantum computing,” Quantum 2, 49 (2018).
[5] John Preskill, “Quantum computing in the nisq era and beyond,” arXiv preprint arXiv:1801.00862 (2018).
[6] Alexander S. Green, Peter LeFanu Lumsdaine, Neil J. Ross, Peter Selinger, and Benoît Valiron, “Quipper: a scalable quantum programming language,” in ACM SIGPLAN Notices, Vol. 48 (ACM, 2013) pp. 333–342.
[7] Ali JavadiAbhari, Shruti Patil, Daniel Kudrow, Jeff Heckey, Alexey Lvov, Frederic T. Chong, and Margaret Martonosi, “Scaffcc: a framework for compilation and analysis of quantum computing programs,” in Proceedings of the 11th ACM Conference on Computing Frontiers (ACM, 2014) p. 1.
[8] “IBM QISKit,” https://qiskit.org.
[9] Robert S. Smith, Michael J. Curtis, and William J. Zeng, “A practical quantum instruction set architecture,” arXiv preprint arXiv:1608.03355 (2016).
[10] Dave Wecker and Krysta M. Svore, “LIQUi|⟩: A software design architecture and domain-specific language for quantum computing,” arXiv preprint arXiv:1402.4467 (2014).
[11] Krysta Svore, Alan Geller, Matthias Troyer, John Azariah, Christopher Granade, Bettina Heim, Vadym Kliuchnikov, Mariia Mykhailova, Andres Paz, and Martin Roetteler, “Q#: Enabling scalable quantum computing and development with a high-level dsl,” in Proceedings of the Real World Domain Specific Languages Workshop 2018 (ACM, 2018) p. 7.
[12] Q# simulator kernels developed by ETH Zurich, https://marketplace.visualstudio.com/items/quantum.DevKit/license, Accessed: 5 June 2018.
[13] Mehdi Saeedi, Robert Wille, and Rolf Drechsler, “Synthesis of quantum circuits for linear nearest neighbor architectures,” Quantum Information Processing 10, 355–377 (2011).
[14] Yuichi Hirata, Masaki Nakanishi, Shigeru Yamashita, and Yasuhiko Nakashima, “An efficient conversion of quantum circuits to a linear nearest neighbor architecture,” Quantum Information & Computation 11, 142–166 (2011).
[15] Alireza Shafaei, Mehdi Saeedi, and Massoud Pedram, “Optimization of quantum circuits for interaction distance in linear nearest neighbor architectures,” in Proceedings of the 50th Annual Design Automation Conference (ACM, 2013) p. 41.
[16] Nico Habermann, “Parallel neighbor-sort (or the glory of the induction principle),” Computer Science Report, Carnegie-Mellon University, Pittsburgh (1972).
[17] Stephen Brierley, “Efficient implementation of quantum circuits with limited qubit interactions,” Quantum Info. Comput. 17, 1096–1104 (2017).
[18] Mario Szegedy, private communication.
[19] C. P. Schnorr and A. Shamir, “An optimal sorting algorithm for mesh connected computers,” in Proceedings of the Eighteenth Annual ACM Symposium on Theory of Computing, STOC ’86 (ACM, New York, NY, USA, 1986) pp. 255–263.
[20] J. M. Hornibrook, J. I. Colless, I. D. Conway Lamb, S. J. Pauka, H. Lu, A. C. Gossard, J. D. Watson, G. C. Gardner, S. Fallahi, M. J. Manfra, and D. J. Reilly, “Cryogenic control architecture for large-scale quantum computing,” Physical Review Applied 3, 024010 (2015).
[21] Frederic T Chong, Diana Franklin, and Margaret Martonosi, “Programming languages and compiler design for realistic quantum hardware,” Nature 549, 180 (2017).
[22] Earl T. Campbell, Barbara M. Terhal, and Christophe Vuillot, “Roads towards fault-tolerant universal quantum computation,” Nature 549, 172 (2017).
[23] Indranil Banerjee and Dana Richards, “Routing and sorting via matchings on graphs,” arXiv preprint arXiv:1604.04978 (2016).
[24] “ProjectQ,” www.projectq.ch.
[25] Alberto Peruzzo, Jarrod McClean, Peter Shadbolt, Man-Hong Yung, Xiao-Qi Zhou, Peter J. Love, Alán Aspuru-Guzik, and Jeremy L. O’Brien, “A variational eigenvalue solver on a photonic quantum processor,” Nature communications 5, 4213 (2014).
[26] Thomas Häner and Damian S Steiger, “0.5 petabyte simulation of a 45-qubit quantum circuit,” in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (ACM, 2017) p. 33.
[27] Thomas Häner, Damian S. Steiger, Mikhail Smelyanskiy, and Matthias Troyer, “High performance emulation of quantum circuits,” in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC ’16 (IEEE Press, Piscataway, NJ, USA, 2016) pp. 74:1–74:9.
[28] Andrew W. Cross, Lev S. Bishop, John A. Smolin, and Jay M. Gambetta, “Open quantum assembly language,” arXiv preprint arXiv:1707.03429 (2017).
[29] C. H. Bennett, “Logical reversibility of computation,” IBM J. Res. Dev. 17, 525–532 (1973).
[30] Stephane Beauregard, “Circuit for shor’s algorithm using 2n+3 qubits,” Quantum Info. Comput. 3, 175–185 (2003).
[31] Yasuhiro Takahashi and Noboru Kunihiro, “A quantum circuit for shor’s factoring algorithm using 2n + 2 qubits,” Quantum Info. Comput. 6, 184–192 (2006).
[32] Thomas Häner, Martin Roetteler, and Krysta M. Svore, “Factoring using 2n + 2 qubits with toffoli based modular multiplication,” Quantum Info. Comput. 17, 673–684 (2017).
[33] Samuel A Kutin, “Shor’s algorithm on a nearest-neighbor machine,” arXiv preprint quant-ph/0609001 (2006).
[34] Rodney Van Meter, Kohei M. Itoh, and Thaddeus D. Ladd, “Architecture-dependent execution time of shor’s algorithm,” in MS+S 2006 - Controllable Quantum States: Mesoscopic Superconductivity and Spintronics, Proceedings of the International Symposium (World Scientific Publishing Co. Pte Ltd, 2008) pp. 183–188.
[35] Craig Gidney, “Factoring with n+2 clean qubits and n-1 dirty qubits,” arXiv preprint arXiv:1706.07884 (2017).
[36] Thomas G Draper, “Addition on a quantum computer,” arXiv preprint quant-ph/0008033 (2000).
[37] Robert B. Griffiths and Chi-Sheng Niu, “Semiclassical fourier transform for quantum computation,” Phys. Rev. Lett. 76, 3228–3231 (1996).
[38] Mathias Soeken, Thomas Haener, and Martin Roetteler, “Programming quantum computers using design automation,” in Design, Automation & Test in Europe Conference & Exhibition (DATE), 2018 (IEEE, 2018) pp. 137–146.
[39] Alex Bocharov, Martin Roetteler, and Krysta M. Svore, “Efficient synthesis of universal repeat-until-success quantum circuits,” Phys. Rev. Lett. 114, 080502 (2015).
[40] B. Foxen, J. Y. Mutus, E. Lucero, R. Graff, A. Megrant, Yu Chen, C. Quintana, B. Burkett, J. Kelly, E. Jeffrey, Yan Yang, Anthony Yu, K. Arya, R. Barends, Zijun Chen, B. Chiaro, A. Dunsworth, A. Fowler, C. Gidney, M. Giustina, T. Huang, P. Klimov, M. Neeley, C. Neill, P. Roushan, D. Sank, A. Vainsencher, J. Wenner, T. C. White, and John M. Martinis, “Qubit compatible superconducting interconnects,” Quantum Science and Technology 3, 014005 (2018).
[41] Jarrod R. McClean, Ian D. Kivlichan, Damian S. Steiger, Yudong Cao, E. Schuyler Fried, Craig Gidney, Thomas Häner, Vojtěch Havlíček, Zhang Jiang, Matthew Neeley, et al., “Openfermion: The electronic structure package for quantum computers,” arXiv preprint arXiv:1710.07629 (2017).
[42] David Rosenbaum, “Optimal quantum circuits for nearest-neighbor architectures,” arXiv preprint arXiv:1205.0036 (2012).
[43] P. Hall, “On representatives of subsets,” Journal of the London Mathematical Society s1-10, 26–30.