<<

Factoring using 2n + 2 qubits with Toffoli based modular multiplication

Thomas H¨aner∗ Station Q Quantum Architectures and Computation Group, Microsoft Research, Redmond, WA 98052, USA and Institute for Theoretical Physics, ETH Zurich, 8093 Zurich, Switzerland

Martin Roetteler† and Krysta M. Svore‡ Station Q Quantum Architectures and Computation Group, Microsoft Research, Redmond, WA 98052, USA

We describe an implementation of Shor’s quantum algorithm to factor n-bit integers using only 2n+2 qubits. In contrast to previous space-optimized implementations, ours features a purely Toffoli based modular multiplication circuit. The circuit depth and the overall gate count are in (n3) and (n3 log n), respectively. We thus achieve the same space and time costs as TakahashiO et al. [1], Owhile using a purely classical modular multiplication circuit. As a consequence, our approach evades most of the cost overheads originating from rotation synthesis and enables testing and localization of faults in both, the logical level circuit and an actual quantum hardware implementation. Our new (in-place) constant-adder, which is used to construct the modular multiplication circuit, uses only dirty ancilla qubits and features a circuit size and depth in (n log n) and (n), respectively. O O

I. INTRODUCTION Cuccaro [5] Takahashi [6] Draper [4] Our adder Size Θ(n) Θ(n) Θ(n2) Θ(n log n) Quantum computers offer an exponential speedup over Depth Θ(n) Θ(n) Θ(n) Θ(n) their classical counterparts for solving certain problems, n the most famous of which is Shor’s algorithm [2] for fac- Ancillas n+1 (clean) n (clean) 0 2 (dirty) toring a large number N — an algorithm that enables the breaking of many popular encryption schemes including Table I. Costs associated with various implementations of ad- dition a a + c of a value a by a classical constant c. RSA. At the core of Shor’s algorithm lies a modular ex- | i 7→ | i ponentiation of a constant a by a superposition of values x stored in a register of 2n quantum bits (qubits), where the other hand, yield circuits with as few as 3n + (1) n = log2 N . Denoting the x-register by x and adding 3 O a resultd registere initialized to 0 , this can| bei written as qubits and (n ) size. Such classical reversible circuits | i have severalO advantages over Fourier-based arithmetic. x 0 x ax mod N . In particular, | i | i 7→ | i | i This mapping can be implemented using 2n condi- 1. they can be efficiently simulated on a classical com- tional modular multiplications, each of which can be re- puter, i.e., the logical circuits can be tested on a clas- placed by n (doubly-controlled) modular additions using sical computer, repeated-addition-and-shift [3]. For an illustration of the circuit, see Fig. 1. 2. they can be efficiently debugged when the logical level There are many possible implementations of Shor’s al- circuit is implemented in actual quantum hardware gorithm, all of which offer deeper insight into space/time implementation, and trade-offs by, e.g., using different ways of implementing the circuit for adding a known classical constant c to 3. they do not suffer from the overhead of single-qubit a quantum register a (see Table I). The implementa- rotation synthesis when employing QEC. | i tion given in Ref. [1] features the lowest known number 3 arXiv:1611.07995v1 [quant-ph] 23 Nov 2016 We construct our (n log n)-sized implementation of of qubits and uses Draper’s addition in Fourier space [4], Shor’s algorithm fromO a Toffoli based in-place constant- allowing factoring to be achieved using only 2n+2 qubits 3 4 adder, which adds a classically known n-bit constant c to at the cost of a circuit size in Θ(n log n) or even Θ(n ) the n-qubit quantum register a , i.e., which implements when using exact quantum Fourier transforms (QFT). a 0 a + c where a is an| arbitraryi n-bit input and Furthermore, the QFT circuit features many (controlled) a| +i |ciis 7→ an | n-biti output (the final carry is ignored). rotations, which in turn imply a large T-gate count when Our main technical innovation is to obtain space sav- quantum error-correction (QEC) is required. Implemen- ings by making use of dirty ancilla qubits which the cir- tations using classically-inspired adders as in Ref. [5], on cuit is allowed to borrow during its execution. By a dirty ancilla we mean—in contrast to a clean ancilla which is a qubit that is initialized in a known quantum state—a ∗ [email protected] qubit which can be in an arbitrary state and, in particu- † [email protected] lar, may be entangled with other qubits. In our circuits, ‡ [email protected] whenever such dirty ancilla qubits are borrowed and used 2

0 H H 0 H R1 H 0 H R2n 1 H | i | i ··· | i −

y = 1 a˜2n 1 a˜2n 2 a˜0 y | i − − ··· | i

Figure 1. Circuit for Shor’s algorithm as in [3], using the single-qubit semi-classical quantum Fourier transform from [7]. In 2i total, 2n modular multiplications bya ˜i = a mod N are required (denoted bya ˜i-gates in the circuit). The phase-shift gates 1 0  Pk−1 k−j Rk are given by with θk = π 2 mi, where the sum runs over all previous measurements j and mj 0, 1 0 eiθk − j=0 ∈ { } denotes the respective measurement result (m0 denotes the least significant bit of the final answer and is obtained in the first measurement). as scratch space they are then returned in exactly the the following statements is true: same state as they were in when they were borrowed. Our addition circuit requires (n log n) Toffoli gates O n ai = ci = 1, gi 1 = ai = 1, or gi 1 = ci = 1. and has an overall depth of O(n) when using 2 dirty − − ancillas and (n log n) when using 1 dirty ancilla. Fol- lowing BeauregardO [3], we construct a modular multipli- If c = 1, one must toggle g if a = 1, which can be cation circuit using this adder and report the gate counts i i+1 i achieved by placing a CNOT gate with target g and of Shor’s algorithm in order to compare our implementa- i+1 control a = 1. Furthermore, there may be a carry when tion to the one of Takahashi et al. [1], who used Fourier i ai = 0 but gi 1 = 1. This is easily solved by inverting ai addition as a basic building block. − and placing a Toffoli gate with target gi, conditioned on Our paper is organized as follows: in Section II, we ai and gi 1. If, on the other hand, ci = 0, the only way describe our Toffoli based in-place addition circuit, in- − of generating a carry is for ai = gi 1 = 1, which can be cluding parallelization. We then provide implementation solved with the Toffoli gate from before.− details of the modular addition and the controlled modu- lar multiplier in Section III, where we use the same con- Thus, in summary, one always places the Toffoli gate conditioned on gi 1 and ai, with target gi and, if ci = 1, structions as in [1, 3]. We verify the correctness of our − circuit using simulations and present numerical evidence one first adds a CNOT and a NOT gate. This classical for the correctness of our cost estimates in Section IV. conditioning during circuit-generation time is indicated Finally, in Section V, we provide arguments in favor of by colored gates in Fig. 2. In order to apply the Toffoli having Toffoli based networks in . gate conditioned on the toggling of gi 1, one places it before the potential toggling, and then− again afterwards such that if both are executed, the two gates cancel. Fi- nally, the borrowed dirty qubits and the qubits of a need II. TOFFOLI BASED IN-PLACE ADDITION to be restored to their initial state (except for the high- est bit of a, which now holds the result). This is done by running the entire circuit backwards, ignoring all gates One possible way to construct an (inefficient) adder is acting on an 1. − to note that one can calculate the final bit rn 1 of the result r = a + c using n 1 borrowed dirty− qubits g. One can easily save the qubit g0 in Fig. 2 by condi- By borrowed dirty qubits we− mean that the qubits are in tioning the Toffoli gate acting on g1 directly on the value an unknown initial state and must be returned to this of a0 (instead of testing for toggling of g0). If c0 = 0, state. Takahashi et al. hardwired a classical ripple-carry the two Toffoli gates can be removed altogether since the adder to arrive at a similar circuit, which they used to CNOT acting on g0 would not be present and the two optimize the modular addition in Shor’s algorithm [1]. Toffolis would cancel. If, on the other hand, c0 = 1, the We construct our CARRY circuit from scratch, which two Toffoli gates can be replaced by just one, conditioned allows to save (n) NOT gates as follows. on a0. See Fig. 3 for the complete circuit computing the O last bit of a when adding the constant c = 11. Since there is no way of determining the state of the g-register without measuring, one can only use toggling If one were to iteratively calculate the bits n 2, ..., 1, 0, of qubits to propagate information, as done by Barenco one would arrive at an O(n2)-sized addition circuit− using et al. for the multiply-controlled-NOT using just one bor- n 1 borrowed dirty ancilla qubits. This is the same rowed dirty ancilla qubit [8]. We choose to encode the size− as the Fourier addition circuit [4], unless one uses an carry using such qubits, i.e., the toggling of qubit gi, approximate version of the quantum Fourier transform n which we denote as gi = 1, indicates the presence of a bringing the size down to (n log ε ) [9]. We improve carry from bit i to bit i + 1 when adding the constant c our construction to arrive atO a size in (n log n) in the O to the bits of a. Thus, gi must toggle if (at least) one of next subsection. 3

n a0 a0 d 2 e | i | i x +cL r | Li C C | Li g0 g˜0 n A A | i | i b 2 c xH R +1 R +cH rH | i | i a a˜ R R | 1i | 1i 0 Y Y 0 g g˜ | i | i | 1i | 1i ...... Figure 4. Circuit for adding the constant a to the register x. xH and xL denote the high- and low-bit part of x. The gn 3 g˜n 3 CARRY gate computes the carry of the computation xL + | − i ··· | − i aL into the qubit with initial state 0 , borrowing the xH qubits as dirty ancillae. This carry is then| i taken care of by an an 2 a˜n 2 | − i ··· | − i incrementer gate acting on the high-bits of x. Applying this construction recursively yields an (n log n) addition circuit O gn 2 g˜n 2 with just one ancilla qubit (the 0 qubit in this figure). | − i ··· | − i | i

an 1 rn 1 | − i ··· | − i using n borrowed ancilla qubits in an unknown initial state g : Figure 2. Circuit computing the last bit of r = a + c using | i dirty qubits g. An orange (dark) gate acting on a qubit with x g x g g index i must be dropped if the i-th bit of the constant c is | i | i 7→ | − i | i 0. The entire sequence must be run again in reverse (without x g g0 1 the gates acting on an−1) in order to reset all qubits to their 7→ | − i | − i x g g0 + 1 g0 1 initial value except for rn−1. 7→ | − − i | − i x + 1 g , 7→ | i | i

n a a a a | i | i | 0i | 0i where g0 denotes the two’s-complement of g and g0 1 = − a a | 1i | 1i g, the bit-wise complement of g. A conditional incre-

g g menter can be constructed by either using two controlled C | 0i | 1i adders as explained, or by applying an incrementer to a A c = 11 a2 a2 n 1 | i | i g − g | i R | i register containing both the target and control qubits of ≡ g1 g2 R | i | i the conditional incrementer, where the control qubit is Y a3 a3 | i | i now the least significant bit [10]. Then, the incrementer g g | 2i | 3i can be run on the larger register, followed by a final NOT 0 (a + c) 0 (a + 11) | i | ni | i | 4i gate acting on the control qubit (since it will always be toggled by the incrementer). In the latter version, one Figure 3. Example circuit computing the final carry of r = can either use one more dirty ancilla qubit for cases where a + 11 derived from the construction depicted in Fig. 2. The n mod 2 = 0 or, alternatively, split the incrementer into binary representation of the constant c is c = 11 = 10112, two smaller ones as done in Ref. [10]. We will use the i.e., the orange gates in Fig. 2 acting on qubit index 2 have construction with an extra dirty qubit, since there are been removed since c2 = 0. Furthermore, the optimization plenty of idle qubits available in Shor’s algorithm. mentioned in the text has been applied, allowing to remove In order to make the circuit depicted in Fig. 4 work g0 in Fig. 2. with a borrowed dirty qubit, the incrementer has to be run twice with a conditional inversion in between [10]. The resulting circuit can be seen in Fig. 5. At the lowest A. Serial implementation recursion level, only 1-bit additions are performed, which can be implemented as a NOT gate on xi if ci = 1; all carries are accounted for earlier. An (n log n)-sized addition circuit can be achieved by applyingO a divide-and-conquer scheme to the straight- forward addition idea mentioned above (see Fig. 4), to- gether with the incrementer proposed in [10], which runs B. Runtime analysis of the serial implementation in (n). Since we have many dirty ancillae available in ourO recursive construction, the n-borrowed qubits incre- In the serial version, we always reuse the one borrowed menter in [10] is sufficient: Using the ancilla-free adder by dirty ancilla qubit to hold the output of the CARRY gate, Takahashi [6], which requires no incoming carry, and its which is implemented as shown in Fig. 3. The CARRY reverse to perform subtraction, one can perform the fol- gate has a Toffoli count of Tcarry(n) = 4n + (1) (in- lowing sequence of operations to achieve an incrementer cluding the uncomputation of the ancilla qubits)O and the 4

n d 2 e x +cL r b r%N | Li C C | Li | i | i n A A b 2 c Adda xH +1 R +1 R +cH rH CMP CMP | i | i 0 or 0 R R | i (N a) (a) | i g Y Y g − SubN a | i | i − g g | i | i Figure 5. The circuit of Fig. 4 for the case when the ancilla qubit is dirty (unknown initial state g , left unchanged by the computation). | i Figure 6. Taken from [1]: Construction of a modular adder b r mod N with r = a + b, using a non-modular adder. The| i 7→ CMP | gate comparesi the value in register b to the classical value N a, which we implement using our carry gate. The controlled incrementer using n borrowed dirty qubits fea- result indicates− whether b < N a, i.e., it indicates whether tures a Toffoli count of T (n) = 4n + (1) (2 addi- − incr O we must add a or a N. Finally, the indicator qubit is reset tions). Both of these circuits have to be run twice on to 0 using another− comparison gate. In our implementa- n | i roughly 2 qubits. Therefore, the first part of the recur- tion, the add/subtract operation uses between 1 (serial) and sion has a Toffoli count of T (n) = 8n + (1). The n (parallel) qubits of g. rec O 2 recursion for the Toffoli count Tadd(n) of the entire addi- tion circuit thus yields lnm jnk shift approach, as done in Refs. [1, 3]: T (n) = T + T + T (n) add add 2 add 2 rec n 1 0 = 8n log2 n + (n). (ax) mod N = (a(xn 12 − + + x02 )) mod N O − ··· n 1 = (((a2 − ) mod N)xn 1 + + ax0) , For a controlled addition, only the two CNOT gates − ··· acting on the last bit in Fig. 3 need to be turned into their controlled versions, which is another nice property where xn 1, ..., x0 is the binary expansion of x, and ad- − of this construction. dition is carried out modulo N. Since xi 0, 1 , this can be viewed as modular additions of (a2i∈) mod { N} con- ditioned on xi = 1. The transformation by Takahashi et C. Parallel / Lower-depth version al. [1] allows to construct an efficient modulo-N addition circuit from a non-modular adder. For an illustration of the procedure see Fig. 6, where the comparator can If the underlying hardware supports parallelization, be implemented by applying our carry circuit on the in- one can compute the carries for the additions +c and L verted bits of b. Also, note that it is sufficient to turn the +c in Fig. 4 in parallel, at the cost of one extra qubit in H final CNOT gates (see Fig. 3) of the comparator in Fig. 6 state 0 which will then hold the output of the CARRY into Toffoli gates in order to arrive at controlled modular computation| i of +c . Doing this recursively and noting H addition, since the subsequent add/subtract operation is that there must be at least two qubits of x per CARRY executed conditionally on the output of the comparator. gate, one sees that this circuit can be parallelized at a n The repeated-addition-and-shift algorithm transforms cost of 2 ancilla qubits in state 0 . Using the construc- tion depicted in Fig. 5 allows us| toi use n borrowed dirty the input registers x 0 x (a x) mod N . In Shor’s 2 | i | i 7→ | i | · i qubits instead. To see that this construction can be used algorithm, 2n such modular multiplications are required in our implementation of Shor’s algorithm, consider that and in order to keep the total number of 2n+2 qubits con- during the modular multiplication stant, the uncompute method from Ref. [3] can be used: After swapping the two n-qubit registers, one runs an- x 0 x (ax) mod N , other modular multiplication circuit, but this time using | i | i 7→ | i | i subtraction instead of addition and with a new constant 1 we perform additions into the second register, condi- of multiplication, namely the inverse a− of a modulo N. tioned on the output of the comparator in Fig. 6. There- This achieves the transformation fore, n qubits of the x-register are readily available to be used as borrowed dirty qubits, thus reducing the depth x (ax) mod N (ax) mod N x | i | i 7→ | i | i of our addition circuit to (n). 1 O (ax) mod N (x a− ax) mod N 7→ | i − = (ax) mod N 0 , | i | i III. MODULAR MULTIPLICATION as desired. In total, this procedure requires 2n+1 qubits: The modular multiplier can be constructed from a 2n for the two registers and 1 to achieve the modular modular addition circuit using a repeated-addition-and- addition depicted in Fig. 6. 5

IV. IMPLEMENTATION AND 1 1012 × #Toffoli gates EXPERIMENTAL RESULTS 1 1011 32.01n2 log n + 14.73n2 × 1 1010 × In Shor’s algorithm, a controlled modular multiplier is 1 109 needed for the modular exponentiation which takes the × 1 108 form of a quantum phase estimation, since × 1 107 × n 1 n 2 6 x 2 − xn 1+2 − xn 2 +x0 1 10 a mod N = a − − ··· mod N #Toffoli gates × 100000 n 1 n 2 2 − xn 1 2 − xn 2 x0 = a − a − a , · ··· 10000 1000 where again xi 0, 1 and multiplication is carried ∈ { } 100 out modulo N. Thus, modular exponentiation can be 10 100 1000 10000 achieved using modular multiplications by constantsa ˜i #bits n conditioned on xi = 1, where

Figure 7. Scaling of the Toffoli count TM (n) with bit size n for 2i a˜i = a mod N. the controlled modular multiplier. Each data point represents a modular multiplication run (including uncompute of the x- We do not have to condition our inner-most adders; we register) using n = 2m bits for each of the two registers, with m 3, ..., 13 . can get away with adding two controls to the comparator ∈ { } gates in Fig. 6, which turns the CNOT gates acting on the last bit in Fig. 3 into 3-qubit-controlled-NOT gates, which can be implemented using 4 Toffoli gates and one V. ADVANTAGES OF TOFFOLI CIRCUITS of the idle garbage qubits of g [8]. Note that there are n idle qubits available when performing the controlled A. Single-qubit rotation gate synthesis addition/subtraction in Fig. 6 (n 1 qubits in g plus − the xi qubit the comparator was conditioned upon). The In order to apply a QEC scheme, arbitrary rotation controlled addition/subtraction circuit can thus borrow gates have to be decomposed into sequences of gates from n 2 dirty qubits from the g register to achieve the paral- a (universal) discrete gate set – a process called gate lelism mentioned in subsection II B. synthesis – for which an algorithm such as the ones in We implemented the controlled modular-multiplier Refs. [12–14]. may be used. One example of a universal performing the operation discrete gate set consists of the Pauli gates (σx, σy, σz), 1 0  CNOT, Hadamard, and the T gate 0 eiπ/4 . This syn- x 0 x (ax) mod N 1 | i | i 7→ | i | i thesis implies a growth on the order of Θ(log ε ) in the (ax) mod N 0 total number of gates, where ε denotes the target preci- 7→ | i | i sion of the synthesis. in the LIQUi quantum software architecture [11]. We In space-efficient implementations of Shor’s algorithm |i extended LIQUi by a reversible circuit simulator to en- by Beauregard [3] and Takahashi et al. [1], the angles of |i able large scale simulations of Toffoli based circuits. the approximate QFT (AQFT) require synthesis. Since To test our circuit designs and gate estimates, we sim- the approximation cutoff is introduced at m Θ(log n), ulated our circuits on input sizes of up to 8, 192-bit num- the smallest Fourier angles are θ Θ( 1 ) [15].∈ Hence, m ∈ 2m bers. The scaling results of the Toffoli count Tmult(n) of the target precision ε of the synthesis can be estimated to 1 our controlled modular-multiplier are as expected. Each be in Θ( n ) and, thus, the overall gate count and depth of of the two (controlled) multiplication circuits (namely the previous circuits by Takahashi et al. and Beauregard compute/uncompute) use n (doubly-controlled) modular are in Θ(n3 log2 n) and Θ(n3 log n), respectively. additions. The modular addition is constructed using Toffoli based networks, on the other hand, do not suffer two (controlled) addition circuits, which have a Toffoli from synthesis overhead. A Toffoli gate can be decom- count of Tadd(n) = 8n log n + (n). Thus we have posed exactly into Clifford and T gates using 7 T-gates, 2 O or less if global phases can be ignored [8, 16]. There- 2 2 Tmult(n) = 4nTadd(n) = 32n log n + (n ) . fore, the overall gate count and depth of our circuit do 2 O not change and remain in Θ(n3 log n) and Θ(n3), respec- The experimental data and the fit confirm this ex- tively. pected scaling, as shown in Fig. 7. Since 2n modular multiplications have to be carried out for an entire run of Shor’s algorithm, the overall Toffoli count is B. Design for testability

T (n) = 64n3 log n + (n3) . Shor 2 O In classical computing, thoroughly tested hardware and software components are preferred over the ones that 6 are not, especially for applications where system-failure VI. SUMMARY AND OUTLOOK could have catastrophic effects. The same may be true for quantum computing: Both software and hardware will We present a Toffoli based in-place addition circuit need to be tested in order to guarantee the correctness of which can be used to implement Shor’s algorithm us- each and every component involved in a computation for ing 2n + 2 qubits. Our implementation features a size in building large circuits such as the ones used for factoring (n3 log n), and a depth in (n3). In contrast to pre- using Shor’s algorithm. While a full functional simula- Ovious space-efficient implementationsO [1, 3], our modular tion may be possible for arbitrary circuits up to almost multiplication circuit only consists of Toffoli and Clifford 50 qubits with high-performance simulators run on super- gates. In addition to facilitating the process of debugging computers [17, 18], simulating a moderately-sized future future implementations, having a Toffoli based circuit quantum computer with just 100 qubits is not feasible also eliminates the need for single-qubit-rotation synthe- on a (future) classical computer due to the exponential sis when employing quantum error-correction. This re- scaling of the required resources. For Toffoli networks, sults in a better scaling of both size and depth by a factor on the other hand, classical reversible simulators can be in Θ(log n). used, which run the circuit on a computational basis state Our main technical innovation is the implementation and only update one single state for each gate. This en- of an addition by a constant that can be performed in ables thorough testing of logical level circuits such as the n (n log n) operations and that needs 2 ancillas, all of modular multiplication circuit presented in this paper. whichO can be dirty, i.e., can be taken from other parts of Furthermore, when Toffoli networks are run on actual the computation that are currently idle. quantum hardware, the circuits can be debugged effi- As mentioned in [1], it would be interesting to see ciently and faults can be localized using binary search. whether one can find a linear-time constant-adder which The idea is to run the actual physical implementation of does not require Θ(n) clean ancilla qubits. This would al- the network (followed by final measurement in the com- low to decrease the size of our circuit to its current depth putational basis) on a sample set of basis states which of (n3) without having to increase the total number of serve as test vectors to trigger the faults. As the distri- qubitsO to 3n + 2. Also, similar to [19, 20], it would be bution under correct operation is always close to a delta interesting to find implementations of Shor’s algorithm function in total variation distance, we have an efficient that are geometrically constrained but yet make use of test if an error occurred in the entire network. If so, one dirty ancillas to reduce the overall number of qubits re- can subdivide the Toffoli networks in two parts and recur- quired. sively apply the procedure to both parts, using as input the ideal vectors obtained from the logical level simula- tion. Eventually this will lead to localization of all faults ACKNOWLEDGMENTS that might be present in the implementation, provided that the choice of the initial sample set triggers all faults that might be present in the network. Note that this We thank Alan Geller, Dave Wecker, and Nathan kind of debugging would not be possible for, e.g., QFT- Wiebe for helpful discussions. based addition circuits, as intermediate states might be in superposition of exponentially many basis states.

[1] Yasuhiro Takahashi and Noboru Kunihiro. A quan- arXiv preprint arXiv:0910.2530, 2009. tum circuit for shor’s factoring algorithm using 2n+ 2 [7] Robert B Griffiths and Chi-Sheng Niu. Semiclassical qubits. & Computation, 6(2):184– fourier transform for quantum computation. Physical Re- 192, 2006. view Letters, 76(17):3228, 1996. [2] Peter W Shor. Algorithms for quantum computation: [8] Adriano Barenco, Charles H Bennett, Richard Cleve, Discrete logarithms and factoring. In Foundations of David P DiVincenzo, , , Ty- Computer Science, 1994 Proceedings., 35th Annual Sym- cho Sleator, John A Smolin, and Harald Weinfurter. Ele- posium on, pages 124–134. IEEE, 1994. mentary gates for quantum computation. Physical review [3] Stephane Beauregard. Circuit for shor’s algorithm using A, 52(5):3457, 1995. 2n+ 3 qubits. Quantum Information & Computation, [9] Richard Cleve and John Watrous. Fast parallel circuits 3(2):175–185, 2003. for the quantum fourier transform. In Foundations of [4] Thomas G Draper. Addition on a quantum computer. Computer Science, 2000. Proceedings. 41st Annual Sym- arXiv preprint quant-ph/0008033, 2000. posium on, pages 526–536. IEEE, 2000. [5] Steven A Cuccaro, Thomas G Draper, Samuel A Kutin, [10] Craig Gidney. StackExchange: Creating bigger con- and David Petrie Moulton. A new quantum ripple-carry trolled nots from single qubit, toffoli, and cnot gates, addition circuit. arXiv preprint quant-ph/0410184, 2004. without workspace. 2015. [6] Yasuhiro Takahashi, Seiichiro Tani, and Noboru Kuni- [11] Dave Wecker and Krysta M Svore. LIQUi : A soft- hiro. Quantum addition circuits and unbounded fan-out. ware design architecture and domain-specific|i language 7

for quantum computing. arXiv preprint arXiv:1402.4467, [16] N. Cody Jones. Novel constructions for the fault-tolerant 2014. toffoli gate. Physical Review A, 87:022328, 2013. [12] Vadym Kliuchnikov, Dmitri Maslov, and Michele Mosca. [17] Thomas H¨aner, Damian S Steiger, Mikhail Smelyan- Practical approximation of single-qubit unitaries by skiy, and Matthias Troyer. High performance emulation single-qubit quantum clifford and t circuits. arXiv of quantum circuits. arXiv preprint arXiv:1604.06460, preprint arXiv:1212.6964, 2012. 2016. [13] Neil J Ross and Peter Selinger. Optimal ancilla-free clif- [18] Mikhail Smelyanskiy, Nicolas PD Sawaya, and Al´an ford+ t approximation of z-rotations. arXiv preprint Aspuru-Guzik. qhipster: the quantum high perfor- arXiv:1403.2975, 2014. mance software testing environment. arXiv preprint [14] Alex Bocharov, Martin Roetteler, and Krysta M Svore. arXiv:1601.07195, 2016. Efficient synthesis of probabilistic quantum circuits with [19] Rodney van Meter and Kohei M. Itoh. Fast quantum fallback. Physical Review A, 91(5):052317, 2015. modular exponentiation. Physical Review A, 2005. [15] Adriano Barenco, Artur Ekert, Kalle-Antti Suominen, [20] Samuel A Kutin. Shor’s algorithm on a nearest-neighbor and P¨aiviT¨orm¨a. Approximate quantum fourier trans- machine. arXiv preprint quant-ph/0609001, 2006. form and decoherence. Physical Review A, 54(1):139, 1996.