
arXiv:quant-ph/0408006v2 29 Mar 2005

Fast Quantum Modular Exponentiation

Rodney Van Meter(1,∗) and Kohei M. Itoh(1)
(1) Graduate School of Science and Technology, Keio University and CREST-JST
3-14-1 Hiyoushi, Kohoku-ku, Yokohama-shi, Kanagawa 223-8522, Japan
(Dated: submitted July 28, 2004; revised Feb. 9, 2005; accepted Mar. 4, 2005.)

We present a detailed analysis of the impact on quantum modular exponentiation of architectural features and possible concurrent gate execution. Various arithmetic algorithms are evaluated for execution time, potential concurrency, and space tradeoffs. We find that to exponentiate an $n$-bit number, for storage space $100n$ (twenty times the minimum $5n$), we can execute modular exponentiation two hundred to seven hundred times faster than optimized versions of the basic algorithms, depending on architecture, for $n = 128$. Addition on a neighbor-only architecture is limited to $O(n)$ time while non-neighbor architectures can reach $O(\log n)$, demonstrating that physical characteristics of a computing device have an important impact on both real-world running time and asymptotic behavior. Our results will help guide experimental implementations of quantum algorithms and devices.

PACS numbers: 03.67.Lx, 07.05.Bx, 89.20.Ff

∗ Electronic address: [email protected]

I. INTRODUCTION

Research in quantum computing is motivated by the possibility of enormous gains in computational time [1, 2, 3, 4]. The process of writing programs for quantum computers naturally depends on the architecture, but the application of classical computer architecture principles to the architecture of quantum computers has only just begun.

Shor's algorithm for factoring large numbers in polynomial time is perhaps the most famous result to date in the field [1]. Since this algorithm is well defined and important, we will use it as an example to examine the relationship between architecture and program efficiency, especially parallel execution of quantum algorithms.

Shor's factoring algorithm consists of two main parts, quantum modular exponentiation, followed by the quantum Fourier transform. In this paper we will concentrate on the quantum modular exponentiation, both because it is the most computationally intensive part of the algorithm, and because arithmetic circuits are fundamental building blocks we expect to be useful for many algorithms.

Fundamentally, quantum modular exponentiation is $O(n^3)$; that is, the number of quantum gates or operations scales with the cube of the length in bits of the number to be factored [5, 6, 7]. It consists of $2n$ modular multiplications, each of which consists of $O(n)$ additions, each of which requires $O(n)$ operations. However, $O(n^3)$ operations do not necessarily require $O(n^3)$ time steps. On an abstract machine, it is relatively straightforward to see how to reduce each of those three layers to $O(\log n)$ time steps, in exchange for more space and more total gates, giving a total running time of $O(\log^3 n)$ if $O(n^3)$ qubits are available and an arbitrary number of gates can be executed concurrently on separate qubits. Such large numbers of qubits are not expected to be practical for the foreseeable future, so much interesting engineering lies in optimizing for a given set of constraints. This paper quantitatively explores those tradeoffs.

This paper is intended to help guide the design and experimental implementations of actual quantum computing devices as the number of qubits grows over the next several generations of devices. Depending on the post-quantum error correction, application-level effective clock rate for a specific technology, choice of exponentiation algorithm may be the difference between hours of computation time and weeks, or between seconds and hours. This difference, in turn, feeds back into the system requirements for the necessary strength of error correction and coherence time.

The Schönhage-Strassen multiplication algorithm is often quoted in quantum computing research as being $O(n \log n \log\log n)$ for a single multiplication [8]. However, simply citing Schönhage-Strassen without further qualification is misleading for several reasons. Most importantly, the constant factors matter [42]: quantum modular exponentiation based on Schönhage-Strassen is only faster than basic $O(n^3)$ algorithms for more than approximately 32 kilobits. In this paper, we will concentrate on smaller problem sizes, and exact, rather than $O(\cdot)$, performance.

Concurrent quantum computation is the execution of more than one quantum gate on independent qubits at the same time. Utilizing concurrency, the latency, or circuit depth, to execute a number of gates can be smaller than the number itself. Circuit depth is explicitly considered in Cleve and Watrous' parallel implementation of the quantum Fourier transform [9], Gossett's quantum carry-save arithmetic [10], and Zalka's Schönhage-Strassen-based implementation [11]. Moore and Nilsson define the computational complexity class QNC to describe certain parallelizable circuits, and show which gates can be performed concurrently, proving that any circuit composed exclusively of Control-NOTs (CNOTs) can be parallelized to be of depth $O(\log n)$ using $O(n^2)$ ancillae on an abstract machine [12].

We analyze two separate architectures, still abstract but with some important features that help us understand performance. For both architectures, we assume any qubit can be the control or target for only one gate at a time. The first, the AC, or Abstract Concurrent, architecture, is our abstract model. It supports CCNOT (the three-qubit Toffoli gate, or Control-

Control-NOT), arbitrary concurrency, and gate operands any distance apart without penalty. It does not support arbitrary control strings on control operations, only CCNOT with two ones as control. The second, the NTC, or Neighbor-only, Two-qubit-gate, Concurrent architecture, is similar but does not support CCNOT, only two-qubit gates, and assumes the qubits are laid out in a one-dimensional line, and only neighboring qubits can interact. The 1D layout will have the highest communications costs among possible physical topologies. Most real, scalable architectures will have constraints with this flavor, if different details, so AC and NTC can be viewed as bounds within which many real architectures will fall. The layout of variables on this structure has a large impact on performance; what is presented here is the best we have discovered to date, but we do not claim it is optimal.

The NTC model is a reasonable description of several important experimental approaches, including a one-dimensional chain of quantum dots [13], the original Kane proposal [14], and the all-silicon NMR device [15]. Superconducting qubits [16, 17] may map to NTC, depending on the details of the qubit interconnection.

The difference between AC and NTC is critical; beyond the important constant factors as nearby qubits shuffle, we will see in section III B that AC can achieve $O(\log n)$ performance where NTC is limited to $O(n)$.

For NTC, which does not support CCNOT directly, we compose CCNOT from a set of five two-qubit gates [18], as shown in figure 1. The box with the bar on the right represents the square root of X,

$$ \sqrt{X} = \frac{1}{2}\begin{pmatrix} 1+i & 1-i \\ 1-i & 1+i \end{pmatrix} $$

and the box with the bar on the left its adjoint. We assume that this gate requires the same execution time as a CNOT.

FIG. 1: CCNOT constructions for our architectures AC and NTC. The box with the bar on the right represents the square root of X, and the box with the bar on the left its adjoint. Time flows left to right, each horizontal line represents a qubit, and each vertical line segment is a quantum gate.

Section II reviews Shor's algorithm and the need for modular exponentiation, then summarizes the techniques we employ to accelerate modular exponentiation. The next subsection introduces the best-known existing modular exponentiation algorithms and several different adders. Section III begins by examining concurrency in the lowest level elements, the adders. This is followed by faster adders and additional techniques for accelerating modulo operations and exponentiation. Section IV shows how to balance these techniques and apply them to a specific architecture and set of constraints. We evaluate several complete algorithms for our architectural models. Specific gate latency counts, rather than asymptotic values, are given for 128 bits and smaller numbers.

II. BASIC CONCEPTS

A. Modular Exponentiation and Shor's Algorithm

Shor's algorithm for factoring numbers on a quantum computer uses the quantum Fourier transform to find the order $r$ of a randomly chosen number $x$ in the multiplicative group (mod $N$). This is achieved by exponentiating $x$, modulo $N$, for a superposition of all possible exponents $a$. Therefore, efficient arithmetic algorithms to calculate modular exponentiation in the quantum domain are critical.

Quantum modular exponentiation is the evolution of the state of a quantum computer to hold

$$ |\psi\rangle|0\rangle \rightarrow |\psi\rangle|x^{\psi} \bmod N\rangle \tag{1} $$

When $|\psi\rangle$ is the superposition of all input states $a$ up to a particular value $2N^2$,

$$ |\psi\rangle = \frac{1}{N\sqrt{2}} \sum_{a=0}^{2N^2} |a\rangle \tag{2} $$

The result is the superposition of the modular exponentiation of those input states,

$$ \frac{1}{N\sqrt{2}} \sum_{a=0}^{2N^2} |a\rangle|0\rangle \rightarrow \frac{1}{N\sqrt{2}} \sum_{a=0}^{2N^2} |a\rangle|x^{a} \bmod N\rangle \tag{3} $$

Depending on the algorithm chosen for modular exponentiation, $x$ may appear explicitly in a register in the quantum computer, or may appear only implicitly in the choice of instructions to be executed.

In general, quantum modular exponentiation algorithms are created from building blocks that do modular multiplication,

$$ |\alpha\rangle|0\rangle \rightarrow |\alpha\rangle|\alpha\beta \bmod N\rangle \tag{4} $$

where $\beta$ and $N$ may or may not appear explicitly in quantum registers. This modular multiplication is built from blocks that perform modular addition,

$$ |\alpha\rangle|0\rangle \rightarrow |\alpha\rangle|\alpha + \beta \bmod N\rangle \tag{5} $$

which, in turn, are usually built from blocks that perform addition and comparison.

Addition of two $n$-bit numbers requires $O(n)$ gates. Multiplication of two $n$-bit numbers (including modular multiplication) combines the convolution partial products (the one-bit products) of each pair of bits from the two arguments. This requires $O(n)$ additions of $n$-bit numbers, giving a gate count of $O(n^2)$. Our exponentiation for Shor's algorithm requires $2n$ multiplications, giving a total cost of $O(n^3)$.

Many of these steps can be conducted in parallel; in classical computer system design, the latency or circuit depth, the time from the input of values until the output becomes available, is as important as the total computational complexity.

Concurrency is the execution of more than one gate during the same execution time slot. We will refer to the number of gates executing in a time slot as the concurrency or the concurrency level. Our goal through the rest of the paper is to exploit parallelism, or concurrency, to shorten the total wall clock time to execute modular exponentiation, and hence Shor's algorithm.

The algorithms as described here run on logical qubits, which will be encoded onto physical qubits using quantum error correction (QEC) [19]. Error correction processes are generally assumed to be applied in parallel across the entire machine. Executing gates on the encoded qubits, in some cases, requires additional ancillae, so multiple concurrent logical gates will require growth in physical qubit storage space [20, 21]. Thus, both physical and logical concurrency are important; in this paper we consider only logical concurrency.

B. Notation and Techniques for Speeding Up Modular Exponentiation

In this paper, we will use $N$ as the number to be factored, and $n$ to represent its length in bits. For convenience, we will assume that $n$ is a power of two, and the high bit of $N$ is one. $x$ is the random value, smaller than $N$, to be exponentiated, and $|a\rangle$ is our superposition of exponents, with $a < 2N^2$ so that the length of $a$ is $2n + 1$ bits.

When discussing circuit cost, the notation is (CCNOTs; CNOTs; NOTs) or (CNOTs; NOTs). The values may be total gates or circuit depth (latency), depending on context. The notation is sometimes enhanced to show required concurrency and space, (CCNOTs; CNOTs; NOTs)#(concurrency; space).

$t$ is time, or latency to execute an algorithm, and $S$ is space, subscripted with the name of the algorithm or circuit subroutine. When $t$ or $S$ is superscripted with AC or NTC, the values are for the latency of the construct on that architecture. Equations without superscripts are for an abstract machine assuming no concurrency, equivalent to a total gate count for the AC architecture. $R$ is the number of calls to a subroutine, subscripted with the name of the routine.

$m$, $g$, $f$, $p$, $b$, and $s$ are parameters that determine the behavior of portions of our modular exponentiation algorithm. $m$, $g$, and $f$ are part of our carry-select/conditional-sum adder (sec. IIIB). $p$ and $b$ are used in our indirection scheme (sec. IIIE). $s$ is the number of multiplier blocks we can fit into a chosen amount of space (sec. IIIC).

Here we summarize the techniques which are detailed in following subsections. Our fast modular exponentiation circuit is built using the following optimizations:

• Select correct qubit layout and subsequences to implement gates, then hand optimize (no penalty) [22, 23, 24, 25, 26, 27, 28].

• Look for concurrency within addition/multiplication (no space penalty, maybe noise penalty) (secs. IIIA).

• Select multiplicand using table/indirection (exponential classical cost, linear reduction in quantum gate count) ([29], sec. IIIE).

• Do multiplications concurrently (linear speedup for small values, linear cost in space, small gate count increase; requires quantum-quantum (Q-Q) multiplier, as well as classical-quantum (C-Q) multiplier) (sec. IIIC).

• Move to e.g. carry-save adders ($n^2$ space penalty for reduction to log time, increases total gate count) ([10], sec. IIC4), conditional-sum adders (sec. IIIB2), or carry-lookahead adders (sec. IIC5).

• Reduce modulo comparisons, only do subtract $N$ on overflow (small space penalty, linear reduction in modulo arithmetic cost) (sec. IIID).

C. Existing Algorithms

In this section we will review various components of the modular exponentiation which will be used to construct our parallelized version of the algorithm in section III. There are many ways of building adders and multipliers, and choosing the correct one is a technology-dependent exercise [30]. Only a few classical techniques have been explored for quantum computation. The two most commonly cited modular exponentiation algorithms are those of Vedral, Barenco, and Ekert [7], which we will refer to as VBE, and Beckman, Chari, Devabhaktuni, and Preskill [5], which we will refer to as BCDP. Both BCDP and VBE algorithms build multipliers from variants of carry-ripple adders, the simplest but slowest method; Cuccaro et al. have recently shown the design of a smaller, faster carry-ripple adder. Zalka proposed a carry-select adder; we present our design for such an adder in detail in section IIIB. Draper et al. have recently proposed a carry-lookahead adder, and Gossett a carry-save adder. Beauregard has proposed a circuit that operates primarily in the Fourier transform space.

Carry-lookahead (sec. IIC5), conditional-sum (sec. IIIB2), and carry-save (sec. IIC4) all reach $O(\log n)$ performance for addition. Carry-lookahead and conditional-sum use more space than carry-ripple, but much less than carry-save. However, carry-save adders can be combined into fast multipliers more easily. We will see in sec. III how to combine carry-lookahead and conditional-sum into the overall exponentiation algorithms.

1. VBE Carry-Ripple

The VBE algorithm [7] builds full modular exponentiation from smaller building blocks. The bulk of the time is spent in $20n^2 - 5n$ ADDERs [43]. The full circuit requires $7n + 1$ qubits of storage: $2n + 1$ for $a$, $n$ for the other multiplicand, $n$ for a running sum, $n$ for the convolution products, $n$ for a copy of $N$, and $n$ for carries.

In this algorithm, the values to be added in, the convolution partial products of $x^a$, are programmed into a temporary register (combined with a superposition of $|0\rangle$ as necessary) based on a control line and a data bit via appropriate CCNOT gates. The latency of ADDER and the complete algorithm are

$$ t_{ADD} = (4n - 4;\ 4n - 3;\ 0) \tag{6} $$

$$ t_V = (20n^2 - 5n)\,t_{ADD} = (80n^3 - 100n^2 + 20n;\ 96n^3 - 84n^2 + 15n;\ 8n^2 - 2n + 1) \tag{7} $$

2. BCDP Carry-Ripple

The BCDP algorithm is also based on a carry-ripple adder. It differs from VBE in that it more aggressively takes advantage of classical computation. However, for our purposes, this makes it harder to use some of the optimization techniques presented here. Beckman et al. present several optimizations and tradeoffs of space and time, slightly complicating the analysis.

The exact sequence of gates to be applied is also dependent on the input values of $N$ and $x$, making it less suitable for hardware implementation with fixed gates (e.g., in an optical system). In the form we analyze, it requires $5n + 3$ qubits, including $2n + 1$ for $|a\rangle$. Borrowing from their equation 6.23,

$$ t_B = (54n^3 - 127n^2 + 108n - 29;\ 10n^3 + 15n^2 - 38n + 14;\ 20n^3 - 38n^2 + 22n - 4) \tag{8} $$

3. Cuccaro Carry-Ripple

Cuccaro et al. have recently introduced a carry-ripple circuit, which we will call CUCA, which uses only a single ancilla qubit [31]. The latency of their adder is $(2n - 1;\ 5;\ 0)$ for the AC architecture.

The authors do not present a complete modular exponentiation circuit; we will use their adder in our algorithms F and G. This adder, we will see in section IVC1, is the most efficient known for NTC architectures.

4. Gossett Carry-Save and Carry-Ripple

Gossett's arithmetic is pure quantum, as opposed to the mixed classical-quantum of BCDP. Gossett does not provide a full modular exponentiation circuit, only adders, multipliers, and a modular adder based on the important classical techniques of carry-save arithmetic [10].

Gossett's carry-save adder, the important contribution of the paper, can run in $O(\log n)$ time on AC architectures. It will remain impractical for the foreseeable future due to the large number of qubits required; Gossett estimates $8n^2$ qubits for a full multiplier, which would run in $O(\log^2 n)$ time. It bears further analysis because of its high speed and resemblance to standard fast classical multipliers.

Unfortunately, the paper's second contribution, Gossett's carry-ripple adder, as drawn in his figure 7, seems to be incorrect. Once fixed, his circuit optimizes to be similar to VBE.

5. Carry-Lookahead

Draper, Kutin, Rains, and Svore have recently proposed a carry-lookahead adder, which we call QCLA [32]. This method allows the latency of an adder to drop to $O(\log n)$ for AC architectures. The latency and storage of their adder is

$$ t^{AC}_{LA} = (4\log_2 n + 3;\ 4;\ 2)\#(n;\ 4n - \log n - 1) \tag{9} $$

The authors do not present a complete modular exponentiation circuit; we will use their adder in our algorithm E, which we evaluate only for AC. The large distances between gate operands make it appear that QCLA is unattractive for NTC.

6. Beauregard/Draper QFT-based Exponentiation

Beauregard has designed a circuit for doing modular exponentiation in only $2n + 3$ qubits of space [33], based on Draper's clever method for doing addition on Fourier-transformed representations of numbers [34].

The depth of Beauregard's circuit is $O(n^3)$, the same as VBE and BCDP. However, we believe the constant factors on this circuit are very large; every modulo addition consists of four Fourier transforms and five Fourier additions.

Fowler, Devitt, and Hollenberg have simulated Shor's algorithm using Beauregard's algorithm, for a class of machine they call linear nearest neighbor (LNN) [35, 36]. LNN corresponds approximately to our NTC. In their implementation of the algorithm, they found no significant change in the computational complexity of the algorithm on LNN or an AC-like abstract architecture, suggesting that the performance of Draper's adder, like a carry-ripple adder, is essentially architecture-independent.

III. RESULTS: ALGORITHMIC OPTIMIZATIONS

We present our concurrent variant of VBE, then move to faster adders. This is followed by methods for performing exponentiation concurrently, improving the modulo arithmetic, and indirection to reduce the number of quantum multiplications.

A. Concurrent VBE

In figure 2, we show a three-bit concurrent version of the VBE ADDER. This figure shows that the delay of the concurrent ADDER is $(3n - 3)$CCNOT $+ (2n - 3)$CNOT, or

$$ t^{AC}_{ADD} = (3n - 3;\ 2n - 3;\ 0) \tag{10} $$

a mere 25% reduction in latency compared to the unoptimized $(4n - 4;\ 4n - 3;\ 0)$ of equation 6.

FIG. 2: Three-bit concurrent VBE ADDER, AC abstract machine. Gates marked with an 'x' can be deleted when the carry in is known to be zero.

Adapting equation 7, the total circuit latency, minus a few small corrections that fall outside the ADDER block proper, is

$$ t^{AC}_{V} = (20n^2 - 5n)\,t^{AC}_{ADD} = (60n^3 - 75n^2 + 15n;\ 40n^3 - 70n^2 + 15n;\ 0) \tag{11} $$

This equation is used to create the first entry in table II.

B. Carry-Select and Conditional-Sum Adders

Carry-select adders concurrently calculate possible results without knowing the value of the carry in. Once the carry in becomes available, the correct output value is selected using a multiplexer (MUX). The type of MUX determines whether the behavior is $O(\sqrt{n})$ or $O(\log n)$.

1. $O(\sqrt{n})$ Carry-Select Adder

The bits are divided into $g$ groups of $m$ bits each, $n = gm$. The adder block we will call CSLA, and the combined adder, MUXes, and adder undo to clean our ancillae, CSLAMU. The CSLAs are all executed concurrently, then the output MUXes are cascaded, as shown in figure 4. The first group may have a different size, $f$, than $m$, since it will be faster, but for the moment we assume they are the same.

Figure 3 shows a three-bit carry-select adder. This generates two possible results, assuming that the carry in will be zero or one. The portion on the right is a MUX used to select which carry to use, based on the carry in. All of the outputs without labels are ancillae to be garbage collected. It is possible that a design optimized for space could reuse some of those qubits; as drawn a full carry-select circuit requires $5m - 1$ qubits to add two $m$-bit numbers.

FIG. 3: Three-bit carry-select adder (CSLA) with multiplexer (MUX). $a_i$ and $b_i$ are addends. The control-SWAP gates in the MUX select either the qubits marked $c_{in} = 1$ or $c_{in} = 0$ depending on the state of the carry in qubit $c_{in}$. $s_i$ qubits are the output sum and $k_i$ are internal carries.

FIG. 4: Block-level diagram of four-group carry-select adder. $a_i$ and $b_i$ are addends and $s_i$ is the sum. Additional ancillae not shown.

The larger $m$-bit carry-select adder can be constructed so that its internal delay, as in a normal carry-ripple adder, is one additional CCNOT for each bit, although the total number of gates increases and the distance between gate operands

increases.

The latency for the CSLA block is

$$ t^{AC}_{CS} = (m;\ 2;\ 0) \tag{12} $$

Note that this is not a "clean" adder; we still have ancillae to return to the initial state.

The problem for implementation will be creating an efficient MUX, especially on NTC. Figure 4 makes it clear that the total carry-select adder is only faster if the latency of MUX is substantially less than the latency of the full carry-ripple. It will be difficult for this to be more efficient than the single-CCNOT delay of the basic VBE carry-ripple adder on NTC. On AC, it is certainly easy to see how the MUX can use a fanout tree consisting of more ancillae and CNOT gates to distribute the carry in signal, as suggested by Moore [12], allowing all MUX Fredkin gates to be executed concurrently. A full fanout requires an extra $m$ qubits in each adder.

In order to unwind the ancillae to reuse them, the simplest approach is the use of CNOT gates to copy our result to another $n$-bit register, then a reversal of the circuitry. Counting the copy out for ancilla management, we can simplify the MUX to two CCNOTs and a pair of NOTs.

The latency of the carry ripple from MUX to MUX (not qubit to qubit) can be arranged to give a MUX cost of $(4g + 2m - 6;\ 0;\ 2g - 2)$. This cost can be accelerated somewhat by using a few extra qubits and "fanning out" the carry. For intermediate values of $m$, we will use a fanout of 4 on AC, reducing the MUX latency to $(4g + m/2 - 6;\ 2;\ 2g - 2)$ in exchange for 3 extra qubits in each group.

Our space used for the full, clean adder is $(6m - 1)(g - 1) + 3f + 4g$ when using a fanout of 4.

The total latency of the CSLA, MUX, and the CSLA undo is

$$ t^{AC}_{SEM} = 2t^{AC}_{CS} + t^{AC}_{MUX} = (4g + 5m/2 - 6;\ 6;\ 2g - 2) \tag{13} $$

Optimizing for AC, based on equation 13, the delay will be the minimum when $m \sim \sqrt{8n/5}$.

Zalka was the first to propose use of a carry-select adder, though he did not refer to it by name [11]. His analysis does not include an exact circuit, and his results differ slightly from ours.

2. $O(\log n)$ Conditional Sum Adder

As described above, the carry-select adder is $O(m + g)$, for $n = mg$, which minimizes to be $O(\sqrt{n})$. To reach $O(\log n)$ performance, we must add a multi-level MUX to our carry-select adder. This structure is called a conditional sum adder, which we will label CSUM. Rather than repeatedly choosing bits at each level of the MUX, we will create a multi-level distribution of MUX select signals, then apply them once at the end. Figure 5 shows only the carry signals for eight CSLA groups. The $e$ signals in the figure are our effective swap control signals. They are combined with a carry in signal to control the actual swap of variables. In a full circuit, a ninth group, the first group, will be a carry-ripple adder and will create the carry in; that carry in will be distributed concurrently in a separate tree.

FIG. 5: $O(\log n)$ MUX for conditional-sum adder, for $g = 9$ (the first group is not shown). Only the $c_{i,j}$ carry out lines from each $m$-qubit block are shown, where $i$ is the block number and $j$ is the carry in value. At each stage, the span of correct effective swap control lines $e_{i,j}$ doubles. After using the swap control lines, all but the last must be cleaned by reversing the circuit. Unlabeled lines are ancillae to be cleaned.

The total adder latency will be

$$ t^{AC}_{CSUM} = 2t^{AC}_{CS} + (2\lceil\log_2(g - 1)\rceil - 1) \times (2;\ 0;\ 2) + (4;\ 0;\ 4) = (2m + 4\lceil\log_2(g - 1)\rceil + 2;\ 4;\ 4\lceil\log_2(g - 1)\rceil + 2) \tag{14} $$

where $\lceil x \rceil$ indicates the smallest integer not smaller than $x$.

For large $n$, this generally reaches a minimum for small $m$, which gives asymptotic behavior $\sim 4\log_2 n$, the same as QCLA. CSUM is noticeably faster for small $n$, but requires more space.

The MUX uses $\lceil 3(g - 1)/2 \rceil - 2$ qubits in addition to the internal carries and the tree for dispersing the carry in. Our space used for the full, clean adder is $(6m - 1)(g - 1) + 3f + \lceil 3(g - 1)/2 - 2 + (n - f)/2 \rceil$.
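The tradeoff between group size $m$ and group count $g$ in equations 13 and 14 can be explored numerically. The following is a minimal sketch, not part of the original analysis: it assumes equal-sized groups with $n = gm$ (ignoring the separately sized first group $f$), ranks candidates by CCNOT depth only, and uses hypothetical helper names (`t_sem`, `t_csum`, `best_group_size`).

```python
import math

def t_sem(n, m):
    """Eq. 13: (CCNOT; CNOT; NOT) latency of the carry-select adder
    (CSLA + fanout-4 MUX + CSLA undo), assuming n = g*m equal groups."""
    assert n % m == 0
    g = n // m
    return (4 * g + 5 * m / 2 - 6, 6, 2 * g - 2)

def t_csum(n, m):
    """Eq. 14: latency of the conditional-sum adder, assuming n = g*m, g >= 2."""
    assert n % m == 0 and n // m >= 2
    g = n // m
    levels = math.ceil(math.log2(g - 1))
    return (2 * m + 4 * levels + 2, 4, 4 * levels + 2)

def best_group_size(latency, n):
    """Group size m (a divisor of n with g >= 2) minimizing CCNOT depth.
    Ranking by CCNOT depth alone is an assumption of this sketch."""
    candidates = [m for m in range(1, n // 2 + 1) if n % m == 0]
    return min(candidates, key=lambda m: latency(n, m)[0])

n = 128
m_sem = best_group_size(t_sem, n)
m_csum = best_group_size(t_csum, n)
print("continuous optimum for eq. 13:", math.sqrt(8 * n / 5))
print("carry-select:    m =", m_sem, "latency =", t_sem(n, m_sem))
print("conditional-sum: m =", m_csum, "latency =", t_csum(n, m_csum))
print("concurrent VBE ADDER (eq. 10):", (3 * n - 3, 2 * n - 3, 0))
```

Restricting $m$ to divisors of $n$ keeps $g$ an integer; with the power-of-two $n$ assumed in the text this means power-of-two group sizes. Space cost, which the text treats as the primary constraint, is deliberately left out of this ranking.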

C. Concurrent Exponentiation

Modular exponentiation is often drawn as a string of modular multiplications, but Cleve and Watrous pointed out that these can easily be parallelized, at linear cost in space [9]. We always have to execute $2n$ multiplications; the goal is to do them in as few time-delays as possible.

To go (almost) twice as fast, use two multipliers. For four times, use four. Naturally, this can be built up to $n$ multipliers to multiply the necessary $2n + 1$ numbers, in which case a tree recombining the partial results requires $\log_2 n$ quantum-quantum (Q-Q) multiplier latency times. The first unit in each chain just sets the register to the appropriate value if the control line is 1, otherwise, it leaves it as 1.

FIG. 6: Concurrent modular multiplication in modular exponentiation for $s = 2$. QSET simply sets the sum register to the appropriate value.

For $s$ multipliers, $s \le n$, each multiplier must combine $r = \lfloor (2n + 1)/s \rfloor$ or $r + 1$ numbers, using $r - 1$ or $r$ multiplications (the first number being simply set into the running [...]

D. Reducing the Cost of Modulo Operations

The VBE algorithm does a trial subtraction of $N$ in each modulo addition block; if that underflows, $N$ is added back in to the total. This accounts for two of the five ADDER blocks and much of the extra logic to compose a modulo adder. The last two of the five blocks are required to undo the overflow bit.

Figure 7 shows a more efficient modulo adder than VBE, based partly on ideas from BCDP and Gossett. It requires only three adder blocks, compared to five for VBE, to do one modulo addition. The first adder adds $x^j$ to our running sum. The second conditionally adds $2^n - x^j - N$ or $2^n - x^j$, depending on the value of the overflow bit, without affecting the overflow bit, arranging it so that the third addition of $x^j$ will overflow and clear the overflow bit if necessary. The blocks pointed to by arrows are the addend register, whose value is set depending on the control lines. Figure 7 uses $n$ fewer bits than VBE's modulo arithmetic, as it does not require a register to hold $N$.

FIG. 7: More efficient modulo adder. The blocks with arrows set the register contents based on the value of the control line. The position of the black block indicates the running sum in our output.

In a slightly different fashion, we can improve the performance of VBE by adding a number of qubits, $p$, to our result register, and postponing the modulo operation until later. This works as long as we don't allow the result register to overflow; we have a redundant representation of modulo $N$ values, but that is not a problem at this stage of the computation.

The largest number that doesn't overflow for $p$ extra qubits is $2^{n+p} - 1$; the largest number that doesn't result in subtraction is $2^{n+p-1} - 1$. We want to guarantee that we always clear that high-order bit, so if we subtract $bN$, the most iterations we can go before the next subtraction is $b$.

The largest multiple of $N$ we can subtract is $\lfloor 2^{n+p-1}/N \rfloor$. Since $2^{n-1}$ [...]

E. Indirection

We have shown elsewhere that it is possible to build a table containing small powers of $x$, from which an argument to a multiplier is selected [29]. In exchange for adding storage space for $2^w$ $n$-bit entries in a table, we can reduce the number

of multiplications necessary by a factor of $w$. This appears to be attractive for small values of $w$, such as 2 or 3.

In our prior work, we proposed using a large quantum memory, or a quantum-addressable classical memory (QACM) [37]. Here we show that the quantum storage space need not grow; we can implicitly perform the lookup by choosing which gates to apply while setting the argument. In figure 8, we show the setting and resetting of the argument for $w = 2$, where the arrows indicate CCNOTs to set the appropriate bits of the 0 register to 1. The actual implementation can use a calculated enable bit to reduce the CCNOTs to CNOTs. Only one of the values $x^0$, $x^1$, $x^2$, or $x^3$ will be enabled, based on the value of $|a_1 a_0\rangle$.

FIG. 8: Implicit Indirection. The arrows pointing to blocks indicate the setting of the addend register based on the control lines. This sets the addend from a table stored in classical memory, reducing the number of quantum multiplications by a factor of $w$ in exchange for $2^w$ argument setting operations.

The setting of this input register may require propagating $|a\rangle$ or the enable bit across the entire register. Use of a few extra qubits ($2^w - 1$) will allow the several setting operations to propagate in a tree.

$$ t^{AC}_{ARG} = \begin{cases} 2^w(1;\ 0;\ 1) = (4;\ 0;\ 4) & w = 2 \\ 2^w(3;\ 0;\ 1) & w = 3, 4 \end{cases} \tag{18} $$

For $w = 2$ and $w = 3$, we calculate that setting the argument adds $(4;\ 0;\ 4)\#(4, 5)$ and $(24;\ 0;\ 8)\#(8, 9)$, respectively, to the latency, concurrency and storage of each adder. We create separate enable signals for each of the $2^w$ possible arguments and pipeline flowing them across the register to set the addend bits. We consider this cost only when using indirection. Figure 9 shows circuits for $w = 2, 3, 4$.

FIG. 9: Argument setting for indirection for different values of $w$, for the AC architecture. For the $w = 4$ case, the two CCNOTs on the left can be executed concurrently, as can the two on the right, for a total latency of 3.

Adapting equation 15 to both indirection and concurrent multiplication, we have a total latency for our circuit, in multiplier calls, of

$$ R_I = 2r + 1 + \lceil \log_2( \lceil (s - 2n - 1 + rs)/4 \rceil + 2n + 1 - rs ) \rceil \tag{19} $$

where $r = \lfloor \lceil (2n + 1)/w \rceil / s \rfloor$.

IV. EXAMPLE: EXPONENTIATING A 128-BIT NUMBER

In this section, we combine these techniques into complete algorithms and examine the performance of modular exponentiation of a 128-bit number. We assume the primary engineering constraint is the available number of qubits.

[...] begins to kick in. Thus, in managing space tradeoffs, this will be our standard; any technique that raises performance by more than a factor of $c$ in exchange for $c$ times as much space will be used preferentially to parallel multiplication. Carry-select adders (sec. IIIB) easily meet this criterion, being perhaps six times faster for less than twice the space.

Algorithm D uses $100n$ space and our conditional-sum adder CSUM. Algorithm E uses $100n$ space and the carry-lookahead adder QCLA. Algorithms F and G use the Cuccaro adder and $100n$ and minimal space, respectively. Parameters for these algorithms are shown in table I. We have included detailed equations for concurrent VBE and D and numeric results in table II. The performance ratios are based only on the CCNOT gate count for AC, and only on the CNOT gate count for NTC.

A. Concurrent VBE

On AC, the concurrent VBE ADDER is $(3n - 3;\ 2n - 3;\ 0) = (381;\ 253;\ 0)$ for 128 bits. This is the value we use in the concurrent VBE line in table II. This will serve as our best baseline time for comparing the effectiveness of more drastic algorithmic surgery.

Figure 10 shows a fully optimized, concurrent, but otherwise unmodified version of the VBE ADDER for three bits on a neighbor-only machine (NTC architecture), with the gates marked 'x' in figure 2 eliminated. The latency is

$$ t^{NTC}_{ADD} = (20n - 15;\ 0)\#(2;\ 3n + 1) \tag{20} $$

or 45 gate times for the three-bit adder. A 128-bit adder will have a latency of $(2545;\ 0)$. The diagram shows a concurrency level of three, but simple adjustment of execution time slots can limit that to two for any $n$, with no latency penalty.

The unmodified full VBE modular exponentiation algorithm consists of $20n^2 - 5n = 327040$ ADDER calls plus minor additional logic.

NTC 2 NTC
In sec- tV = (20n 5n)tADD 3− 2 tion IIIC we showed that using twice as much space can al- = (400n 400n + 75n; 0) (21) − most double our speed, essentially linearly until the log term 9
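The baseline figures quoted for concurrent VBE can be checked with a few lines of arithmetic. The sketch below is ours, not part of the original analysis: the function names are hypothetical, and the ordering of the AC latency triple as (CCNOT; CNOT; NOT) is an assumption based on the caption of table II. The formulas themselves are the ones stated in the text.

```python
def vbe_adder_calls(n):
    """ADDER calls in the unmodified VBE modular exponentiation: 20n^2 - 5n."""
    return 20 * n * n - 5 * n


def vbe_adder_ac(n):
    """Concurrent VBE ADDER latency on AC, (3n - 3; 2n - 3; 0).

    Assumed ordering of the triple: (CCNOT; CNOT; NOT).
    """
    return (3 * n - 3, 2 * n - 3, 0)


def vbe_adder_ntc(n):
    """Concurrent VBE ADDER latency on NTC, equation 20: (20n - 15; 0)."""
    return (20 * n - 15, 0)


def vbe_modexp_ac(n):
    """Full modular exponentiation on AC: ADDER calls times ADDER latency."""
    calls = vbe_adder_calls(n)
    return tuple(calls * t for t in vbe_adder_ac(n))


def vbe_modexp_ntc(n):
    """Closed form of equation 21: 400n^3 - 400n^2 + 75n gate times."""
    return 400 * n**3 - 400 * n**2 + 75 * n


if __name__ == "__main__":
    n = 128
    print("ADDER calls:", vbe_adder_calls(n))
    print("AC  total  :", vbe_modexp_ac(n))
    print("NTC total  :", vbe_modexp_ntc(n))
```

For n = 128 the totals round to the concurrent VBE entries of table II, and the closed form of equation 21 agrees with multiplying the ADDER call count by the per-ADDER latency of equation 20.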

| algorithm | adder | modulo | indirect | multipliers (s) | space | concurrency |
|---|---|---|---|---|---|---|
| concurrent VBE | VBE | VBE | N/A | 1 | 897 | 2 |
| algorithm D | CSUM (m = 4) | p = 11, b = 1024 | w = 2 | 12 | 11969 | 126 × 12 = 1512 |
| algorithm E | QCLA | p = 10, b = 512 | w = 2 | 16 | 12657 | 128 × 16 = 2048 |
| algorithm F | CUCA | p = 10, b = 512 | w = 4 | 20 | 11077 | 20 × 2 = 40 |
| algorithm G | CUCA | fig. 7 | w = 4 | 1 | 660 | 2 |

TABLE I: Parameters for our algorithms, chosen for 128 bits.
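As a small consistency check on table I, the sketch below confirms that the three fixed-space algorithms fit within the 100n budget for n = 128 and recomputes the total concurrency as multipliers times per-multiplier concurrency. The dictionary layout and function names are ours, and reading the concurrency column as a product of those two factors is an assumption about the table.

```python
N_BITS = 128
SPACE_BUDGET = 100 * N_BITS  # the 100n fixed-space budget, in qubits

# (multipliers s, space in qubits, per-multiplier concurrency), from table I.
TABLE_I = {
    "D": (12, 11969, 126),
    "E": (16, 12657, 128),
    "F": (20, 11077, 2),
}


def total_concurrency(multipliers, per_multiplier):
    """Total concurrency when all multiplier blocks run in parallel."""
    return multipliers * per_multiplier


def fits_budget(space, budget=SPACE_BUDGET):
    """True if the algorithm's qubit count is within the space budget."""
    return space <= budget


if __name__ == "__main__":
    for name, (s, space, conc) in TABLE_I.items():
        print(name, fits_budget(space), total_concurrency(s, conc))
```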

| algorithm | AC gates | AC perf. | NTC gates | NTC perf. |
|---|---|---|---|---|
| concurrent VBE | ($1.25 \times 10^8$; $8.27 \times 10^7$; $0.00 \times 10^0$) | 1.0 | ($8.32 \times 10^8$; $0.00 \times 10^0$) | 1.0 |
| algorithm D | ($2.19 \times 10^5$; $2.57 \times 10^4$; $1.67 \times 10^5$) | 569.8 | N/A | N/A |
| algorithm E | ($1.71 \times 10^5$; $1.96 \times 10^4$; $2.93 \times 10^4$) | 727.2 | N/A | N/A |
| algorithm F | ($7.84 \times 10^5$; $1.30 \times 10^4$; $4.10 \times 10^4$) | 158.9 | ($4.11 \times 10^6$; $4.10 \times 10^4$) | 202.5 |
| algorithm G | ($1.50 \times 10^7$; $2.48 \times 10^5$; $7.93 \times 10^5$) | 8.3 | ($7.87 \times 10^7$; $7.93 \times 10^5$) | 10.6 |

TABLE II: Latency to factor a 128-bit number for various architectures and choices of algorithm. AC, abstract concurrent architecture. NTC, neighbor-only, two-qubit gate, concurrent architecture. perf, performance relative to VBE algorithm for that architecture, based on CCNOTs for AC and CNOTs for NTC.
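The algorithm D row of table II can be reproduced from equations 19, 22 and 23 with the parameters of table I. The sketch below is a hedged illustration with hypothetical function names and two assumptions of ours: the number of carry groups is taken as g = n/m, and the 2n + 1 terms inside the logarithm of equation 19 are read as the number of multiplications remaining after indirection, ⌈(2n + 1)/w⌉. Under that reading $R_I = 25$ and the three latency components round to the tabulated values; taking 2n + 1 literally gives a somewhat larger count.

```python
from math import ceil, log2


def rounds_with_indirection(n, w, s):
    """R_I of equation 19, in multiplier calls.

    Assumption (ours): the 2n+1 terms inside the logarithm stand for the
    number of multiplications left after indirection, ceil((2n+1)/w).
    """
    mults = ceil((2 * n + 1) / w)
    r = mults // s
    leftover = ceil((s - mults + r * s) / 4) + mults - r * s
    return 2 * r + 1 + ceil(log2(leftover))


def csum_latency(m, g):
    """t_CSUM as the latency triple that appears in equation 23."""
    depth = 4 * ceil(log2(g - 1)) + 2
    return (2 * m + depth, 4, depth)


def algorithm_d_latency(n, w, s, m, p, b, t_arg=(4, 0, 4)):
    """t_D = R_I * R_M * (t_CSUM + t_ARG) + 3p * t_CSUM (equation 22).

    t_arg defaults to the w = 2 argument-setting cost of equation 18.
    """
    g = n // m  # assumption (ours): one carry group per m bits
    r_i = rounds_with_indirection(n, w, s)
    r_m = n * (2 * b + 1) / b
    t_csum = csum_latency(m, g)
    return tuple(r_i * r_m * (c + a) + 3 * p * c
                 for c, a in zip(t_csum, t_arg))


if __name__ == "__main__":
    lat = algorithm_d_latency(n=128, w=2, s=12, m=4, p=11, b=1024)
    vbe_ccnot = 327040 * 381  # concurrent VBE baseline on AC
    print([f"{x:.2e}" for x in lat])
    print(f"perf relative to VBE: {vbe_ccnot / lat[0]:.1f}")
```

The first component divided into the concurrent VBE CCNOT count gives the performance ratio listed for algorithm D.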

B. Algorithm D

The overall structure of algorithm D is similar to VBE, with our conditional-sum adders instead of the VBE carry-ripple, and our improvements in indirection and modulo. As we do not consider CSUM to be a good candidate for an algorithm for NTC, we evaluate only for AC. Algorithm D is the fastest algorithm for n = 8 and n = 16.

$$t_D = R_I R_M \times (t_{CSUM} + t_{ARG}) + 3p\,t_{CSUM} \qquad (22)$$

Letting $r = \lfloor \lceil (2n+1)/w \rceil / s \rfloor$, the latency and space requirements for algorithm D are

$$\begin{aligned} t_D^{AC} = {} & \left(2r + 1 + \lceil \log_2(\lceil (s - 2n - 1 + rs)/4 \rceil + 2n + 1 - rs) \rceil\right)\, n(2b+1)/b \\ & \times \left((2m + 4\lceil \log_2(g-1) \rceil + 2;\ 4;\ 4\lceil \log_2(g-1) \rceil + 2) + (4; 0; 4)\right) \\ & + 3p\,(2m + 4\lceil \log_2(g-1) \rceil + 2;\ 4;\ 4\lceil \log_2(g-1) \rceil + 2) \end{aligned} \qquad (23)$$

and

$$\begin{aligned} S_D & = s(S_{CSUM} + 2^w + 1 + p + n) + 2n + 1 \\ & = s(7n - 3m - g + 2^w + p + \lceil 3(g-1)/2 - 2 + (n-m)/2 \rceil) + 2n + 1 \end{aligned} \qquad (24)$$

C. Algorithm E

Algorithm E uses the carry-lookahead adder QCLA in place of the conditional-sum adder CSUM. Although CSUM is slightly faster than QCLA, its significantly larger space consumption means that in our 100n fixed-space analysis, we can fit in 16 multipliers using QCLA, compared to only 12 using CSUM. This allows the overall algorithm E to be 28% faster than D for 128 bits.

1. Algorithms F and G

The Cuccaro carry-ripple adder has a latency of (10n + 5; 0) for NTC. This is twice as fast as the VBE adder. We use this in our algorithms F and G. Algorithm F uses 100n space, while G is our attempt to produce the fastest algorithm in the minimum space.

D. Smaller n and Different Space

Figure 11 shows the execution times of our three fastest algorithms for n from eight to 128 bits. Algorithm D, using CSUM, is the fastest for eight and 16 bits, while E, using QCLA, is fastest for larger values. The latency of 1072 for n = 8 bits is 32 times faster than concurrent VBE, achieved with 60n = 480 qubits of space.

Figure 12 shows the execution times for n = 128 bits for various amounts of available space. All of our algorithms have reached a minimum by 240n space (roughly $1.9n^2$).

E. Asymptotic Behavior

The focus of this paper is the constant factors in modular exponentiation for important problem sizes and architectural characteristics. However, let us look briefly at the asymptotic behavior of our circuit depth.

In section IIIC, we showed that the latency of our complete

[Figure 10 circuit diagram: qubit lines a0, b0, c0, a1, b1, c1, a2, b2, cout; time steps 1 to 45.]

FIG. 10: Optimized, concurrent three bit VBE ADDER for the NTC architecture. Numbers across the bottom are time steps.

algorithm is

$$O(n/s + \log s) \times \text{latency of multiplication} \qquad (25)$$

as we parallelize the multiplication using s multiplier blocks. Our multiplication is still

$$O(n) \times \text{latency of addition} \qquad (26)$$

Algorithms D and E both use an O(log n) adder. Combining equations 25 and 26 with the adder cost, we have asymptotic circuit depth of

$$t_D^{AC} = t_E^{AC} = O((n \log n)(n/s + \log s)) \qquad (27)$$

for algorithms D and E. As s → n, these approach $O(n \log^2 n)$ and space consumed approaches $O(n^2)$.

Algorithm F uses an O(n) adder, whose asymptotic behavior is the same on both AC and NTC, giving

$$t_F^{AC} = t_F^{NTC} = O((n^2)(n/s + \log s)) \qquad (28)$$

approaching $O(n^2 \log n)$ as space consumed approaches $O(n^2)$.

This compares to asymptotic behavior of $O(n^3)$ for VBE, BCDP, and algorithm G, using O(n) space. The limit of performance, using a carry-save multiplier and large s, will be $O(\log^3 n)$ in $O(n^3)$ space.

FIG. 11: Execution time for our algorithms for space 100n on the AC architecture, for varying value of n. [Plot: latency (CCNOT gate count) versus length (bits) from 8 to 128; curves for algo D, algo E, algo F.]

FIG. 12: Execution time for our algorithms for 128 bits on the AC architecture, for varying multiples of n space available. [Plot: latency (CCNOT gate count) versus space (multiple of n) from 0 to 300; curves for algo D, algo E, algo F.]

V. DISCUSSION AND FUTURE WORK

We have shown that it is possible to significantly accelerate quantum modular exponentiation using a stable of techniques. We have provided exact gate counts, rather than asymptotic behavior, for the n = 128 case, showing algorithms that are faster by a factor of 200 to 700, depending on architectural features, when 100n qubits of storage are available. For n = 1024, this advantage grows to more than a factor of 5,000 for non-neighbor machines (AC). Neighbor-only (NTC) machines can run algorithms such as addition in O(n) time at best, when non-neighbor machines (AC) can achieve O(log n) performance.

In this work, our contribution has focused on parallelizing execution of the arithmetic through improved adders, concurrent gate execution, and overall algorithmic structure. We have also made improvements that resulted in the reduction of modulo operations, and traded some classical for quantum computation to reduce the number of quantum operations. It seems likely that further improvements can be found in the overall structure and by more closely examining the construction of multipliers from adders [30]. We also intend to pursue multipliers built from hybrid carry-save adders.

The three factors which most heavily influence performance of modular exponentiation are, in order, concurrency, the availability of large numbers of application-level qubits, and the topology of the interconnection between qubits. Without concurrency, it is of course impossible to parallelize the execution of any algorithm. Our algorithms can use up to $\sim 2n^2$ application-level qubits to execute the multiplications in parallel, executing O(n) multiplications in O(log n) time steps. Finally, if any two qubits can be operands to a quantum gate, regardless of location, the propagation of information about the carry allows an addition to be completed in O(log n) time steps instead of O(n). We expect that these three factors will influence the performance of other algorithms in similar fashion.

Not all physically realizable architectures map cleanly to one of our models. A full two-dimensional mesh, such as neutral atoms in an optical lattice [38], and a loose trellis topology [39] probably fall between AC and NTC. The behavior of the scalable ion trap [40] is not immediately clear. We have begun work on expanding our model definitions, as well as additional ways to characterize quantum computer architectures.

The process of designing a large-scale quantum computer has only just begun. Over the coming years, we expect advances in the fundamental technology, the system architecture, algorithms, and tools such as compilers to all contribute to the creation of viable machines. Our hope is that the algorithms and techniques in this paper will contribute to that engineering process in both the short and long term.

Acknowledgments

The authors would like to thank Eisuke Abe, Fumiko Yamaguchi, and Kevin Binkley of Keio University, Thaddeus Ladd of Stanford University, Seth Lloyd of MIT, Y. Kawano and Y. Takahashi of NTT Basic Research Laboratories, Bill Munro of HP Labs, Kae Nemoto of NII, and the referee for helpful discussions and feedback.

[1] P. W. Shor, SIAM J. Comp. 26, 1484 (1997).
[2] L. Grover, in Proc. 28th Annual ACM Symposium on the Theory of Computation (1996), pp. 212–219.
[3] D. Deutsch and R. Jozsa, Proc. R. Soc. London Ser. A, 439, 553 (1992).
[4] M. A. Nielsen and I. L. Chuang, Quantum Computation and Quantum Information (Cambridge University Press, 2000).
[5] D. Beckman, A. N. Chari, S. Devabhaktuni, and J. Preskill, Phys. Rev. A 54, 1034 (1996).
[6] P. W. Shor, in Proc. 35th Symposium on Foundations of Computer Science (IEEE Computer Society Press, Los Alamitos, CA, 1994), pp. 124–134.
[7] V. Vedral, A. Barenco, and A. Ekert, Phys. Rev. A 54, 147 (1996).
[8] D. E. Knuth, The Art of Computer Programming, volume 2 / Seminumerical Algorithms (Addison-Wesley, Reading, MA, 1998), 3rd ed.
[9] R. Cleve and J. Watrous, in Proc. 41st Annual Symposium on Foundations of Computer Science (ACM, 2000), pp. 526–536.
[10] P. Gossett, Quantum carry-save arithmetic, http://arXiv.org/quant-ph/9808061 (1998).
[11] C. Zalka, Fast versions of Shor's quantum factoring algorithm, http://arXiv.org/quant-ph/9806084 (1998).
[12] C. Moore and M. Nilsson, SIAM J. Computing 31, 799 (2001).
[13] D. Loss and D. P. DiVincenzo, Phys. Rev. A 57, 120 (1998).
[14] B. E. Kane, Nature 393, 133 (1998).
[15] T. D. Ladd, J. R. Goldman, F. Yamaguchi, Y. Yamamoto, E. Abe, and K. M. Itoh, Physical Review Letters 89, 17901 (2002).
[16] Y. A. Pashkin, T. Yamamoto, O. Astafiev, Y. Nakamura, D. V. Averin, and J. S. Tsai, Nature 421, 823 (2003).
[17] J. Q. You, J. S. Tsai, and F. Nori, Phys. Rev. Lett. 89 (2002).
[18] A. Barenco, C. H. Bennett, R. Cleve, D. P. DiVincenzo, N. Margolus, P. Shor, T. Sleator, J. Smolin, and H. Weinfurter, Phys. Rev. A 52, 3457 (1995).
[19] P. W. Shor, in Proc. 37th Symposium on Foundations of Computer Science (IEEE Computer Society Press, Los Alamitos, CA, 1996), pp. 56–65.
[20] A. M. Steane and B. Ibinson, Fault-tolerant logical gate networks for CSS codes, http://arXiv.org/quant-ph/0311014 (2003).
[21] A. M. Steane, Quantum Information and Computation 2, 297 (2002).
[22] A. V. Aho and K. M. Svore, Compiling quantum circuits using the palindrome transform, http://arXiv.org/quant-ph/0311008 (2003).
[23] G. Ahokas, R. Cleve, and L. Hales, in [41].
[24] Y. Kawano, S. Yamashita, and M. Kitagawa (2004), private communication, submitted to PRA.
[25] N. Kunihiro (2004), private communication.
[26] Y. Takahashi, Y. Kawano, and M. Kitagawa, in [41].
[27] L. M. K. Vandersypen, Ph.D. thesis, Stanford University (2001).
[28] A. Yao, in Proceedings of the 34th Annual Symposium on Foundations of Computer Science (Institute of Electrical and Electronic Engineers Computer Society Press, Los Alamitos, CA, 1993), pp. 352–361.
[29] R. Van Meter, in Proc. Int. Symp. on Mesoscopic Superconductivity and Spintronics (MS+S2004) (2004).
[30] M. D. Ercegovac and T. Lang, Digital Arithmetic (Morgan Kaufmann, San Francisco, CA, 2004).
[31] S. A. Cuccaro, T. G. Draper, S. A. Kutin, and D. P. Moulton, A new quantum ripple-carry addition circuit, http://arXiv.org/quant-ph/0410184 (2004).
[32] T. G. Draper, S. A. Kutin, E. M. Rains, and K. M. Svore, A logarithmic-depth quantum carry-lookahead adder, http://arXiv.org/quant-ph/0406142 (2004).
[33] S. Beauregard, Quantum Information and Computation 3, 175 (2003).
[34] T. G. Draper, Addition on a quantum computer, http://arXiv.org/quant-ph/0008033 (2000), first draft dated Sept. 1998.
[35] A. G. Fowler, S. J. Devitt, and L. C. Hollenberg, Quantum Information and Computing 4, 237 (2004).
[36] S. J. Devitt, A. G. Fowler, and L. C. Hollenberg, Simulations of Shor's algorithm with implications to scaling and quantum error correction, http://arXiv.org/quant-ph/0408081 (2004).
[37] M. A. Nielsen and I. L. Chuang, Quantum Computation and Quantum Information (Cambridge University Press, 2000), pp. 266–268.
[38] G. K. Brennen, C. M. Caves, P. S. Jessen, and I. H. Deutsch, Physical Review Letters 82, 1060 (1999).
[39] M. Oskin, F. T. Chong, I. L. Chuang, and J. Kubiatowicz, in Computer Architecture News, Proc. 30th Annual International Symposium on Computer Architecture (ACM, 2003).
[40] D. Kielpinski, C. Monroe, and D. J. Wineland, Nature 417, 709 (2002).
[41] Proc. ERATO Conference on Quantum Information Science (EQIS2003) (2003).
[42] Shor noted this in his original paper, without explicitly specifying a bound. Note also that this bound is for a Turing machine; a random-access machine can reach O(n log n).
[43] When we write ADDER in all capital letters, we mean the complete VBE n-bit construction, with the necessary undo; when we write adder in small letters, we are usually referring to a smaller or generic circuit block.