Arxiv:2103.13855V2 [Quant-Ph] 18 Aug 2021 System [10]

Demonstration of Shor’s factoring algorithm for N=21 on IBM quantum processors

Unathi Skosana∗ and Mark Tame Department of Physics, Stellenbosch University, Matieland 7602, South Africa (Dated: August 19, 2021) We report a proof-of-concept demonstration of a quantum order-finding algorithm for factoring the integer 21. Our demonstration involves the use of a compiled version of the quantum phase estimation routine, and builds upon a previous demonstration by Mart´ın-López et al. in Nature Photonics 6, 773 (2012). We go beyond this work by using a configuration of approximate Toffoli gates with residual phase shifts, which preserves the functional correctness and allows us to achieve a complete factoring of N = 21. We implemented the algorithm on IBM quantum processors using only 5 qubits and successfully verified the presence of entanglement between the control and work register qubits, which is a necessary condition for the algorithm’s speedup in general. The techniques we employ may be useful in carrying out Shor’s algorithm for larger integers, or other algorithms in systems with a limited number of noisy qubits.

I. INTRODUCTION tions and combining the outcomes [14]. Recently, build- ing on previous schemes of hybrid factorization [15, 16], a Shor’s algorithm [1] is a quantum algorithm that pro- quantum-classical hybrid scheme has been implemented vides a way of finding the nontrivial factors of an L-bit on IBM’s quantum processors for the prime factorization odd composite integer N = pq in polynomial time with of 35. This hybrid scheme of factorization alleviates the high probability. The crux of Shor’s algorithm rests upon resource requirements of the algorithm at the expense of Quantum Phase Estimation (QPE) [2], which is a quan- performing part of the factoring classically [17]. tum routine that estimates the phase ϕu of an eigenvalue In this paper, we build on the order-finding routine of e2πiϕu corresponding to an eigenvector |ui for some uni- Ref. [11] and implement a version of Shor’s algorithm for tary matrix Uˆ. QPE efficiently solves a problem related factoring 21 using only 5 qubits – the work register con- to factoring, known as the order-finding problem, in poly- tains 2 qubits and the control register contains 3 qubits, nomial time in the number of bits needed to specify the each providing 1-bit of accuracy in the resolution of the peaks in the output probability distribution used to find problem, which in this case is L = dlog2 Ne. By solving the order-finding problem using QPE and carrying out the order. This approach is in contrast to the iterative a few extra steps, one can factor the integer N. There version [18] used in Refs. [11] and [14], which employs a is no known classical algorithm that can solve the same single qubit that is recycled through measurement and problem in polynomial time [2,3]. feed-forward, giving 1-bit of accuracy each time it is re- A large corpus of work has been done with regards cycled. The advantage of the iterative approach lies in to the experimental realization of Shor’s algorithm over this very reason; through mid-circuit measurement and the years. The pioneering work was performed with real-time conditional feed-forward operations, the total liquid-state nuclear magnetic resonance, factoring 15 on number of qubits required by the algorithm is signifi- a 7-qubit quantum computer [4]. The considerable re- cantly reduced. At the time of writing, IBM’s quantum source demands of Shor’s original algorithm were cir- processors do not yet support real-time conditionals nec- cumvented by using various approaches, including adi- essary for the implementation of the iterative approach, abatic quantum computing [5] and in the standard net- so we use 3 qubits for the control register, one for each work model using techniques of compilation [6] that re- effective iteration. Thus, our compact approach is com- duced the demands to within the reach of single-photon pletely equivalent to the iterative approach. In future, architectures [7–9] and a super-conducting phase qubit once the capability of performing real-time conditionals is added, a further reduction in resources will be possible arXiv:2103.13855v2 [quant-ph] 18 Aug 2021 system [10]. In 2012, a proof-of-concept demonstration of the order-finding algorithm for the integer 21 was for our implementation, potentially improving the qual- carried out with photonic qubits using, in addition to ity of the results even more and opening up the possibility the aforementioned compilation technique, an iterative of factoring larger integers. scheme [11], where the control register is reduced to one As it stands, the controlled-NOT (CX) gate count of qubit and this qubit is reset and reused [12, 13]. How- the standard algorithm [19] exceeds 40 and in prelim- ever, factoring was not possible in this demonstration inary tests we have found that the output probability due to the low number of iterations. Later, the iterative distribution is indistinguishable from a uniform proba- scheme was demonstrated for factoring 15, 21 and 35 bility distribution (noise) on the IBM quantum proces- on an IBM quantum processor by splitting up the itera- sors. Our improved version reduces the CX gate count through the use of relative phase Toffoli gates, reducing the CX gate count by half while leaving the overall operation of the circuit unchanged and we suspect this ∗ [email protected] technique may extend beyond the case considered here. 2

We have gone further than the work in Ref. [11], where full factorization of 21 was not achieved as with only two bits of accuracy for the peaks of the output probability distribution continued fractions would fail to extract the correct order. On the other hand, in the work in Ref. [14], where 21 was factored on an IBM processor, a larger number of 6 qubits was required and the iterations FIG. 1. Schematic of the routine used for the period finding were split into three separate circuits, with the need to part of Shor’s algorithm. The first (control) register has n re-initialise the work register into specific quantum states qubits. The number of qubits in the control register deter- for each iteration. Our approach is thus more efficient mines the bit-accuracy of the value of 2ns/r. The bottom and compact, enabling algorithm outcomes with reduced (work) register has the m qubits required to encode N. First, noise. To support our claims, we successfully carry out the control and work registers are initialized, then conditional continued fractions and evaluate the performance of the modular exponentiation is performed, indicated by the con- algorithm by (i) quantitatively comparing the measured trolled unitary and an inverse quantum Fourier transform is applied to the control register followed by a standard com- probability distribution with the ideal distribution and putational basis measurement. The circuit is essentially the noise via the Kolmogorov distance, (ii) performing state QPE algorithm applied to the unitary matrix Uâ – see text tomography experiments on the control register, and (iii) for details. verifying the presence of entanglement across both registers. The paper is organized as follows. In Sec.II, we give which is equivalent to the order-finding problem: for an a brief review of the order-finding problem and its re- L-bit positive odd integer N = pq and randomly chosen lation to Shor’s algorithm. In Sec.III we expound on positive integer a ≤ N co-prime to N, the order r of the compiled version of Shor’s algorithm, where we con- a and N can be used to find the non-trivial factors of sider the specific case of the factorization of N = 21. We N. The algorithm is probabilistically guaranteed, with construct the quantum circuits that realize the required probability greater than a half that the greatest common modular exponentiation unitaries and proceed to opti- divisor gcd(ar/2 ± 1,N) gives the prime factors of N [2]. mize their CX gate count through the introduction of Shor’s algorithm uses two quantum registers; a control relative phase Toffoli gates. We report our results from register and a work register. The control register con- executing our compact construction of the algorithm on tains n qubits, each for one bit of precision in the algo- IBM’s quantum computers in Sec.IV. Finally, we provide rithmic output. The work register contains m = dlog2 Ne concluding remarks of our study in Sec.V. An appendix qubits where m is the number of qubits to encode N. The is also included. measurement of the control register outputs a probability distribution peaked at approximately the values of 2ns/r, where s is associated with the outcome of the measure- II. BACKGROUND ment and thus randomly assigned. The details of how the peaked probability distribution comes about are given in A. Order finding the order-finding routine outlined below. One can deter- mine the order r from the peak values of the distribution The order-finding problem is typically stated as fol- using continued fractions, with a number of operations lows. Given positive integers N and a ∈ {0, 1,...,N −1} that scales polynomially in dlog2 Ne. The procedure, or that share no common factors, we seek to find the least routine, for order finding is summarized below. positive integer r ∈ {0, 1,...N} such that ar mod N = 1. The integer r is said to be the order of a and N, and the order-finding problem is that of finding r for a Order-finding routine particular a and N. There exists no classical algorithm that can solve the order-finding problem efficiently, that 1. Initialization ⊗n ⊗m is, with operations (elementary gates) that scale polyno- Prepare |0i |0i and apply H⊗n on the control mially in the number of bits needed to specify N, i.e. register and X on the mth qubit in the work register to create a superposition of 2n states in the control L ≡ dlog2 Ne [2,3]. register and |1i in the work register:

2n−1 B. Shor’s algorithm ⊗n ⊗m 1 X |0i |0i → |xi |1i . 2n/2 x=0 The order-ﬁnding problem can be eﬃciently solved on a quantum computer with O(L3) operations; the cost being mostly due to the modular exponentiation operation 2. Modular exponentiation function (MEF) which requires O(L3) quantum gates [2]. The problem Conditionally apply the unitary operation Uˆ that of prime factorization is the subject of Shor’s algorithm, implements the modular exponentiation function 3

ax modN on the work register whenever the control needed to successfully carry out the continued fractions register is in state |xi: part of the algorithm. Such an overhead obviously puts a full-scale implementation beyond the reach of current de- 2n−1 2n−1 1 X 1 X vices. However, compilation techniques such as the one |xi |1i → |xi |ax mod Ni 2n/2 2n/2 described in Ref. [20], bridge this gap and allow for small- x=0 x=0 scale proof-of-concept demonstrations, where the quan- r−1 2n−1 tum circuit is tailored around properties of the number to 1 X X 2πisx/r = √ e |xi |usi . be factored. This significantly simplifies the controlled- r2n s=0 x=0 operations that realize the MEF operation (see previous section), which is the most resource-intensive part of the order-finding routine. The resource demands of the com-

In the second line, |usi is the eigenstate of Uˆ : piled quantum circuit are significantly reduced, making r−1 it suitable for quantum devices with low connectivity. X ˆ 2πis/r √1 U |usi = e |usi and r |usi = |1i has been From Ref. [11], we extend the compiled quantum order- s=0 finding routine for the particular case of factoring N = 21 used for the work register. The MEF operation is with a = 4 to accommodate another iteration for better ˆ x equivalent to applying U to the work register when precision in the resolution of the peaks for the value of the state |xi is in the control register, as shown in 2ns/r. For the case of N = 21, other choices of a give ˆ Fig.1, with U |yi = |ay mod Ni for a given state 2, 4 or 6 for r. The cases for r = 2 or r = 4 have been |yi (the subscript a in Uˆ is suppressed for notational demonstrated for N = 15 [4,7–10] and would bear a convenience). This provides an alternative way to similar circuit structure in the present case. With only write the output state and allows a connection be- three iterations, r = 6 would be out of reach as continued tween the MEF operation and the QPE algorithm fractions would fail. For a = 4 we have r = 3, which is for the unitary operation Uˆ. a choice that does not suffer from the aforementioned reasons. Despite r being an odd integer, the algorithm is 3. Inverse Quantum Fourier Transform (QFT) successful in finding it from a = 4. This is the case for Apply the inverse quantum Fourier transform on certain choices of perfect square a and odd r, and a = 4 the control register: and r = 3 is such a case [11].

r−1 2n−1 r−1 In contrast to Ref. [11], our implementation is not iter- 1 X X 2πisx/r 1 X √ e |xi |usi → √ |ϕsi |usi . ative and uses three qubits for the control register rather n r r2 s=0 x=0 s=0 than one qubit recycled on every iteration. The iterative version is based on the recursive phase estimation, 4. Measurements made possible by the use of the semi-classical QFT [12]. Measure the control register in the computational However, we have used the traditional QFT because mid- basis, yielding peaks in the probability for states circuit measurements with real-time conditionals are not n possible yet on IBM’s quantum processors. The tradi- where ϕs ' 2 s/r due to the inverse QFT. Thus, the outcome of the algorithm is probabilistic, how- tional QFT for 3 qubits (see [2] - Box 5.1) that we imple- ever, there is a high probability of obtaining the mented is equivalent to Fig. 1A and Fig. 1B in Ref. [13]. The latter is the semi-classical QFT that makes possi- location of the ϕs peaks after only a few runs. The n ble the implementation of the iterative version of Shor. accuracy of ϕs to 2 s/r is determined by the number of qubits in the control register. If mid-circuit measurements with real-time conditionals were possible, the 3-qubit semi-classical QFT would be 5. Continued fractions possible and may improve the quality of the results we n Apply continued fractions to ϕ = ϕs/2 (the ap- present here through the use of only 1 qubit for the con- proximation of s/r) to extract out r from the control register, as in Ref. [11]. IBM has suggested that the vergents (see Appendix G for details). behaviour of real-time conditionals can be reproduced through post selection of the mid-circuit measurements. However, in the present case the speed up gained would III. COMPILED SHOR’S ALGORITHM be lost using this post selection method (see Appendix A). A full-scale implementation of Shor’s algorithm to fac- In Ref. [11], a step that is unique among the com- tor an L-bit number would require a quantum circuit pilation steps of previous demonstrations, and central with 72L3 quantum gates acting on 5L + 1 qubits for the to their demonstration is mapping the three levels |1i, order-ﬁnding routine [20], i.e. factoring N = 21 would re- |4i and |16i accessed by the possible 2L = 25 levels of quire 9000 elementary quantum gates acting on 26 qubits. the work-register to only a single qutrit system. In our The overhead in quantum gates comes from the modu- demonstration we also use this step, however IBM pro- lar exponentiation function part of the algorithm, while cessors consist of qubits and so we represent the work the overhead in qubits comes from the level of accuracy register by 3 basis states from a two-qubit system and 4 discard the fourth basis state as a null state. The states |1i 7→ |4i and |4i 7→ |16i.A CX gate controlled by |c1i encoding the three possible levels of the work register; targeting |q1i followed by a Fredkin gate, swapping |q0i 2 |1i, |4i and |16i are mapped to |q0q1i according to and |q1i realizes this simpliﬁed Uˆ .

|1i 7→ |log4 1i = |00i , |c1i • • • (4) |4i 7→ |log4 4i = |01i , |q0i = • |16i 7→ |log 16i = |10i . (1) 1 4 U 2 |q i • • Therefore instead of evaluating 4x mod 21 in the work 1 register as described in step 2 of Sec.IIB, the com- In the above, the Fredkin gate has been decomposed into piled version of Shor’s algorithm effectively evaluates a Toffoli gate (CCX) and two CX gates. The subsequent x 3 log4[4 mod 21] in its place for x = 0, 1 ... 2 − 1 [20], implementation of Uˆ 4 admits no simplifications as all the which reduces the size of the work register to 2 qubits in possible states in the work register may have non-zero comparison to the 5 qubits required in the standard con- amplitude at this point. This operation is implemented struction. Note the ordering of quantum bits in the work with a Toffoli and a Fredkin gate with single-qubit X register is |qi = |q0i |q1i, where the rightmost qubit is gates. associated with the least significant bit. Similarly, with the control register we have |ci = |c i |c i |c i. In to- 0 1 2 |c0i • • • (5) tal the algorithm requires 5 qubits: 3 for the control register and 2 for the work register. Implementing the |q0i = • ˆ x 2 controlled unitaries U that perform the modular expo- U 2 ˆ x x nentiation |xi |yi → |xi U |yi = |xi |a y mod Ni reduces |q1i X • X • • to effectively swapping around the states |1i, |4i and |16i in the work register controlled by the correspond- The full circuit diagram is shown in Fig.2 – note that ing bit of the integer x in the control register, which 0 1 2 before simplification the order of application of the con- is given by x = c22 + c12 + c02 . In other words, 2(n−1) 21 2 1 0 trolled unitaries is interchangeable, Uˆ or Uˆ could ˆ x ˆ c02 ˆ c12 ˆ c22 U = U U U . Thus, depending on the control be applied first. Interchanging the order only has the ef- qubit ci, one of the following maps is applied: fect of interchanging the order of the outcome bits at the Uˆ 1 : {|1i 7→ |4i , |4i 7→ |16i , |16i 7→ |1i}, end of the computation. This is the reason the order of application of the controlled unitaries here is in reverse Uˆ 2 : {|1i 7→ |16i , |4i 7→ |1i , |16i 7→ |4i}, order to that in Ref. [11]. Uˆ 4 : {|1i 7→ |4i , |4i 7→ |16i , |16i 7→ |1i}. (2)

The next simplification step comes from the fact that B. Modular exponentiation with relative phase these operations on the work register need not be con- Toffolis trolled SW AP (Fredkin) gates, they can be as simple as CX gates, as we show next. In total, the modular exponentiation routine requires three Toffoli gates; traditionally a single Toffoli gate can be decomposed into six CX gates and several single-qubit A. Modular exponentiation gates [2] as follows

Implementing Uˆ 1 on the two-qubit work register is simpliﬁed considerably by noting that the states |4i and • • • • T • |16i initially have zero amplitude, and thus the operation |1i 7→ |4i alone is suﬃcient. This operation can realized • = • • T T † with a CX gate controlled by |c2i targeting the second work qubit |q1i. H T † T T † T H (6) |c2i • • (3)

|q0i = Taking into account a given processor’s topology and 0 U 2 the constraints it poses, as well as other parts of the cir- |q1i cuit (the inverse QFT), further increases the tally of CX gates. This becomes undesirable as it is understood that Similarly, the implementation of Uˆ 2 can be simpliﬁed there is an upper limit on the number of CX gates that by noting that the states |1i and |4i are the only non- can be in a circuit with the guarantee of a successful com- zero amplitude states in the work register after Uˆ 1 may putation. The number of CX gates from the decompo- have been applied, thus prompting us to only consider sition of the Toﬀoli gate can be cut in half if we permit 5

FIG. 2. Compiled quantum order-ﬁnding routine for N = 21 and a = 4. This circuit uses ﬁve qubits in total; 3 for the control register and 2 for the work register. The above circuit determines 2ns/r to three bits of accuracy, from which the order can be π π extracted. Here, up to a global phase, S = Rz( 2 ) and T = Rz( 4 ) are phase and π/8 gates, respectively.

FIG. 3. Approximate compiled quantum order-finding routine implemented with Margolus gates in place of Toffoli gates in the construction in Fig.2. the operation to be correct up to relative phase shifts. never encounter the basis state |101i, thus leaving the Margolus constructed a gate that implements the Toffoli operation of the circuit unchanged. See Appendix B for gate up to a relative phase shift of |101i 7→ − |101i that details. This further compacting reduces the number of only uses three CX gates and four single qubit gates [21]. CX gates considerably and puts the algorithm within This construction has been shown to be optimal [22]. reach of current IBM processors with a limited number of noisy qubits. • • (7) IV. EXPERIMENTS • = • •

π π π π A. Physical qubit mapping + 4 + 4 − 4 − 4 Ry Ry Ry Ry The proposed compiled circuit in Fig.3 was mapped The advantages of relative phase Toffoli gates extend onto 5 physical qubits (3 control qubits and 2 work beyond the commonly conceived scenarios, i.e. when the qubits) and executed on a sub-processor of IBM’s 7- gate is applied last or when the relative phase shifts qubit quantum processor ibmq casablanca and 27- do not matter for certain configurations of multiply- qubit quantum processor ibmq toronto, which we will controlled Toffoli gates. Maslov reported circuit iden- refer to as 7Q and 27Q, and whose topologies are shown tities that permit the replacement of Toffoli gates with in Fig.4. When mapping the compiled circuit a few their relative phase variants in certain configurations, re- considerations can be taken into account. First, as can sulting in no overall change to the functionality in any be seen from Eq. (7), the Margolus gate can be imple- significant way [23]. The configuration in the circuit mented on a collinear set of qubits, as the first control shown in Fig.2 is one such configuration that permits qubit need not be connected to the second control qubit. a replacement of Toffoli gates with Margolus gates with- On the other hand, mapping the three-qubit inverse QFT out changing the overall functionality. All the Margo- onto physical qubits without incurring additional SW AP lus gates in the circuit in Fig.3 (which is the circuit in gates is not possible, as the three controlled-phase gates Fig.2 with the Toffoli gates replaced by Margolus gates) require all three qubits to be interconnected in a trian- 6

FIG. 4. Qubit topology of IBM Q experience processors. gle and the aforementioned quantum processors do not been applied to results and mitigates the effect of mea- have such a topology. Additionally, more SW AP gates surement errors on the raw results (see Appendix C). The are introduced to the transpiled circuit, as the processor outcomes |101i and |110i occur with probability ∼ 14% topologies do not permit the topology required by the and ∼ 16% respectively. The theoretical ideal probabil- compiled circuit, as shown in Fig.5. ity is ∼ 25%, as can be seen from the simulator results The only possible five-qubit mappings on the quantum in Fig.7. However, the amplification of the peaks |000i, processors are all isomorphic to either a collinear set of |101i and |110i is clearly visible from the processor out- qubits or a T-shaped set of qubits, as shown in Fig.6a comes. and b. Choosing the mapping in Fig.6 b over the one We quantify the successful performance of the algo- in Fig.6 a is motivated by the fact that the former is rithm by comparing the experimental and ideal proba- slightly more connected than latter and thus in effect bility distributions via the trace distance or Kolmogorov would reduce the number of SW AP gates in the mapped distance [2], which measures the closeness of two discrete and transpiled circuit. probability distributions P and Q and is defined by the P equation D(P,Q) ≡ x∈X |P (x) − Q(x)|/2, where X represents all possible outcomes. This measure shows an B. Performance agreement between measured and ideal results – the trace distance between the measured distribution and the ideal To evaluate the performance of the algorithm, we first distribution is 0.1694 and 0.1784 for ibmq toronto and transpiled the circuit in Fig.3 down to the chosen quan- ibmq casablanca, respectively. On the other hand, the tum processor with the mapping below trace distance between the ideal distribution and a can- didate random uniform distribution is 0.4347. Further-

0 7→ c0, more, we evaluate the performance of the algorithm by characterizing the measured output state in the control 1 7→ c , 2 register, this is achieved via state tomography yielding 4 7→ c1, the density matrix of the measured state. The measured 2 7→ q1, state and ideal state on the output register are quantita- 3 7→ q . (8) tively compared using the fidelity for two quantum states 0 p ρ and σ, and is defined to be F (ρ, σ) ≡ tr ρ1/2σρ1/2 [2]. Through the transpiler’s optimization, with the mapping We measured a fidelity of F (ρid, ρ27Q) = 0.6948 ± 00650 above it is possible to have a circuit that has 25 CX and F (ρid, ρ7Q) = 0.70 ± 0.0275 on the 27 qubit and gates and a circuit depth of 35. Fig.7 shows the results 7 qubit quantum processors respectively, as shown in of measurements on the control register qubits from the Fig.8. In Fig.9 we show the estimated density matrices two processors, where measurement error mitigation has in the computational basis for each respective device.

C. Factoring N = 21

The measured probability distributions in Fig.7 are peaked in probability for the outcomes 000 (ϕs = 0), 101 (ϕs = 5) and 110 (ϕs = 6), with ideal probabilities of 0.35, 0.25 and 0.25, respectively. Here we are using the integer representation of the binary outcomes. The out- FIG. 5. Qubit connections required by the compiled circuit come 000 corresponds to a failure of the algorithm [11]. in Fig.3. For the outcome 101, computing the continued fraction 7

FIG. 6. The two possible 5-qubit processor mappings on the architectures shown in Fig.4.

n 3 expansion of ϕ = ϕs/2 = 5/2 = 5/8 gives the convergents {0, 1, 1/2, 2/3, 5/8} (see Appendix G for details), so that the third convergent 2/3 in the expansion can be identified as s/r and correctly gives r = 3 as the order when tested with the relation armod N = 1, while the other convergents do not give an r that passes the test. On the other hand, the continued fraction expansion of ϕ = 6/8 gives {0, 1, 3/4} and incorrectly gives r = 4 as the order (see Appendix G for details). This failure can be avoided in principle by adding further qubits to the control register so that the peak in the probability distribution becomes narrower and more well defined [11]. Another option is to simply apply continued fractions to all peaked outcomes and test if the value of r found satisfies the order relation for a and N. It is interesting to note that from the results of Ref. [11], successfully finding the order r = 3 was not possible to achieve, as with only two bits of accuracy in the exper- FIG. 8. Boxplot of a sample (ν = 50) of state fidelities from iment the continued fractions would always fail due to the respective two devices showing the spread of the values the peaked outcomes of 10 (2) and 11 (3) giving the con- around the sample mean and 95% confidence intervals. vergents of {0, 1/2} and {0, 1, 3/4}, respectively. In our

case, we successfully ﬁnd r = 3, from which we obtain gcd(ar/2 ± 1,N) = gcd(8 ± 1, 21) = 3 and 7. Thus, with our demonstration, extending the number of outcome bits to three has allowed us to fully perform the quantum factoring of N = 21.

D. Veriﬁcation of entanglement

The presence of entanglement between the control and work registers is known to be a requirement for the algorithm to gain any advantageous speedup over its classical counterparts in general [6, 24, 25]. For detecting genuine multipartite entanglement around the vicinity of an ideal state |ψi, one can construct a projector-based witness such as the one below: ˆ Wψ = αI − |ψihψ| , (9) FIG. 7. Results of the complete quantum order-ﬁnding rou- where α is the square of the maximum overlap be- tine for N = 21 and a = 4. On each processor, the circuit tween |ψi and all biseparable states. In other words, was executed 8192 × 100 times with measurement error mitigation. The error bars represent 95% conﬁdence intervals tr Wˆ ψρ ≥ 0 for biseparable states and tr Wˆ ψρ < 0 around the mean value of each histogram bin (see Appendix for states with genuine multipartite entanglement in the D). The simulator probabilities show the ideal case. vicinity of |ψi [26]. For the ideal state after mod- 8

FIG. 9. Ideal and measured density matrices after the inverse QFT, estimated via a maximum-likelihood reconstruction from measurement results in the Pauli-basis. (a) The ideal state |ΨihΨ| (only the real parts are shown, imaginary parts are less than 0.04). (b) A matrix plot of the real part of |ΨihΨ|. (c) A matrix plot of the imaginary part of |ΨihΨ|. These plots are compared with the measured states ρ27Q and ρ7Q in panels (d) and (g), and the corresponding matrix plot of their real parts in panels (e) and (h), and imaginary parts in panels (f) and (i), respectively. We observe there is a resemblance between the ideal state and the measured states, but noise in both real and imaginary parts is notable. ular exponentiation (but before the inverse QFT) in (see Appendix E for details). We can do the same both the control and work registers, α = 0.75 was for each term in the set of terms from the Pauli de- found using the method described in the appendix of composition of ρ, calling it Sd, forming a set of other Ref. [26]. This was implemented using the software pack- Pauli terms that can be derived from the same counts. age QUBIT4MATLAB [27]. Therefore ideally the state Taking the union of these sets to be Su, the comple- in both registers after modular exponentiation has gen- ment Sd\Su gives the 79 terms we only need to mea- uine multipartite entanglement. sure (see Appendix E). We measure the 79 Pauli expec- In order to check whether the output state from the tation values of the terms above with respect to the state IBM processors is close to the ideal state and has genuine in both registers after modular exponentiation and from multipartite entanglement, full state tomography would this we compute/derive the 293 terms in Sd and there- normally be needed to characterize the state ρexp in both fore tr(|ΨihΨ| ρexp). The measured probabilities for each the control and work registers. This would require 35 term, some of them shown in Fig. 10, result in an ex- measurements, making it impractical to gather a suﬃ- pectation value of tr(|ΨihΨ| ρ7Q) = 0.677 ± 0.00365 and ciently large data set within a meaningful time frame. tr(|ΨihΨ| ρ27Q) = 0.626 ± 0.00304, which leads to However, we need not measure the full density matrix, the quantity tr(|ΨihΨ| ρexp) suﬃces. To measure this, we tr Wˆ ρ = 0.0729 ± 0.00365, can decompose ρ = |ΨihΨ| into 293 Pauli expectations as Ψ 7Q X (1) (2) (3) (4) (5) tr Wˆ Ψρ27Q = 0.124 ± 0.00304. (11) |ΨihΨ| = pijklmσi σj σk σl σm , (10) ijklm where σi = {I,X,Y,Z} are the usual Pauli matrices The results obviously fail to detect genuine multipar- plus the identity. However, the number of measure- tite entanglement, however, this does not mean entan- ments needed to obtain all 293 expectation values can glement is entirely absent. Consider the square of the be reduced [28]. This is because the measured proba- maximum overlap between the ideal state |Ψi and all bilities from a measurement of a single Pauli expecta- pure states |θi that are unentangled product states with tion value, i.e. hZZZZZi, can be summed in various respect to some bipartite partition (bipartition) B of the combinations to derive other Pauli expectations values, qubits, i.e. hZIZZZi , hIZZZZi, etc. The values derived are nothing but the marginalization of the measured prob- 2 max|hθ|Ψi| = βΨ. (12) abilities over the outcome space of some set of qubits θ∈B 9

FIG. 10. A subset of 9 of the 79 measurement settings required for each term in: (a) tr(|ΨihΨ| ρ7Q) and (b) tr(|ΨihΨ| ρ27Q). The x-axis from left to right shows the labels from p00000 to p11111.

Thus, any other state |ξi for which V. CONCLUDING REMARKS

|hξ|Ψi|2 > β (13) Ψ In summary, we have implemented a compiled version cannot be a product state with respect to the bipartition of Shor’s algorithm on IBM’s quantum processors for the B, implying that there is non-separability, or entangle- prime factorization of 21. By using relative phase shift ment, across this bipartition. The above result extends Toffoli gates, we were able to reduce the resource demands that would have been required in the standard to mixed states ρξ due to the convex sum nature of mixed quantum states [27]. We compute Eq. (12) for all pos- compiled and non-iterative construction of Shor’s algo- sible bipartitions of our ideal state |Ψi (see Appendix F rithm (with regular Toffoli gates), and still preserve its for more details). functional correctness. The use of relative phase shift Toffoli gates has also allowed us to extend the imple- For the experimental state ρ7Q we find, with the ex- mentation in Ref. [11] to an increased resolution. More- ception of the bipartition B = (c0c1c2q1)(q0), that it is non-separable with respect to all other bipartitions, i.e. over, while the latter implementation used only 1 recycled qubit for the control register, in contrast to our 3 the square of the overlap between ρ7Q and |Ψi (∼ 0.677) is greater than the maximal square overlap between |Ψi qubits, it falls one iteration short of achieving full fac- and all product states in each of these bipartitions. Sim- toring for the reasons already mentioned. It is not clear what additional resource overheads (single and two-qubit ilarly for ρ27Q, with the exception of bipartitions B = gates) would be needed in implementing another iteration (c0c1c2q1)(q0) and B = (c0c1c2q0)(q1), the state is non- separable with respect to all other bipartitions. Most no- in their scheme and it is likely that these overheads are what prevented the full factoring of 21 in the photonic tably, both ρ7Q and ρ27Q are non-separable with respect setup used. Furthermore, we note that in principle there to the bipartition B = (c0c1c2)(q0q1), which is a bipartition between the control and work registers. This implies is no real advantage in using 3 qubits for the control reg- that non-separability or entanglement is present between ister as we have done here instead of 1 qubit recycled, the registers, as required for the algorithm’s speedup in as in Ref. [11]. However, in practice it is not possible at general [6, 24, 25]. Furthermore, the maximum (not nec- present to recycle qubits on the IBM processors and so essarily global but a good proxy of it) expectation value we used 3 qubits instead. In future, once this capability of the operator |ΨihΨ| for product states, is found via a is added, a further reduction in resources will be pos- greedy search algorithm [27] to be around 0.30, further sible for our implementation, potentially improving the asserting that indeed the qubits are entangled with each quality of the results even more. other in some way. We have verified, via state tomography, the output 10 state in the control register for the algorithm, achiev- shortcomings of noisy quantum processors. In addition, ing a fidelity of around 0.70. For the verification of en- we suspect that we can further reduce the resource count tanglement generated during the algorithm’s operation, through the use of the approximate QFT [35], while still the resource demands of state tomography were circum- maintaining a clear resolution of the peaks in the out- vented by measuring a much reduced number of Pauli put probability distribution. A possible avenue of fu- measurements to uniquely identify a quantum state [28]. ture research derived from what we have reported here However, this method is quite specialized and cannot be is the investigation and identification of scenarios where easily generalized to larger systems. In scaling up Shor’s one can replace Toffoli gates with relative phase Toffoli algorithm to higher integers beyond 21 using larger quan- gates while preserving the functional correctness, in a tum systems, other methods of quantum tomography wide range of algorithms including Shor’s algorithm, as can be used to characterize the performance. These in- seen here. In the present case, whether such an approach clude compressed sensing [29] and classical shadows [30], is special to the case of N = 21 or extendable to other N which give theoretical guarantees, and improved scaling is not known. Ref. [23] has performed some work in this in the number of Pauli measurements and classical post- regard, however a proper analysis and systematic com- processing than standard methods. In the case when position of relative phase Toffoli gates for such purposes the state belongs to a class of states with certain sym- is still an open problem. In future, a similar approach metries, such as stabilizer states, only a few measure- may make possible the factorization of larger numbers ments are required for measuring the fidelity and de- with adequate accuracy in resolution of the algorithm’s tecting multipartite entanglement [31]. However, not outcomes and their characterization. all entangled states are neatly housed within these well- studied classes. Ref. [32] introduces a device-independent method for multipartite entanglement detection which ACKNOWLEDGMENTS scales polynomially with the system size by relaxing some constraints. Another scheme constructs witnesses that require a constant number of measurements of the sys- We acknowledge the use of IBM Quantum services for tem size at the cost of robustness against white noise. this work. The views expressed are those of the authors, This provides a fast and simple procedure for entangle- and do not reflect the official policy or position of IBM ment detection [33]. Many fundamental questions on the or the IBM Quantum team. We thank Taariq Surtee subjects of quantum tomography and multipartite entan- and Barry Dwolatzky at the University of the Witwa- glement still remain to be answered [34] and advances tersrand and Ismail Akhalwaya at IBM Research Africa will help in efficiently quantifying the performance of al- for access to the IBM processors through the Q Net- gorithms in larger quantum processors. work and African Research Universities Alliance. This Our demonstration involves a two-fold reduction of the research was supported by the South African National resource count from the full circuit in Fig.2 via the Research Foundation, the South African Council for Sci- replacement of regular Toffoli gates with relative phase entific and Industrial Research, and the South African variants, which is an approach that is in the spirit of the Research Chair Initiative of the Department of Science NISQ era; tailoring quantum circuits to circumvent the and Technology and National Research Foundation.

[1] Shor, P. W. Polynomial-time algorithms for prime factoring algorithm using photonic qubits. Phys. Rev. Lett. 99, ization and discrete logarithms on a quantum computer. 250504 (2007). SIAM Journal on Computing 26, 1484-1509 (1997). [8] Lanyon, B. P. et al. Experimental demonstration of a [2] Nielsen, M. A. & Chuang, I. L. Quantum Computa- compiled version of Shor’s algorithm with quantum en- tion and Quantum Information: 10th Anniversary Edi- tanglement. Phys. Rev. Lett. 99, 250505 (2007). tion (Cambridge University Press, USA, 2011), 10th edn. [9] Politi, A., Matthews, J. C. F. & O’Brien, J. L. Shor’s [3] de Wolf, R. Quantum computing: Lecture notes (2019). quantum factoring algorithm on a photonic chip. Science 1907.09415. 325, 1221-1221 (2009). [4] Vandersypen, L. M. K. et al. Experimental realization of [10] Lucero, E. et al. Computing prime factors with a joseph- Shor’s quantum factoring algorithm using nuclear mag- son phase qubit quantum processor. Nature Physics 8, netic resonance. Nature 414, 883-887 (2001). 719-723 (2012). [5] Peng, X. et al. A quantum adiabatic algorithm for fac- [11] Mart´ın-López, E. et al. Experimental realization of torization and its experimental implementation. Phys. Shor’s quantum factoring algorithm using qubit recy- Rev. Lett. 101, 220405 (2008). cling. Nature Photonics 6, 773-776 (2012). [6] Vidal, G. Efficient classical simulation of slightly entan- [12] Griffiths, R. B. & Niu, C.-S. Semiclassical Fourier trans- gled quantum computations. Phys. Rev. Lett. 91, 147902 form for quantum computation. Phys. Rev. Lett. 76, (2003). 3228-3231 (1996). [7] Lu, C.-Y., Browne, D. E., Yang, T. & Pan, J.-W. Demon- [13] Chiaverini, J. et al. Implementation of the semiclassical stration of a compiled version of Shor’s quantum factor- quantum Fourier transform in a scalable system. Science 11

308, 997-1000 (2005). [26] Bourennane, M. et al. Experimental detection of multi- [14] Amico, M., Saleem, Z. H. & Kumph, M. An experimental partite entanglement using witness operators. Phys. Rev. study of Shor’s factoring algorithm on ibm q. Phys. Rev. Lett. 92, 087902 (2004). A 100, 012305 (2019). [27] Tóth,G. Qubit4matlab v3.0: A program package for [15] Pal, S., Moitra, S., Anjusha, V. S., Kumar, A. & Ma- quantum information science and quantum optics for hesh, T. S. Hybrid scheme for factorisation: Factoring matlab. Comput. Phys. Commun. 179, 430-437 (2008). 551 using a 3-qubit NMR quantum adiabatic processor. [28] Ma, X. et al. Pure-state tomography with the expectation Pramana 92, 26 (2019). value of pauli operators. Phys. Rev. A 93, 032140 (2016). [16] Xu, N. et al. Quantum factorization of 143 on a dipolar- [29] Gross, D., Liu, Y.-K., Flammia, S. T., Becker, S. & Eis- coupling nuclear magnetic resonance system. Phys. Rev. ert, J. Quantum state tomography via compressed sens- Lett. 108, 130501 (2012). ing. Phys. Rev. Lett. 105, 150401 (2010). [17] Saxena, A., Shukla, A. & Pathak, A. (2020). 2009.05840. [30] Huang, H.-Y., Kueng, R. & Preskill, J. Predicting many [18] Parker, S. & Plenio, M. B. Efficient factorization with properties of a quantum system from very few measure- a single pure qubit and logN mixed qubits. Phys. Rev. ments. Nature Physics 16, 1050-1057 (2020). Lett. 85, 3049-3052 (2000). [31] Tóth,G. & Gühne,O. Entanglement detection in the [19] Beauregard, S. Circuit for Shor’s algorithm using 2n+3 stabilizer formalism. Phys. Rev. A 72, 022340 (2005). qubits. Quantum Info. Comput. 3, 175-185 (2003). [32] Baccari, F., Cavalcanti, D., Wittek, P. & Ac´ın,A. Effi- [20] Beckman, D., Chari, A. N., Devabhaktuni, S. & Preskill, cient device-independent entanglement detection for mul- J. Efficient networks for quantum factoring. Phys. Rev. tipartite systems. Phys. Rev. X 7, 021042 (2017). A 54, 1034-1063 (1996). [33] Knips, L., Schwemmer, C., Klein, N., Wie´sniak,M. & [21] Margolus, N. Simple quantum gates. Unpublished Weinfurter, H. Multipartite entanglement detection with manuscript (circa 1994) (1994). minimal effort. Phys. Rev. Lett. 117, 210504 (2016). [22] Song, G. & Klappenecker, A. Optimal realizations of [34] Banaszek, K., Cramer, M. & Gross, D. Focus on quantum simplified Toffoli gates. Quant. Inf. Comp. 4, 361-372 tomography. New Journal of Physics 15, 125020 (2013). (2004). [35] Barenco, A., Ekert, A., Suominen, K.-A. & Törmä,P. [23] Maslov, D. Advantages of using relative-phase Toffoli Approximate quantum Fourier transform and decoher- gates with an application to multiple control Toffoli op- ence. Phys. Rev. A 54, 139–146 (1996). timization. Phys. Rev. A 93, 022311 (2016). [36] HéctorAbraham et al. Qiskit: An Open-source Frame- [24] Braunstein, S. L. et al. Separability of very noisy mixed work for Quantum Computing. Available at 10.5281/zen- states and implications for NMR quantum computing. odo.2562110 (2019) Phys. Rev. Lett. 83, 1054-1057 (1999). [37] Smolin, J. A., Gambetta, J. M. & Smith, G. Efficient [25] Jozsa, R. & Linden, N. On the role of entanglement Method for Computing the Maximum-Likelihood Quan- in quantum-computational speed-up. Proc. Roy. Soc. A tum State from Measurements with Additive Gaussian 459, 2011-2032 (2003). Noise. Phys. Rev. Lett. 108, 070502 (2012). [38] Qiskit Learn quantum computing using Qiskit. Available at https://qiskit.org/textbook/ch-quantum- hardware/measurement-error-mitigation.html 12

Appendix A: Postselection scaling

In order to do mid-circuit measurements and post select the outcomes, we need to know the basis to measure in for each of the qubits. Thus, one would need to measure qubit 1 (or the ﬁrst iteration in the recycling case) in the {|+i , |−i} basis, then qubit 2 (or the second iteration in the recycling case) in either the {S |+i ,S |−i} or {|+i , |−i} basis, then qubit 3 in either the {TS |+i ,TS |−i}, {S |+i ,S |−i} or {|+i , |−i} basis. Thus, the number of measurements needed scales as n!, which grows faster than an exponential with constant base, e.g. 2n. So in general the speed up gained would be lost for general factoring using a post selection method, i.e. factoring numbers larger than 21.

Appendix B: Eﬀect of relative phase Toﬀolis

Below we show the compiled circuit for the period-finding routine and label specific instances during the evolution of the computation. The aim is to show the invariance of the computation when replacing Toffoli gates with relative phase Toffoli gates that use fewer resources.

|c0i H • •

† |c1i H • • QF T

|c2i H •

|q0i • •

|q1i • • X • X • • ⇑ ⇑ ⇑ ⇑ |Ψ0i |Ψ1i |Ψ2i |Ψ3i

FIG. 11. States in both registers at various points during the execution of the circuit.

The states are various points of the evolution are given explicitly as

|Ψ i = |+i |+i |+i |0i |0i , 0 c0 c1 c2 q0 q1

|Ψ i = |+i (|0i |0i |0i |0i + |0i |1i |0i |1i + 1 c0 c1 c2 q0 q1 c1 c2 q0 q1 |1i |0i |0i |1i + |1i |1i |0i |0i ), c1 c2 q0 q1 c1 c2 q0 q1

|Ψ i = |0i |0i |0i |0i |0i + |0i |0i |1i |0i |1i + 2 c0 c1 c2 q0 q1 c0 c1 c2 q0 q1 |0i |1i |0i |1i |0i + |0i |1i |1i |0i |0i + c0 c1 c2 q0 q1 c0 c1 c2 q0 q1 |1i |0i |0i |0i |0i + |1i |0i |1i |0i |1i + c0 c1 c2 q0 q1 c0 c1 c2 q0 q1 |1i |1i |0i |1i |0i + |1i |1i |1i |0i |0i , c0 c1 c2 q0 q1 c0 c1 c2 q0 q1

|Ψ i = |0i |0i |0i |0i |0i + |0i |0i |1i |0i |1i + 3 c0 c1 c2 q0 q1 c0 c1 c2 q0 q1 |0i |1i |0i |1i |0i + |0i |1i |1i |0i |0i + c0 c1 c2 q0 q1 c0 c1 c2 q0 q1 |1i |0i |0i |1i |0i + |1i |0i |1i |0i |1i + c0 c1 c2 q0 q1 c0 c1 c2 q0 q1 |1i |1i |0i |0i |0i + |1i |1i |1i |1i |0i . (B1) c0 c1 c2 q0 q1 c0 c1 c2 q0 q1 Looking at the state |Ψ i, one can see that none of its constituent states is transformed into |1i |0i |1i by the 1 c1 q0 q1 CX gate that follows, since the state |1i |1i |1i that would be transformed to the former is not present in |Ψ i. c1 q0 q1 1 Thus the relative phase Toffoli gate does not affect the phase in the registers. Similarly for |Ψ i, the state |1i |1i |0i is not present when the subsequent relative phase Toffoli gate is applied 2 c0 q0 q1 because the state |1i |1i |1i is absent from the register for |Ψ i and this is needed when Xˆ is applied to qubit q . c0 q0 q1 2 1 The scenario for |Ψ3i is the same as that of |Ψ1i, the only difference is the control is now c0. 13

The Margolus gates in this particular quantum circuit never encounter the basis state |101i, thus the operation of the circuit remains unchanged by the replacement of full Toﬀoli gates with their respective relative phase counterparts.

Appendix C: IBM Quantum Experience

The experiments in this paper were conducted on the IBM Quantum Experience ibmq toronto and ibmq casablanca processors through the software development kit Qiskit [36]. Each experiment reported here was conducted on the date shown in the table below.

Experiment Date Compiled quantum order-finding on ibmq casablanca 2020/12/03 State tomography on ibmq casablanca 2020/12/04 Verification of entanglement on ibmq casablanca 2020/12/04 Compiled quantum order-finding on ibmq toronto 2020/12/06 Verification of entanglement on ibmq toronto 2020/12/07 State tomography on ibmq toronto 2020/12/16

TABLE I. Dates of experiments.

For characterization purposes, the compiled quantum order-ﬁnding experiments were submitted in batches of 900 circuits with each circuit having 8192 measurement shots. In total, 900 × 8192 measurements were made. In choosing the qubit device mappings shown in the main paper, preference was given to the qubit pairs with relatively small CX error rates. TablesII andIII show reported single qubit-error rates for ibmq toronto and ibmq casablanca π respectively, where U2(φ, λ) = Rz(φ)Ry( 2 )Rz(λ). TableIV shows the CX error rates for the two processors. The dates of the experiments are given in the captions.

U2 gate error rate Readout error rate Q0 6.010 × 10−2 4.39 × 10−4 Q1 3.14 × 10−2 2.12 × 10−4 Q2 2.98 × 10−2 1.96 × 10−4 Q3 9.30 × 10−3 5.74 × 10−4 Q4 1.34 × 10−2 2.097 × 10−4

TABLE II. Reported single-qubit gate errors on 16 December 2020.

U2 gate error rate Readout error rate Q0 2.16 × 10−2 2.18 × 10−4 Q1 1.31 × 10−2 4.042 × 10−4 Q2 1.54 × 10−2 2.78 × 10−4 Q3 9.30 × 10−2 2.62 × 10−4 Q4 1.67 × 10−2 4.96 × 10−4

TABLE III. Reported single-qubit gate errors on 06 December 2020.

Qiskit’s state tomography fitter uses a least-squares fitting to find the closest density matrix described by Pauli measurement results [37]. On an n-qubit system, the fitter requires measurement results from executing 3n circuits. This makes state tomography on large circuits in impractical. Thus only 30 state tomography experiments were performed for the three control register qubits and in total 33 × 30 × 8192 measurement were made. In reducing the effect of noise due to final measurement errors, Qiskit recommends a measurement error mitigation approach. The approach starts off by creating circuits that each perform a measurement of the 2n basis states. The n measurement counts of the 2 basis state measurements are put into a column vector Cnoisy, arranged in ascending order by the value of their measurement bitstring, i.e. 00 ... 00 is the first element, the next is 00 ... 01 and so on. The approach assumes that there is a matrix M called the calibration matrix, such that

Cnoisy = MCideal, (C1) 14

ibmq toronto ibmq casablanca CX (0,1) 6.620 × 10−3 9.126 × 10−3 CX (1,4) 8.214 × 10−3 1.114 × 10−2 CX (2,1) 7.152 × 10−3 7.446 × 10−3 CX (3,2) 6.824 × 10−3 1.337 × 10−2

TABLE IV. Reported CX gate errors on 06 December (ibmq casablanca) and 16 December (ibmq toronto) 2020.

where Cideal is a column vector of measurement counts in the absence of noise. If M is invertible then, then Cnoisy −1 can transformed into Cideal by ﬁnding M

−1 Cideal = M Cnoisy. (C2)

Qiskit [38] uses a least-squares ﬁt to calculate an approximate M −1 by some other matrix M˜ −1, as in general M is not invertible, giving

−1 Cmitigated = M˜ Cnoisy. (C3)

The entries of the column vector Cmitigated correspond to the mitigated measurement counts in same order as before. The entirety of the results reported in our work make of use of this approach.

Appendix D: Error bars

All the conﬁdence intervals of the data presented here were established via non-parametric bootstrap resampling techniques. In order to place the constraint that the measurement counts should sum to the number of experimental shots, a sample contains data as column vectors of outcomes of some experiment. In each round, the resampling draws entire column vectors whose elements respect the aforementioned constraint. For each outcome across the column vectors, mean estimates are obtained and a conﬁdence interval around the estimates can be appropriately constructed. To elucidate the above, consider the following example. Consider the outcomes of a two-qubit experiment with experimental shots of 8192 repeated 4 times, as shown in TableV below.

Outcomes Counts Exp. 1 Exp. 2 Exp. 3 Exp. 4 00 2335 2208 2406 2203 01 665 690 633 656 10 183 100 197 177 11 5009 5192 4956 5156

TABLE V. Example data for a two-qubit experiment repeated 4 times for illustrating how bootstrap resampling was done.

Suppose we resampled the experiments 1, 1, 2, 4 from TableV, making a bootstrap sample of size 5.

B = [[2335, 665, 183, 5009], [2335, 665, 183, 5009], [2208, 690, 100, 5192], [2203, 656, 177, 5156]]. (D1)

From this, we can obtain appropriately the bootstrap sample for each outcome (corresponding to an index), e.g. the bootstrap sample for the outcomes at index 0 (outcome 00) is

B0 = [2335, 2335, 2208, 2203]. (D2)

The bootstrap mean estimates and conﬁdence intervals can then be performed for each outcome while respecting the constraint of the measurement counts summing up to the total number of experimental shots. 15

Appendix E: Pauli measurements

As an example, consider the measurement of the Pauli expectation value hZZZZZi. Let pijklm denote the probability for a computational basis measurement {|0i , |1i} of ﬁve qubits to output the binary string ijklm, i.e. p00000 denotes the probability to measure all the qubits in |0i state. To calculate hZZZZZi we can combine these probabilities as given in the equation below

hZZZZZi = p00000 − p00010 − p00100 + p00101 + p00110 − p01000 + p01001 + p01010 + p01100 − p01101−

p01110 + p01111 − p10000 + p10001 + p10010 − p10011 + p10100 − p10101 − p10110 + p10111+

p11000 − p11001 − p11010 + p11011 − p11100 + p11101 + p11110 − p11111. (E1)

Similarly, the expectation hIZIZIi is given by

hIZIZIi = p00000 − p00010 + p00100 + p00101 − p00110 − p01000 − p01001 + p01010 − p01100 − p01101+

p01110 + p01111 + p10000 + p10001 − p10010 − p10011 + p10100 + p10101 − p10110 − p10111−

p11000 − p11001 + p11010 + p11011 − p11100 − p11101 + p11110 + p11111. (E2)

However, the terms in the equation above are given by the marginalization of the distribution measured in Eq. (E1) across the outcome space of qubits 1, 3 and 5. By considering all such marginalizations of the distribution in Eq. (E1), we obtain the set of Pauli expectation values that can be derived from a measurement of hZZZZZi, namely

{ ZZZZI,ZZZIZ,ZZZII,ZZIZZ,ZZIZI,ZZIIZ,ZZIII,ZIZZZ,ZIZZI,ZIZIZ,ZIZII,ZIIZZ, ZIIZI,ZIIIZ,ZIIII,IZZZZ,IZZZI,IZZIZ,IZZII,IZIZZ,IZIZI,IZIIZ,IZIII,IIZZZ, IIZZI,IIZIZ,IIZII,IIIZZ,IIIZI,IIIIZ }. (E3)

After applying what is described above to the Pauli decomposition of the ideal state ρ = |ΨihΨ|, we reduce the number of terms that we need to measure from 293 to 79 terms, as given below

{ XXXXZ,XXXZX,XXXZZ,XXYYZ,XXYZY,XXZXX,XXZXZ,XXZYY,XXZZX,XYXYZ, XYXZY,XYYXZ,XYYZX,XYYZZ,XYZXY,XYZYX,XYZYZ,XYZZY,XZXXX,XZXYY, XZXZZ,XZYXY,XZYYX,XZZXZ,XZZZX,YXXYZ,YXXZY,YXYXZ,YXYZX,YXYZZ, YXZXY,YXZYX,YXZYZ,YXZZY,YYXXZ,YYXZX,YYXZZ,YYYYZ,YYYZY,YYZXX, YYZXZ,YYZYY,YYZZX,YZXXY,YZXYX,YZYXX,YZYYY,YZYZZ,YZZYZ,YZZZY, ZXXXZ,ZXXZX,ZXXZZ,ZXYYZ,ZXYZY,ZXZXX,ZXZXZ,ZXZYY,ZXZZX,ZYXYZ, ZYXZY,ZYYXZ,ZYYZX,ZYYZZ,ZYZXY,ZYZYX,ZYZYZ,ZYZZY,ZZXXX,ZZXXZ, ZZXYY,ZZXZX,ZZYXY,ZZYYX,ZZYYZ,ZZYZY,ZZZXX,ZZZYY,ZZZZZ }. (E4)

Appendix F: Maximum overlap with respect to the bipartitions

The values listed below were obtained using the software package QUBIT4MATLAB [27]. Here, |φi is a pure biseparable state in some deﬁned bipartite partition (bipartition), i.e. an unentangled product state with respect to this bipartition, and |Ψi is the ideal state in both the control and work registers preceding the application of the QFT 16 to the control register.

max |hφ|Ψi|2 = 0.500, φ∈{(c1)(c0c2q0q1)} max |hφ|Ψi|2 = 0.500, φ∈{(c2)(c0c1q0q1)} max |hφ|Ψi|2 = 0.750, φ∈{(q0)(c0c1c2q1)} max |hφ|Ψi|2 = 0.625, φ∈{(q1)(c0c1c2q0)} max |hφ|Ψi|2 = 0.500, φ∈{(c0c1)(c2q0q1)} max |hφ|Ψi|2 = 0.500, φ∈{(c0c2)(c1q0q1)} max |hφ|Ψi|2 = 0.427, φ∈{(c0q0)(c1c2q1)} max |hφ|Ψi|2 = 0.570, φ∈{(c0q1)(c1c2q0)} max |hφ|Ψi|2 = 0.375, φ∈{(q0q1)(c0c1c2)} max |hφ|Ψi|2 = 0.570, φ∈{(c0q1)(c0c1q0)} max |hφ|Ψi|2 = 0.570, φ∈{(c1q1)(c0c2q0)} max |hφ|Ψi|2 = 0.427, φ∈{(c1q0)(c0c2q1)} max |hφ|Ψi|2 = 0.427, φ∈{(c2q0)(c0c1q1)} max |hφ|Ψi|2 = 0.500. (F1) φ∈{(c1c2)(c0q0q1)}

For a given separation of the qubits into two partitions (a bipartition), e.g. (c1)(c0c2q0q1), there is a pure product state |φi with respect to these partitions, i.e. no entanglement between the partitions, that maximizes the overlap squared with the ideal state. The value of the overlap squared between this product state and the ideal state, e.g. 0.5, is therefore the highest value that can be obtained for an unentangled state between the partitions. Thus, if a given state has an overlap squared larger than 0.5 it must be an entangled state with respect to the partitions. The value of the maximum overlap squared changes for the diﬀerent partitions chosen as it depends on the structure of the ideal state. The above results extend to mixed states across the bipartitions due to the convex sum nature of quantum states [27].

Appendix G: Continued fractions and convergents

A 2L + 1 bit rational number ϕ is said to have a continued fraction expansion if it can be written as 1 ϕ ≡ [a0, a1, . . . , an] ≡ a0 + 1 , (G1) a1 + 1 a2+ ···+ 1 an where n is a ﬁnite integer and the ai’s are integers. Additionally, if ϕ < 1, we have a0 = 0. The convergents of the continued fraction expansion are the rationals,

1 1 a0, a0 + , a0 + 1 , ··· (G2) a1 a1 + a2 If a rational number s/r satisﬁes the following inequality s 1 − ϕ ≤ , (G3) r 2r2 17 then s/r will appear as a convergent in the continued fraction expansion of ϕ. If ϕ is an approximation of s/r accurate to 2L + 1 bits, then we have |s/r − ϕ| ≤ 1/22L+1. For r ≤ N ≤ 2L, we have that 1/22L+1 ≤ 1/2r2. Therefore, since the inequality holds for the approximation ϕ, there is a classical algorithm that can compute the convergents of ϕ, and produce integers s0, r0 such that gcd(s0, r0) = 1 in O(L3) operations [2]. We can then check if r0 is the order of r0 n a and N by testing whether a mod N = 1. Note that in our approach, ϕ = ϕs/2 ' s/r is not an approximation that is accurate to 2L + 1 bits as above, but is a further approximation of s/r depending on the resolution, i.e. the number of iterations, or alternatively qubits in the control register. Consider the following example of the ﬁnal measurement outcomes from Fig. 7 in the main text, where the outcomes |110i = |6i and |101i = |5i are peaked in the outcome distribution and we have used the integer representation of the 6 5 binary outcome. The former outcome gives ϕ = 23 and latter gives ϕ = 23 . Computing the continued fractions of the former gives 6 3 = , 8 4 3 1 = 0 + 4 , 4 3 3 1 = 0 + 1 . (G4) 4 1 + 3 Thus 6 = [0, 1, 3]. (G5) 8 Computing the convergents according to Eq. (G2) gives 0, 1, 3/4. On the other hand, computing the continued fractions of the latter ϕ gives 5 1 = 0 + 8 , 8 5 5 1 = 0 + 3 , 8 1 + 5 5 1 = 0 + 1 , 8 1 + 5 3 5 1 = 0 + 1 , 8 1 + 2 1+ 3 5 1 = 0 + 1 , 8 1 + 1 1+ 3 2 5 1 = 0 + . (G6) 8 1 1 + 1+ 1 1+ 1 2 This gives 5 = [0, 1, 1, 1, 2]. (G7) 8 Computing the convergents gives 0, 1, 1/2, 2/3, 5/8. Looking at the former and latter computed convergents, we note that the third convergent of the latter correctly gives r0 = 3 while the convergents of the former do not give the correct 0 order when tested using ar mod N = 1.