
www.nature.com/npjqi

ARTICLE OPEN

Convex optimization of programmable quantum computers

Leonardo Banchi 1,2 ✉, Jason Pereira 3, Seth Lloyd 4,5 and Stefano Pirandola 3,5

A fundamental model of quantum computation is the programmable quantum gate array. This is a quantum processor that is fed by a program state that induces a corresponding quantum operation on input states. While being programmable, any finite-dimensional design of this model is known to be nonuniversal, meaning that the processor cannot perfectly simulate an arbitrary quantum channel over the input. Characterizing how close the simulation is and finding the optimal program state have been open questions for the past 20 years. Here, we answer these questions by showing that the search for the optimal program state is a convex optimization problem that can be solved via semidefinite programming and gradient-based methods commonly employed for machine learning. We apply this general result to different types of processors, from a shallow design based on teleportation, to deeper schemes relying on port-based teleportation and parametric quantum circuits.

npj Quantum Information (2020) 6:42; https://doi.org/10.1038/s41534-020-0268-2

INTRODUCTION
Back in 1997, a seminal work by Nielsen and Chuang1 proposed a quantum version of the programmable gate array that has become a fundamental model for quantum computation2. This is a quantum processor where a fixed quantum operation is applied to an input state together with a program state. The aim of the program state is to induce the processor to apply some desired quantum gate or channel3 to the input state. Such a feature of quantum programmability comes with a cost: the model cannot be universal, unless the program state is allowed to have an infinite dimension, i.e., infinite qubits1,4. Even though this limitation has been known for many years, there is still no exact characterization of how well a finite-dimensional programmable quantum processor can generate or simulate an arbitrary quantum channel. Also, there is no literature on how to find the corresponding optimal program state, or even to show that this state can indeed be found by some optimization procedure. Here, we show the solutions to these long-standing open problems.

Here, we show that the optimization of programmable quantum computers is a convex problem for which the solution can always be found by means of classical semidefinite programming (SDP) and classical gradient-based methods that are commonly employed for machine learning (ML) applications. ML methods have found wide applicability across many disciplines5, and we are currently witnessing the development of new hybrid areas of investigation, where ML methods are interconnected with quantum information, such as quantum-enhanced ML (refs. 6–10; e.g., quantum neural networks, etc.), protocols of quantum-inspired ML (e.g., for recommendation systems11 or component analysis and supervised clustering12), and classical learning methods applied to quantum computers, as explored here in this manuscript.

In our work, we quantify the error between an arbitrary target channel and its programmable simulation in terms of the diamond distance3,13, and other suitable cost functions, including the trace distance and the quantum fidelity. For all the considered cost functions, we are able to show that the minimization of the simulation error is a convex optimization problem in the space of the program states. This already solves an outstanding problem that affects various models of quantum computers (e.g., variational quantum circuits), where the optimization over classical parameters is non-convex and therefore not guaranteed to converge to a global optimum. By contrast, because our problem is proven to be convex, we can use SDP to minimize the diamond distance and always find the optimal program state for the simulation of a target channel, therefore optimizing the programmable quantum processor. Similarly, we may find suboptimal solutions by minimizing the trace distance or the quantum fidelity by means of gradient-based techniques adapted from the ML literature, such as the projected subgradient method14 and the conjugate gradient method15,16. We note indeed that the minimization of the ℓ1-norm, mathematically related to the quantum trace distance, is widely employed in many ML tasks17,18, so many of those techniques can be adapted for learning program states.

With these general results in our hands, we first discuss the optimal learning of arbitrary unitaries with a generic programmable quantum processor. Then, we consider specific designs of the processor, from a shallow scheme based on the teleportation protocol, to higher-depth designs based on port-based teleportation (PBT)19–21 and parametric quantum circuits (PQCs)22, introducing a suitable convex reformulation of the latter. In the various cases, we benchmark the processors for the simulation of basic unitary gates (rotations) and various basic channels, including the amplitude damping channel that is known to be the most difficult to simulate23,24. For the deeper designs, we find that the optimal program states do not correspond to the Choi matrices of the target channels, which is rather counterintuitive and unexpected.

RESULTS
We first present our main theoretical results on how to train the program state of programmable quantum processors, either via convex optimization or first-order gradient-based algorithms. We then apply our general methods to study the learning of arbitrary unitaries, and the simulation of different channels via processors built either from quantum teleportation and its generalization, or from PQCs.

1Department of Physics and Astronomy, University of Florence, via G. Sansone 1, I-50019 Sesto Fiorentino (FI), Italy. 2INFN Sezione di Firenze, via G. Sansone 1, I-50019 Sesto Fiorentino (FI), Italy. 3Department of Computer Science, University of York, York YO10 5GH, UK. 4Department of Mechanical Engineering, Massachusetts Institute of Technology (MIT), Cambridge, MA 02139, USA. 5Research Laboratory of Electronics, Massachusetts Institute of Technology (MIT), Cambridge, MA 02139, USA. ✉email: leonardo.banchi@unifi.it

Published in partnership with The University of New South Wales

Fig. 1 Quantum processor Q with program state π that simulates a quantum channel E_π from input to output. We also show the CPTP map Λ of the processor, from the program state π to the output Choi matrix χ_π (generated by partial transmission of the maximally entangled state Φ).

Programmable channel simulation
Let us consider an arbitrary mapping from d-dimensional input states into d′-dimensional output states, where d′ ≠ d in the general case. This is described by a quantum channel E that may represent the overall action of a quantum computation and does not need to be a unitary transformation. Any channel E can be simulated by means of a programmable quantum processor, which is modeled in general by a fixed completely positive trace-preserving (CPTP) map Q that is applied to both the input state and a variable program state π. In this way, the processor transforms the input state by means of an approximate channel E_π as

E_π(ρ) = Tr₂[Q(ρ ⊗ π)],   (1)

where Tr₂ is the partial trace over the program state. A fundamental result1 is that there is no fixed quantum processor Q that is able to exactly simulate any quantum channel E. In other terms, given E, we cannot find the corresponding program π such that E = E_π. Yet simulation can be achieved in an approximate sense, where the quality of the simulation may increase for larger program dimension. In general, the open problem is to determine the optimal program state π̃ that minimizes the simulation error, which can be quantified by the cost function

C◇(π) := ‖E − E_π‖◇,   (2)

namely the diamond distance3,13 between the target channel E and its simulation E_π. In other words,

Find π̃ such that C◇(π̃) = min_π C◇(π).   (3)

From theory1,4, we know that we cannot achieve C◇ = 0 for arbitrary E unless π and Q have infinite dimensions. As a result, for any finite-dimensional realistic design of the quantum processor, finding the optimal program state π̃ is an open problem. Recall that the diamond distance is defined by ‖E − E_π‖◇ := max_φ ‖I ⊗ E(φ) − I ⊗ E_π(φ)‖₁, where I is the identity map and ‖O‖₁ := Tr√(O†O) is the trace norm2.

It is important to note that this problem can be reduced to a simpler one by introducing the channel's Choi matrix

χ_{E_π} = I ⊗ E_π(Φ) = d⁻¹ Σ_{ij} |i⟩⟨j| ⊗ Tr₂[Q(|i⟩⟨j| ⊗ π)],   (4)

where Φ := |Φ⟩⟨Φ| is a d-dimensional maximally entangled state. From this expression, it is clear that the Choi matrix χ_{E_π} is linear in the program state π. More precisely, the Choi matrix χ_{E_π} at the output of the processor Q can be directly written as a CPTP linear map Λ acting on the space of the program states, i.e.,

χ_π := χ_{E_π} = Λ(π).   (5)

This map is also depicted in Fig. 1, and fully describes the action of the processor Q. Then, using results from refs. 3,25,26, we may write

C◇(π) ≥ C₁(π) ≥ 2 − 2√(1 − C_F(π)),   (6)

where

C₁(π) := ‖χ_E − χ_π‖₁   (7)

is the trace distance2 between the target and simulated Choi matrices, and

C_F(π) := 1 − F(π)²,   (8)

where F(π) is the Bures fidelity between the two Choi matrices χ_E and χ_π, i.e.,

F(π) := ‖√χ_E √χ_π‖₁ = Tr√(√χ_E χ_π √χ_E).   (9)

Another possible upper bound can be written using the quantum Pinsker inequality27,28. In fact, we may write C₁(π) ≤ √(2 ln 2) √C_R(π), where

C_R(π) := min{S(χ_E‖χ_π), S(χ_π‖χ_E)},   (10)

and S(ρ‖σ) := Tr[ρ(log₂ ρ − log₂ σ)] is the quantum relative entropy between ρ and σ. In Supplementary Note 1.3, we also introduce a cost function C_p(π) based on the Schatten p-norm.

Convex optimization
One of the main problems in the optimization of reconfigurable quantum chips is that the relevant cost functions are not convex in the set of classical parameters. This problem is completely solved here thanks to the fact that the optimization of a programmable quantum processor is done with respect to a quantum state. In fact, in the methods section, we prove the following

Theorem 1 Consider the simulation of a target quantum channel E by means of a programmable quantum processor Q. The optimization of the cost functions C◇, C₁, C_F, C_R, or C_p is a convex problem in the space of program states π. In particular, the global minimum π̃ for C◇ can always be found as a local minimum.

This convexity result is generally valid for any cost function that is convex in π. This is the case for any desired norm, not only the trace norm, but also the Frobenius norm, or any Schatten p-norm. It also applies to the relative entropy. Furthermore, the result can also be extended to any convex parametrization of the program states.

When dealing with convex optimization with respect to positive operators, the standard approach is to map the problem to a form that is solvable via SDP (refs. 29,30). Since the optimal program is the one minimizing the cost function, it is important to write the computation of the cost function itself as a minimization. For the case of the diamond distance, this can be achieved by using the dual formulation29. More precisely, consider the linear map Ω_π := E − E_π with Choi matrix χ_Ω = χ_E − χ_π = χ_E − Λ(π), and the spectral norm ‖O‖∞ := max{‖Ou‖₂ : u ∈ C^d, ‖u‖₂ ≤ 1}, which is the maximum eigenvalue of √(O†O). Then, by the strong duality of the diamond norm, C◇(π) = ‖Ω_π‖◇ is given by the SDP (ref. 30)

Minimize 2‖Tr₂ Z‖∞, subject to Z ≥ 0 and Z ≥ d(χ_E − Λ(π)).   (11)

The importance of the above dual formulation is that the diamond distance is a minimization, rather than a maximization over a set of matrices. In order to find the optimal program π̃, we apply the unique minimization of Eq. (11), where π is variable and satisfies the additional constraints π ≥ 0 and Tr(π) = 1.

In the methods section, we show that other cost functions, such as C₁ and C_F, can also be written as SDPs. Correspondingly, the optimal programs π̃ can be obtained by numerical SDP solvers. Most numerical packages implement second-order algorithms, such as the interior point method31.
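The relations among the cost functions in Eqs. (6)–(10) can be checked numerically. A minimal sketch (assuming NumPy; two random full-rank 4×4 density matrices stand in for the Choi matrices χ_E and χ_π, since the bounds hold for any pair of states):

```python
import numpy as np

rng = np.random.default_rng(0)

def rand_state(d):
    """Random full-rank density matrix (full rank keeps S(rho||sigma) finite)."""
    G = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
    rho = G @ G.conj().T
    return 0.9 * rho / np.trace(rho) + 0.1 * np.eye(d) / d

def sqrtm_psd(M):
    """Matrix square root of a PSD matrix via eigendecomposition."""
    w, V = np.linalg.eigh(M)
    return (V * np.sqrt(np.clip(w, 0, None))) @ V.conj().T

def trace_dist(a, b):
    """C_1 = || a - b ||_1, Eq. (7)."""
    return np.abs(np.linalg.eigvalsh(a - b)).sum()

def fidelity(a, b):
    """Bures fidelity, Eq. (9)."""
    s = sqrtm_psd(a)
    return np.sqrt(np.clip(np.linalg.eigvalsh(s @ b @ s), 0, None)).sum()

def rel_entropy(a, b):
    """Quantum relative entropy S(a||b) in bits."""
    def log2m(M):
        w, V = np.linalg.eigh(M)
        return (V * np.log2(w)) @ V.conj().T
    return np.real(np.trace(a @ (log2m(a) - log2m(b))))

chi_E, chi_pi = rand_state(4), rand_state(4)
C1 = trace_dist(chi_E, chi_pi)
F = fidelity(chi_E, chi_pi)
CF = 1 - F**2                                                    # Eq. (8)
CR = min(rel_entropy(chi_E, chi_pi), rel_entropy(chi_pi, chi_E))  # Eq. (10)

# fidelity-based lower bound and Pinsker upper bound on C_1
assert 2 - 2 * F <= C1 + 1e-9
assert C1 <= np.sqrt(2 * np.log(2) * CR) + 1e-9
```

The assertions reproduce the chain of bounds: the fidelity controls C₁ from below, while the relative entropy controls it from above via the Pinsker inequality.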

npj Quantum Information (2020) 42

However, second-order methods tend to be computationally heavy for large problem sizes32,33, namely when π̃ contains many qudits. In the following section, we introduce first-order methods, which are better suited for larger program states. It is important to remark that there also exist zeroth-order (derivative-free) methods, such as the simultaneous perturbation stochastic approximation method34, which was utilized for a quantum problem in ref. 35. However, it is known that zeroth-order methods normally have slower convergence times compared to first-order methods36.

Gradient-based optimization
In ML applications, where a large amount of data is commonly available, there have been several works that study the minimization of suitable matrix norms for different purposes17,18,37,38. First-order methods are preferred for large dimensional problems, as they are less computationally intensive and require less memory. Here, we show how to apply first-order (gradient-based) algorithms, which are widely employed in ML applications, to find the optimal quantum program.

For this purpose, we need to introduce the subgradient of the cost function C at any point π ∈ S, which is the set

∂C(π) = {Z : C(σ) − C(π) ≥ Tr[Z(σ − π)], ∀σ ∈ S},   (12)

where Z is a Hermitian operator39,40. If C is differentiable, then ∂C(π) contains a single element: its gradient ∇C(π). We explicitly compute this gradient for an arbitrary programmable quantum processor (1) whose Choi matrix χ_{E_π} ≡ χ_π = Λ(π) can be written as a quantum channel Λ that maps a generic program state to the processor's Choi matrix. This map can be defined by its Kraus decomposition Λ(π) = Σ_k A_k π A_k† for some operators A_k. In fact, let us call Λ*(ρ) = Σ_k A_k† ρ A_k the dual map; then in the methods section we prove the following

Theorem 2 Consider an arbitrary quantum channel E with Choi matrix χ_E that is simulated by a quantum processor Q with map Λ(π) = χ_π (and dual map Λ*). Then, we may write the following gradients for the trace distance cost C₁(π) and the infidelity cost C_F(π):

∇C₁(π) = Σ_k sign(λ_k) Λ*(P_k),   (13)

∇C_F(π) = −2√(1 − C_F(π)) ∇F(π),   (14)

∇F(π) = ½ Λ*[√χ_E (√χ_E Λ(π) √χ_E)^(−1/2) √χ_E],   (15)

where λ_k (P_k) are the eigenvalues (eigenprojectors) of the Hermitian operator χ_π − χ_E. When C₁(π) or C_F(π) are not differentiable in π, then the above expressions provide an element of the subgradient ∂C(π).

Once we have the (sub)gradient of the cost function C, we can solve the optimization min_{π∈S} C(π) using the projected subgradient method14,39. Let P_S be the projection onto the set of program states S, namely P_S(X) = argmin_{π∈S} ‖X − π‖₂, which we show to be computable from the spectral decomposition of any Hermitian X (see Theorem 3 in the "Methods" section). Then, we iteratively apply the steps

1) Select an operator g_i from ∂C(π_i);
2) π_{i+1} = P_S(π_i − α_i g_i),   (16)

where i is the iteration index, α_i is what is called the "learning rate", and Theorem 2 can be employed to find g_i at each step. It is simple to show that π_i converges to the optimal program state π̃ in O(ε⁻²) steps, for any desired precision ε such that |C(π) − C(π̃)| ≤ ε. Another approach is the conjugate gradient method15,39, sometimes called the Frank–Wolfe algorithm. Here, we iteratively apply

1) Find the eigenvector |σ_i⟩ with the smallest eigenvalue of ∇C(π_i);
2) π_{i+1} = [i/(i + 2)] π_i + [2/(i + 2)] |σ_i⟩⟨σ_i|.   (17)

When the gradient of C is Lipschitz continuous with constant L, the method converges after O(L/ε) steps16,41. To justify the applicability of this method, a suitable smoothening of the cost function must be employed. The downside of the conjugate gradient method is that it necessarily requires a differentiable cost function C, with gradient ∇C. Specifically, this may create problems for the trace distance cost C₁, which is generally non-smooth. A solution to this problem is to define the cost function in terms of the smooth trace distance C_μ(π) = Tr[h_μ(χ_π − χ_E)], where h_μ is the so-called Huber penalty function, h_μ(x) := x²/(2μ) if |x| < μ and |x| − μ/2 if |x| ≥ μ. This quantity satisfies C_μ(π) ≤ C₁(π) ≤ C_μ(π) + μd/2 and is a convex function over program states, with gradient ∇C_μ(π) = Λ*[h_μ′(χ_π − χ_E)].

Learning of arbitrary unitaries
One specific application is the simulation of quantum gates or, more generally, unitary transformations22,42–45. Here, the infidelity provides the most convenient cost function, as the optimal program can be found analytically. In fact, suppose we use a quantum processor with map Λ to simulate a target unitary U. Because the Choi matrix of U is pure, χ_U = |χ_U⟩⟨χ_U|, we first note that F(π)² = ⟨χ_U|Λ(π)|χ_U⟩, and then we see that Eq. (15) drastically simplifies to ∇F(π) = Λ*(|χ_U⟩⟨χ_U|)/√(4F(π)²). As a result, we find

∇C_F(π) = −Λ*[|χ_U⟩⟨χ_U|],   (18)

where there is no dependence on π. Therefore, using the conjugate gradient method in Eq. (17), we see that the optimal program state π̃ for the infidelity cost function C_F is a fixed point of the iteration and is equal to the maximum eigenvector of Λ*[|χ_U⟩⟨χ_U|].

Teleportation processor
Once we have shown how to optimize a generic programmable quantum processor, we discuss some specific designs, over which we will test the optimization procedure. One possible (shallow) design for the quantum processor Q is a generalized teleportation protocol46 over an arbitrary program state π. In dimension d, the protocol involves a basis of d² maximally entangled states |Φ_i⟩ and a basis {U_i} of teleportation unitaries such that Tr(U_i†U_j) = d δ_ij (ref. 47). An input d-dimensional state ρ_S and the A part of the program π_AB are subject to the projector |Φ_i⟩⟨Φ_i|. The classical outcome i is communicated to the B part of π_AB, where the correction U_i⁻¹ is applied. The above procedure defines the teleportation channel E_π over ρ:

E_π^tele(ρ) = Σ_i (U_i⁻¹)_B ⟨Φ_i^SA| (ρ_S ⊗ π_AB) |Φ_i^SA⟩ (U_i⁻¹)_B†.   (19)

Its Choi matrix can be written as χ_π = Λ_tele(π), where the map of the teleportation processor is equal to

Λ_tele(π) = d⁻² Σ_i (U_i ⊗ U_i*) π (U_i ⊗ U_i*)†,   (20)

which is clearly self-dual, Λ* = Λ. Given a target quantum channel E which is teleportation covariant23,24, namely when [χ_E, U_i ⊗ U_i*] = 0 for any i, then we know that its simulation is perfect and the optimal program π̃ is the channel's Choi matrix, i.e., one of the fixed points of the map Λ_tele. For a general channel, the optimal program π̃ can be approximated by using the cost functions in our Theorem 2 with Λ being given in Eq. (20), or directly found by optimizing C◇(π).
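As an illustration, the qubit (d = 2) teleportation processor of Eq. (20) can be optimized with the projected subgradient iteration of Eq. (16). A sketch (assuming NumPy; the target is a dephasing channel, which is teleportation covariant, so its Choi matrix is a fixed point of the map and the residual trace distance cost becomes small):

```python
import numpy as np

# Pauli basis of teleportation unitaries for d = 2
I2 = np.eye(2, dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Z = np.diag([1.0, -1.0]).astype(complex)
Y = 1j * X @ Z
Ks = [np.kron(U, U.conj()) for U in (I2, X, Y, Z)]

def Lam(pi):
    """Teleportation processor map, Eq. (20); self-dual, so it also acts as the dual map."""
    return sum(K @ pi @ K.conj().T for K in Ks) / 4

# target Choi matrix: dephasing channel with p = 0.3 (teleportation covariant)
phi = np.zeros(4, dtype=complex); phi[0] = phi[3] = 2 ** -0.5
Phi = np.outer(phi, phi.conj())
ZB = np.kron(I2, Z)
chi_E = 0.7 * Phi + 0.3 * (ZB @ Phi @ ZB)

def proj_simplex(v):
    """Euclidean projection of a real vector onto the probability simplex."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    k = np.arange(1, len(v) + 1)
    rho = np.nonzero(u - (css - 1) / k > 0)[0][-1]
    return np.maximum(v - (css[rho] - 1) / (rho + 1), 0)

def proj_state(Xh):
    """Frobenius projection onto density matrices via spectral decomposition (Theorem 3)."""
    w, V = np.linalg.eigh((Xh + Xh.conj().T) / 2)
    return (V * proj_simplex(w)) @ V.conj().T

def C1(pi):
    """Trace distance cost, Eq. (7)."""
    return np.abs(np.linalg.eigvalsh(Lam(pi) - chi_E)).sum()

def subgrad(pi):
    """Subgradient of C1, Eq. (13), using the self-dual map Lam."""
    w, V = np.linalg.eigh(Lam(pi) - chi_E)
    return Lam((V * np.sign(w)) @ V.conj().T)

pi = np.eye(4, dtype=complex) / 4          # maximally mixed initial program
best = C1(pi)
for i in range(2000):                      # projected subgradient, Eq. (16)
    pi = proj_state(pi - 0.3 / np.sqrt(i + 1) * subgrad(pi))
    best = min(best, C1(pi))

# dephasing is teleportation covariant: its Choi matrix is a fixed point ...
assert np.allclose(Lam(chi_E), chi_E)
# ... and the optimized program approaches a perfect simulation
assert best < 0.1
```

The diminishing learning rate 0.3/√(i + 1) follows the standard projected subgradient schedule; the final tolerance 0.1 is a conservative bound, and in practice the cost drops far lower.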

Fig. 2 PBT scheme. Two distant parties, Alice and Bob, share N maximally entangled pairs {A_k, B_k}, k = 1, …, N. Alice also has another system C in the state |ψ⟩. To teleport C, Alice performs the POVM {Π_i^AC} on all her local systems A = {A_k} and C. She then communicates the outcome i to Bob. Bob discards all his systems B = {B_k} with the exception of B_i. After these steps, the state |ψ⟩ is approximately teleported to B_i. Similarly, an arbitrary channel E is simulated with N copies of the Choi matrix χ_E^{A_kB_k}. The figure shows an example with N = 5, where i = 4 is selected.

Port-based teleportation
A deeper design is provided by a PBT processor, whose overall protocol is illustrated in Fig. 2. Here, we consider a more general formulation of the original PBT protocol19,20, where the resource entangled pairs are replaced by an arbitrary program state π. In a PBT processor, each party has N systems (or "ports"), A = {A₁, …, A_N} for Alice and B = {B₁, …, B_N} for Bob. These are prepared in a program state π_AB. To teleport an input state ρ_C, Alice performs a joint positive operator-valued measurement (POVM) {Π_i} (ref. 19) on system C and the A-ports. She then communicates the outcome i to Bob, who discards all ports except B_i, which is the output B_out. The resulting PBT channel P_π : H_C → H_{B_out} is then

P_π(ρ) = Σ_{i=1}^N Tr_{AB̄_iC}[√Π_i (π_AB ⊗ ρ_C) √Π_i]_{B_i→B_out} = Σ_{i=1}^N Tr_{AB̄_iC}[Π_i (π_AB ⊗ ρ_C)]_{B_i→B_out},   (21)

where B̄_i := B∖B_i = {B_k : k ≠ i}.

In the standard PBT protocol19,20, the program state is fixed as π_AB = ⊗_{k=1}^N Φ_{A_kB_k}, where |Φ⟩_{A_kB_k} are Bell states, and the following POVM is used

Π_i = Π̃_i + (1/N)(1 − Σ_k Π̃_k),   (22)

where

Π̃_i = σ_AC^(−1/2) Φ_{A_iC} σ_AC^(−1/2),   (23)

σ_AC := Σ_{i=1}^N Φ_{A_iC},   (24)

and σ^(−1/2) is an operator defined only on the support of σ. The PBT protocol is formulated for N ≥ 2 ports. However, we also include here the trivial case of N = 1, corresponding to the process where Alice's input is traced out and the output is the reduced state of Bob's port, i.e., a maximally mixed state. In the limit N → ∞, the standard PBT protocol approximates an identity channel, P_π(ρ) ≈ ρ, with fidelity F_π = 1 − O(1/N)19,21, so perfect simulation is possible only in the limit N → ∞. Since the standard PBT protocol provides an approximation to the identity channel, we call it I_N.

From the PBT simulation of the identity channel, it is possible to approximate any general channel E by noting that E can be written as a composition E = E∘I, where I is the identity channel. This is done by replacing the identity channel I with its PBT simulation I_N, and then applying E to B_i. However, since Bob does not perform any post-processing on his systems B, aside from discarding all ports B_k with k ≠ i, he can also first apply the channel E to all his N ports and then discard the ports B_k with k ≠ i. In doing so, he changes the program state to

π_AB = ⊗_{k=1}^N I_{A_k} ⊗ E_{B_k}(Φ_{A_kB_k}) = ⊗_{k=1}^N χ_E^{A_kB_k}.   (25)

In other terms, any channel E can be PBT-approximated by N copies of its Choi matrix χ_E as program state. Since the PBT simulation can be decomposed as E_π = E∘I_N, the error C◇ = ‖E − E_π‖◇ in simulating the channel E = E∘I satisfies

C◇ = ‖E∘I − E∘I_N‖◇ ≤ ‖I − I_N‖◇ ≤ 2d(d − 1)N⁻¹,   (26)

where we used the data processing inequality and an upper bound from ref. 48. While the channel's Choi matrix assures that C◇ → 0 for large N, for any finite N it does not represent the optimal program state. In general, for any finite N, finding the optimal program state π_AB simulating a channel E with PBT is an open problem, and no explicit solutions or procedures are known. We employ our convex optimization procedures to find the optimal program state. This can be done either exactly, by minimizing the diamond distance cost function C◇ via SDP, or approximately, by determining the optimal program state via the minimization of the trace distance cost function C₁ via either SDP or the gradient-based techniques discussed above. For this second approach, we need to derive the map Λ of the PBT processor, from the program state π to the output Choi matrix as in Eq. (5). To compute the Choi matrix and CP map Λ, we consider an input maximally entangled state |Φ_DC⟩ and a basis |e_i⟩ of A{B∖B_i}C. Then, by using Eq. (21) and the definition Λ(π) = χ_{P_π} = 1_D ⊗ P_π[Φ_DC], we find the map Λ_{AB→DB_out} of a PBT processor

Λ(π) = Σ_{ij} K_ij π K_ij†, with K_ij := ⟨e_j| √Π_i ⊗ 1_BD |Φ_DC⟩.   (27)

Note that a general program state for PBT consists of 2N qudits, and hence the parameter space has exponential size d^4N. However, because the PBT protocol is symmetric under permutation of the port labels, we show in Supplementary Note 6 that one can exploit this symmetry and reduce the number of free parameters to the binomial coefficient C(N + d⁴ − 1, d⁴ − 1), which is polynomial in the number of ports N. Despite this exponential reduction, the scaling in the number of parameters still represents a practical limiting factor, even for qubits, for which it is O(N¹⁵). A suboptimal strategy consists in reducing the space of program states to a subspace that we call the "Choi space" C. Consider an arbitrary probability distribution {p_k} and then define

C := {π = Σ_k p_k (ρ_AB^k)^⊗N : Tr_B(ρ_AB^k) = d⁻¹ 1}.   (28)

One can show (see Supplementary Note 6) that a global minimum in C is a global minimum in the extremal (non-convex) subspace for p_k = δ_{k,1}, consisting of tensor products of Choi matrices ρ_AB^⊗N. Among these states, there is the N-copy Choi matrix of the target channel χ_E^⊗N = [I ⊗ E(|Φ⟩⟨Φ|)]^⊗N, which is not necessarily the optimal program, as we show below.
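The counting argument above is easy to reproduce. A sketch (Python standard library only; the multiset coefficient C(N + d⁴ − 1, d⁴ − 1) counts the free parameters of the permutation-symmetric ansatz, against d^4N for a generic 2N-qudit program):

```python
from math import comb

def generic_params(N, d=2):
    """Size of the parameter space of a generic 2N-qudit program state."""
    return d ** (4 * N)

def symmetric_params(N, d=2):
    """Free parameters after exploiting permutation symmetry of the ports."""
    return comb(N + d**4 - 1, d**4 - 1)

# for qubits (d = 2) the symmetric count grows like N^15 / 15!,
# while the generic count grows like 16^N
for N in (2, 5, 10, 20):
    assert symmetric_params(N) < generic_params(N)
```

For example, already at N = 5 ports the symmetric ansatz has 15,504 parameters against the generic 2²⁰ ≈ 10⁶, and the gap widens exponentially with N.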


Parametric quantum circuits
Another deep design of quantum processor is based on PQCs (refs. 22,49). A PQC is a sequence of unitary matrices U(t) = U_N(t_N)⋯U₂(t₂)U₁(t₁), where U_j(t_j) = exp(i t_j H_j) for some Hamiltonian H_j and time interval t_j. The problem with PQCs is that the cost functions in the classical parameters42 are not convex, so that numerical algorithms are not guaranteed to converge to the global optimum. Here, we fix this issue by introducing a convex formulation of PQCs, where the classical parameters are replaced by a quantum program. This results in a programmable PQC processor that is optimizable by our methods.

The universality of PQCs can be employed for universal channel simulation. Indeed, thanks to Stinespring's dilation theorem, any channel can be written as a unitary evolution on a bigger space, E(ρ_A) = Tr_{R₀}[U(ρ_A ⊗ θ₀)U†], where the system is paired to an extra register R₀ and θ₀ belongs to R₀. In the Stinespring representation, U acts on system A and register R₀. In ref. 49, it has been shown that sequences of two unitaries, U₀ and U₁, are almost universal for simulation, i.e., any target unitary U can be approximated as U ≈ ⋯U₁^m4 U₀^m3 U₁^m2 U₀^m1 for some integers m_j. Under suitable conditions, it takes O(d²ε⁻ᵈ) steps to approximate U up to precision ε. The choice between U₀ and U₁ is done by measuring a classical bit. We may introduce a quantum version, where the two different unitaries U₀ = e^{iH₀} or U₁ = e^{iH₁} are chosen depending on the state of a qubit R_j. This results in the conditional gate

Û_j = exp(iH₀ ⊗ |0⟩_j⟨0| + iH₁ ⊗ |1⟩_j⟨1|).   (29)

Channel simulation is then obtained by replacing the unitary evolution U in the Stinespring dilation via its simulation. The result is illustrated in Fig. 3, where the program state π is defined over R = (R₀, …, R_N) and each Ĥ_j acts on the input system A and two ancillary qubits R₀ and R_j. Following the universality construction of ref. 49, we show in the Supplementary Note 3.4 that the channel shown in Fig. 3 provides a universal processor. Moreover, the channel Λ that maps any program π to the processor's Choi matrix is obtained as

Λ(π) = Tr_R[Û_AR (Φ_BA ⊗ π_R) Û_AR†],   (30)

where Û_AR = 1_B ⊗ Π_{j=1}^N Û_{j;A,R₀,R_j}, from which we can identify the optimal program |π̃⟩ via our methods.

PQCs are not inherently monotonic. A deeper (higher N) design may simulate a given channel worse than a shallower design. We can design a modified PQC that is monotonic by design, which we designate a "monotonic PQC" (mPQC), by replacing the qubits in our program state with qutrits, and modifying Eq. (29) to read

Û_j = exp(iH₀ ⊗ |0⟩_j⟨0| + iH₁ ⊗ |1⟩_j⟨1| + 0 ⊗ |2⟩_j⟨2|),   (31)

where 0 is a zero operator, so that gate j enacts the identity channel if program j is in the state |2⟩⟨2|. Then, if it were the case that a PQC with N program qubits could simulate a given channel better than one with N + m, an mPQC with N + m qutrits in the program state could perform at least as well as the PQC with N program qubits by setting the first m qutrits to |2⟩⟨2|. This processor design is both universal and monotonic. More precisely, let C(PQC_N) denote the value of a cost function C for simulating a channel E with an N-gate PQC, using the optimal program state, and let C(mPQC_N) denote the value of C for simulating E with an N-gate mPQC, again using the optimal program state. We are then guaranteed that

C(mPQC_N) ≤ min_{M≤N} C(PQC_M).   (32)

Processor benchmarking
In order to show the performance of the various architectures, we consider the simulation of an amplitude damping channel with probability p. The reason is that this is the most difficult channel to simulate, with a perfect simulation only known for infinite dimension, e.g., using continuous-variable quantum operations23. In Figs. 4 and 5, we compare teleportation-based, PBT, PQC, and mPQC programmable processors whose program states have been optimized according to the cost functions C◇ and C₁. For the PBT processor, the trace distance cost C₁ is remarkably close to C◇ and allows us to easily explore high depths. Note that the optimal program states differ from the naive choice of the Choi matrix of the target channel. Note too that PQC processors display non-monotonic behavior when simulating amplitude damping channels, meaning that shallow PQC processors (e.g., for N = 4) may perform better than deeper processors. For the PQC processor, we use the universal Hamiltonians H₀ = √2(X ⊗ Y − Y ⊗ X) and H₁ = (√2 Z + √3 Y + √5 X) ⊗ (Y + √2 Z), where X, Y, and Z are Pauli operators. mPQC processors guarantee that deeper designs always perform at least as well as any shallower design. In Fig. 5, perfect simulation is achievable at specific values of p because of our choice of the universal gates U₀ and U₁. More details are provided in the Supplementary Note 3. Many other numerical simulations are performed in the Supplementary Note 3, where we study the convergence rate in learning a unitary operation, the exact simulation of Pauli channels, and the approximate simulation of both dephasing and amplitude damping channels.
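The conditional gate of Eq. (29) is block diagonal in the program qubit, so it reduces to U₀ or U₁ depending on the register state. A minimal sketch (assuming NumPy; the single-qubit Hamiltonians H₀ = X and H₁ = Z are illustrative stand-ins for the universal pair quoted above):

```python
import numpy as np

def expm_herm(H):
    """exp(i H) for a Hermitian matrix H via eigendecomposition."""
    w, V = np.linalg.eigh(H)
    return (V * np.exp(1j * w)) @ V.conj().T

X = np.array([[0, 1], [1, 0]], dtype=complex)
Z = np.diag([1.0, -1.0]).astype(complex)
P0, P1 = np.diag([1.0, 0.0]), np.diag([0.0, 1.0])   # |0><0| and |1><1|

# Eq. (29): U_hat = exp(i H0 ⊗ |0><0| + i H1 ⊗ |1><1|), system A first
U_hat = expm_herm(np.kron(X, P0) + np.kron(Z, P1))
U0, U1 = expm_herm(X), expm_herm(Z)

# block structure: the program qubit selects which unitary acts on the input
assert np.allclose(U_hat, np.kron(U0, P0) + np.kron(U1, P1))

rho = np.array([[0.75, 0.25], [0.25, 0.25]], dtype=complex)  # some input state
out = U_hat @ np.kron(rho, P1) @ U_hat.conj().T
out_A = np.einsum('ikjk->ij', out.reshape(2, 2, 2, 2))       # trace out the program qubit
assert np.allclose(out_A, U1 @ rho @ U1.conj().T)
```

With the register prepared in a superposition or mixture of |0⟩ and |1⟩, the same construction applies the corresponding convex combination of evolutions, which is exactly what makes the program state a quantum replacement for the classical circuit parameters.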

Fig. 4 Diamond distance error C◇ in simulating an amplitude damping channel E_p at various damping rates p. We compare the performance of different designs for the programmable quantum processor: standard teleportation and PBT with N ports (PBT_N). The optimal program π̃ is obtained by either minimizing directly the diamond distance C◇ (solid lines), or the trace distance C₁ (dashed lines) via the projected subgradient iteration. In both cases, from π̃ we then compute C◇(π̃). The lowest curves are obtained by optimizing π over the Choi space in Eq. (28). For comparison, we also show the (non-optimal) performance when the program is the channel's Choi matrix (dotted lines).

Fig. 3 PQC scheme. Thanks to the Stinespring decomposition, a quantum channel is written as a unitary evolution in an extended space. The corresponding unitary is then simulated via a parametric quantum circuit that, at each time, applies a certain unitary depending on the state of the quantum register.

Published in partnership with The University of New South Wales | npj Quantum Information (2020) 42 | L. Banchi et al.

Fig. 5 Diamond distance error C◇ in simulating an amplitude damping channel E_p at various damping rates p. We compare the performance of two different designs for the programmable quantum processor: PQCs with N + 1 registers (PQC_N) and monotonic parametric quantum circuits with N + 1 registers (mPQC_N). In both cases the optimal program |π̃⟩ is obtained by minimizing the diamond distance C◇.

channels, and the approximate simulation of both dephasing and amplitude damping channels. In particular, we study the performance of the approximate solution when optimizing over larger, but easier-to-compute, cost functions, such as the trace distance or the infidelity.

DISCUSSION

In this work, we have considered a general finite-dimensional model of a programmable quantum processor, which is a fundamental scheme for quantum computing and also a primitive tool for other areas of quantum information. By introducing suitable cost functions, based on the diamond distance, trace distance and quantum fidelity, we have shown how to characterize the optimal performance of this processor in the simulation of an arbitrary quantum gate or channel. In fact, we have shown that the minimization of these cost functions is a convex optimization problem that can always be solved.

In particular, by minimizing the diamond distance via SDP, we can always determine the optimal program state for the simulation of an arbitrary channel. Alternatively, we may minimize the simpler but larger cost functions given by the trace distance and the quantum fidelity via gradient-based methods adapted from ML, so as to provide a very good approximation of the optimal program state. This second approach can also provide closed analytical solutions, as is the case for the simulation of arbitrary unitaries, for which the minimization of the fidelity cost function corresponds to computing an eigenvector.

We have then applied our results to various designs of programmable quantum processor, from a shallow teleportation-based scheme to deeper and asymptotically universal designs that are based on PBT and PQCs. We have explicitly benchmarked the performances of these quantum processors by considering the simulation of unitary gates, depolarizing and amplitude damping channels, showing that the optimal program states may differ from the naive choice based on the Choi matrix of the target channel. Moreover, our results can also be applied to universal quantum measurements50.

A potential application of our work may be the development of a "programmable" model of cloud-based quantum computation, where a client has an input state to be processed by an online quantum server that is equipped with a programmable quantum processor. The client classically informs the server about what type of computation it needs (e.g., some specified χ_E = Λ(π)) and the server generates an optimal program state that closely approximates the overall quantum channel to be applied to the input. The server then accepts the input from the client, processes it, and returns the output together with the value of a cost function quantifying how close the computation was to the client's request.

Our results may also be useful in areas beyond quantum computing, wherever channel simulation is a basic problem. For instance, this is the case when we investigate the ultimate limits of quantum communications24, design optimal Hamiltonians for one-way quantum repeaters, and for all those areas of quantum sensing, hypothesis testing and metrology that are based on quantum channel simulations51. Indeed, the study of adaptive protocols of quantum channel discrimination (or estimation) is notoriously difficult, and their optimal performance is not completely understood. Nonetheless, these protocols can be analyzed by using simulation techniques48,51 where the channel, encoding the unknown parameter, is replaced by an approximate simulating channel, and its parameter is mapped into the label of a program state (therefore reducing the problem from channel to state discrimination/estimation). In this regard, our theory provides the optimal solution to this basic problem, by determining the best simulating channel and the corresponding program state.

METHODS

Convexity proofs
In this section, we provide a proof of Theorem 1, namely we show that the minimization of the main cost functions C◇, C1, and C_F is a convex optimization problem in the space of program states π. This means that we can find the optimal program state π̃ by minimizing C◇ or, alternatively, suboptimal program states can be found by minimizing either C1 or C_F. For the sake of generality, we prove the result for all of the cost functions discussed in the previous sections. We restate Theorem 1 below for completeness.

Theorem 1 The minimization of the generic cost function C = C◇, C1, C_F, C_R or C_p for any p ≥ 1 is a convex optimization problem in the space of program states.

Proof Let us start by showing the result for the diamond distance C◇. In this case, we can write the following:

C◇[pπ + (1 − p)π′] = ‖E − E_{pπ+(1−p)π′}‖◇
 =⁽¹⁾ ‖(p + 1 − p)E − pE_π − (1 − p)E_{π′}‖◇
 ≤⁽²⁾ ‖pE − pE_π‖◇ + ‖(1 − p)E − (1 − p)E_{π′}‖◇   (33)
 =⁽³⁾ p‖E − E_π‖◇ + (1 − p)‖E − E_{π′}‖◇
 = pC◇(π) + (1 − p)C◇(π′),

where we use (1) the linearity of π ↦ E_π, (2) the triangle inequality, and (3) the property ‖xA‖◇ = |x| ‖A‖◇, valid for any operator A and coefficient x.

For any Schatten p-norm C_p with p ≥ 1, we may prove convexity following a similar reasoning. Since for any combination π = p₀π₀ + p₁π₁, with p₀ + p₁ = 1, we have Λ(π) = p₀Λ(π₀) + p₁Λ(π₁), then by exploiting the triangle inequality and the property ‖xA‖_p = |x| ‖A‖_p, we can show that

C_p(p₀π₀ + p₁π₁) = ‖χ_E − Λ(p₀π₀ + p₁π₁)‖_p
 ≤ p₀‖χ_E − Λ(π₀)‖_p + p₁‖χ_E − Λ(π₁)‖_p   (34)
 = p₀C_p(π₀) + p₁C_p(π₁).

To show the convexity of C_F, defined in Eq. (8), we note that the fidelity function F(ρ, σ) satisfies the following concavity relation52:

F(Σ_k p_k ρ_k, σ)² ≥ Σ_k p_k F(ρ_k, σ)².   (35)

Due to the linearity of Λ(π), the fidelity in Eq. (9) satisfies F_π² ≥ Σ_k p_k F²_{π_k} for π = Σ_k p_k π_k. Accordingly, we get the following convexity
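The convexity chain of Eqs. (33)–(34) can be checked numerically. The sketch below is illustrative: it uses the Schatten 1-norm (trace-norm) cost and takes the processor map to be the identity, so that χ_π = π, and verifies C1(pπ + (1 − p)π′) ≤ pC1(π) + (1 − p)C1(π′) on random program states:

```python
import numpy as np

rng = np.random.default_rng(1)

def random_state(d):
    """Random density matrix (Hermitian, PSD, unit trace)."""
    A = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
    rho = A @ A.conj().T
    return rho / np.trace(rho).real

def trace_norm(X):
    """Schatten 1-norm of a Hermitian operator: sum of |eigenvalues|."""
    return np.sum(np.abs(np.linalg.eigvalsh(X)))

d = 4
chi_E = random_state(d)                    # target Choi matrix
C1 = lambda pi: trace_norm(pi - chi_E)     # cost with identity processor map

for _ in range(100):
    pi0, pi1, p = random_state(d), random_state(d), rng.uniform()
    lhs = C1(p * pi0 + (1 - p) * pi1)
    rhs = p * C1(pi0) + (1 - p) * C1(pi1)
    assert lhs <= rhs + 1e-10              # convexity, as in Eq. (34)
```

The inequality holds for every norm-based cost in the theorem, since only the triangle inequality and absolute homogeneity of the norm are used.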

result:

C_F(Σ_k p_k π_k) ≤ Σ_k p_k C_F(π_k).   (36)

For the cost function C_R, the result comes from the linearity of Λ(π) and the joint convexity of the relative entropy. In fact, for π = p₀π₀ + p₁π₁, we may write

S[Λ(π)‖χ_E] = S[p₀Λ(π₀) + p₁Λ(π₁)‖χ_E]
 = S[p₀Λ(π₀) + p₁Λ(π₁)‖p₀χ_E + p₁χ_E]   (37)
 ≤ p₀S[Λ(π₀)‖χ_E] + p₁S[Λ(π₁)‖χ_E],

with a symmetric proof for S[χ_E‖Λ(π)]. This implies the convexity of C_R(π) in Eq. (10). ■

Convex classical parametrizations
The result of Theorem 1 can certainly be extended to any convex parametrization of program states. For instance, assume that π = π(λ), where λ = {λ_i} is a probability distribution. This means that, for 0 ≤ p ≤ 1 and any two parametrizations λ and λ′, we may write

π[pλ + (1 − p)λ′] = pπ(λ) + (1 − p)π(λ′).   (38)

Then the problem remains convex in λ and we may therefore find the global minimum in these parameters. It is clear that this global minimum λ̃ identifies a program state π(λ̃) that is not generally the optimal state π̃ in the entire program space S, even though it may be a convenient solution for experimental applications.

Note that a possible classical parametrization consists of using classical program states, of the form

π(λ) = Σ_i λ_i |φ_i⟩⟨φ_i|,   (39)

where {|φ_i⟩} is an orthonormal basis in the program space. Convex combinations of probability distributions therefore define a convex set of classical program states

S_class = {π = Σ_i λ_i |φ_i⟩⟨φ_i|, ⟨φ_i|φ_j⟩ = δ_ij}.   (40)

Optimizing over this specific subspace corresponds to optimizing the programmable quantum processor over classical programs. It is clear that the global minima in S_class and S are expected to be very different. For instance, S_class certainly cannot include Choi matrices, which are usually very good quantum programs.

Gradient-based optimization
As discussed in the main text, the SDP formulation allows the use of powerful and accurate numerical methods, such as the interior-point method. However, these algorithms are not suitable for high-dimensional problems, due to their higher computational and memory requirements. Therefore, an alternative approach (useful for larger program states) consists of the optimization of the larger but easier-to-compute cost functions C = C1 (trace distance) or C_F (infidelity), for which we can use first-order methods. Indeed, according to Theorem 1, all of the proposed cost functions C : S → R are convex over the program space S and, therefore, we can solve the optimization min_{π∈S} C(π) by using gradient-based algorithms.

Gradient-based convex optimization is at the heart of many popular ML techniques, such as online learning in a high-dimensional feature space17, missing-value estimation problems18, text classification, image ranking, and optical character recognition53, to name a few. In all of the above applications, "learning" corresponds to the following minimization problem: min_{x∈S} f(x), where f(x) is a convex function and S is a convex set. Quantum learning falls into this category, as the space of program states is convex due to the linearity of the processor map and the fact that cost functions are typically convex in this space (see Theorem 1).

Gradient-based approaches are among the most applied methods for convex optimization of non-linear, possibly non-smooth functions39. When the cost function is not differentiable, we cannot formally define its gradient. Nonetheless, we can always define the subgradient ∂C of C as in Eq. (12), which in principle contains many points. When C is not only convex but also differentiable, then ∂C(π) = {∇C(π)}, i.e., the subgradient contains a single element, the gradient ∇C, that can be obtained via the Fréchet derivative of C (for more details see Supplementary Note 4). When C is not differentiable, the gradient formula still provides an element of the subgradient that can be used in the minimization algorithm.

In order to compute the gradient ∇C, it is convenient to consider the Kraus decomposition of the processor map Λ. Let us write

Λ(π) = Σ_k A_k π A_k†,   (41)

with Kraus operators A_k. We then define the dual map Λ* of the processor as the (generally non-trace-preserving) map given by the following decomposition:

Λ*(ρ) = Σ_k A_k† ρ A_k.   (42)

With these definitions in hand, we can now prove Theorem 2, which we rewrite here for convenience.

Theorem 2 Suppose we use a quantum processor Q with map Λ(π) = χ_π in order to approximate the Choi matrix χ_E of an arbitrary channel E. Then, the gradients of the trace distance C1(π) and the infidelity C_F(π) are given by the following analytical formulas:

∇C1(π) = Σ_k sign(λ_k) Λ*(P_k),   (43)

∇C_F(π) = −2 √(1 − C_F(π)) ∇F(π),   (44)

∇F(π) = ½ Λ*[√χ_E (√χ_E Λ(π) √χ_E)^(−1/2) √χ_E],   (45)

where λ_k (P_k) are the eigenvalues (eigenprojectors) of the Hermitian operator χ_π − χ_E. When C1(π) or C_F(π) are not differentiable at π, then the above expressions provide an element of the subgradient ∂C(π).

Proof We prove the above theorem assuming that the functions are differentiable at the program π. For non-differentiable points, the only difference is that the above analytical expressions are not unique and provide only one of the possibly infinite elements of the subgradient. Further details of this mathematical proof are given in Supplementary Note 4. Following matrix differentiation, for any function f(A) = Tr[g(A)] of a matrix A, we may write

d Tr[g(A)] = Tr[g′(A) dA],   (46)

and the gradient is ∇f(A) = g′(A). Both the trace distance and fidelity cost functions can be written in this form. To find the explicit gradient of the fidelity function, we first note that, by linearity, we may write

Λ(π + δπ) = Λ(π) + Λ(δπ),   (47)

and therefore the following expansion:

√χ_E Λ(π + δπ) √χ_E = √χ_E Λ(π) √χ_E + √χ_E Λ(δπ) √χ_E.   (48)

From this equation and differential calculations of the fidelity (see Supplementary Note 4.2 for details), we find

dF = ½ Tr[(√χ_E Λ(π) √χ_E)^(−1/2) √χ_E Λ(δπ) √χ_E],   (49)

where dF = F(π + δπ) − F(π). Then, using the cyclic property of the trace, we get

dF = ½ Tr{Λ*[√χ_E (√χ_E Λ(π) √χ_E)^(−1/2) √χ_E] δπ}.   (50)

Exploiting this expression in Eq. (46), we get the gradient ∇F(π) as in Eq. (45). The other Eq. (44) simply follows from applying the definition in Eq. (8).

For the trace distance, let us write the eigenvalue decomposition

χ_π − χ_E = Σ_k λ_k P_k.   (51)

Then, using the linearity in Eq. (47), the definition of a processor map in Eq. (5), and differential calculations of the trace distance (see Supplementary Note 4.3 for details), we can write

dC1(π) = Σ_k sign(λ_k) Tr[P_k Λ(dπ)]
 = Σ_k sign(λ_k) Tr[Λ*(P_k) dπ]   (52)
 = Tr{Λ*[sign(χ_π − χ_E)] dπ}.
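The Kraus and dual-map definitions of Eqs. (41)–(42) can be exercised numerically. A minimal numpy sketch (the random Kraus construction is illustrative) verifying that Λ* is the adjoint of Λ with respect to the Hilbert–Schmidt inner product, and that Λ* is unital whenever Λ is trace preserving:

```python
import numpy as np

rng = np.random.default_rng(2)

def random_kraus(d, n):
    """Random Kraus operators {A_k} with sum_k A_k^dag A_k = I."""
    A = rng.normal(size=(n, d, d)) + 1j * rng.normal(size=(n, d, d))
    S = sum(a.conj().T @ a for a in A)
    w, V = np.linalg.eigh(S)
    S_inv_sqrt = (V / np.sqrt(w)) @ V.conj().T   # S^{-1/2}
    return [a @ S_inv_sqrt for a in A]

kraus = random_kraus(3, 4)
Lam = lambda X: sum(a @ X @ a.conj().T for a in kraus)       # Eq. (41)
Lam_dual = lambda X: sum(a.conj().T @ X @ a for a in kraus)  # Eq. (42)

# Adjoint property: Tr[Lam(pi) B] = Tr[pi Lam*(B)] for all pi, B
pi = rng.normal(size=(3, 3)); pi = pi @ pi.T; pi /= np.trace(pi)
B = rng.normal(size=(3, 3)); B = (B + B.T) / 2
assert np.isclose(np.trace(Lam(pi) @ B), np.trace(pi @ Lam_dual(B)))

# Trace preservation of Lam  <=>  Lam* is unital
assert np.isclose(np.trace(Lam(pi)).real, 1.0)
assert np.allclose(Lam_dual(np.eye(3)), np.eye(3))
```

It is this adjoint property that moves the processor map off the differential dπ in Eqs. (50) and (52), yielding the closed-form gradients (43)–(45).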

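The gradient formula (43) and the analytic projection of Theorem 3 below (Eqs. (57)–(58)) can be combined into a working projected subgradient loop. The following sketch is illustrative only: it takes the processor map to be the identity, so the algorithm simply learns a target Choi matrix by minimizing its trace-norm cost, and uses the learning rate α_k ∝ 1/√k with best-iterate tracking:

```python
import numpy as np

rng = np.random.default_rng(3)

def project_to_states(X):
    """P_S(X), Eqs. (54), (57), (58): diagonalize, project spectrum onto the simplex."""
    x_inc, U = np.linalg.eigh(X)           # eigenvalues in increasing order
    x = x_inc[::-1]                        # decreasing order, as in Theorem 3
    csum = np.cumsum(x)
    ks = np.arange(1, len(x) + 1)
    s = ks[x > (csum - 1) / ks].max()      # Eq. (58)
    theta = (csum[s - 1] - 1) / s          # Eq. (57)
    lam = np.maximum(x - theta, 0)[::-1]   # back to increasing order, matching U
    return (U * lam) @ U.conj().T

def trace_norm(X):
    return np.sum(np.abs(np.linalg.eigvalsh(X)))

def sign_op(X):
    w, V = np.linalg.eigh(X)
    return (V * np.sign(w)) @ V.conj().T

# Toy problem: identity processor map (chi_pi = pi), learn the target Choi matrix
d = 4
A = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
chi_E = A @ A.conj().T
chi_E /= np.trace(chi_E).real

pi = np.eye(d) / d                         # maximally mixed initial program
best = trace_norm(pi - chi_E)
for i in range(1, 3001):
    g = sign_op(pi - chi_E)                # subgradient, Eq. (43) with Lam* = id
    pi = project_to_states(pi - 0.5 / np.sqrt(i) * g)   # update rule (55)
    best = min(best, trace_norm(pi - chi_E))

assert np.isclose(np.trace(pi).real, 1.0)  # iterates stay proper quantum states
assert best < 0.15                         # best iterate approaches the optimum C1 = 0
```

Every iterate is a valid density matrix by construction, which is exactly the constraint that distinguishes quantum program learning from unconstrained convex optimization.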
From the definition of the gradient in Eq. (46), we finally get

∇C1(π) = Λ*[sign(χ_π − χ_E)],   (53)

which leads to the result in Eq. (43). ■

The above results in Eqs. (43) and (44) can be used together with the projected subgradient method14 or the conjugate gradient algorithm15,16 to iteratively find the optimal program state in the minimization min_{π∈S} C(π) for C = C1 or C_F. In the following sections, we present the two algorithms and show how they can be adapted to our problem.

Projected subgradient methods have the advantage of simplicity and the ability to optimize non-smooth functions, but can be slower, with a convergence rate O(ε⁻²) for a desired accuracy ε. Conjugate gradient methods15,16 have a faster convergence rate O(ε⁻¹), provided that the cost function is smooth. This convergence rate can be improved even further to O(ε⁻¹ᐟ²) for strongly convex functions54 or by using Nesterov's accelerated gradient method41. The technical difficulty in adapting these methods to learning program states is that the latter is a constrained optimization problem, namely at each iteration step the program must be a proper quantum state, and the cost functions coming from quantum information theory are generally non-smooth.

Projected subgradient method
Given the space S of program states, let us define the projection P_S onto S as

P_S(X) = argmin_{π∈S} ‖X − π‖₂,   (54)

where argmin is the argument of the minimum, namely the closest state π ∈ S to the operator X. Then, a first-order algorithm to solve min_{π∈S} C(π) is the projected subgradient method14,39, which iteratively applies the iteration (16), rewritten below for convenience:

1) Select an operator g_i from ∂C(π_i),   (55)
2) Update π_{i+1} = P_S(π_i − α_i g_i),

where i is the iteration index and α_i a learning rate. The above algorithm differs from standard gradient methods in two aspects: (i) the update rule is based on the subgradient, which is defined even for non-smooth functions; (ii) the operator π_i − α_i g_i is generally not a quantum state, so the algorithm fixes this issue by projecting that operator back to the closest quantum state, via Eq. (54). The algorithm converges to the optimal solution π_* (approximating the optimal program π̃) as14

C(π_i) − C(π_*) ≤ (e₁ + G² Σ_{k=1}^i α_k²) / (2 Σ_{k=1}^i α_k) = ε,   (56)

where e₁ = ‖π₁ − π_*‖₂² is the initial error (in Frobenius norm) and G is such that ‖g‖₂ ≤ G for any g ∈ ∂C. Popular choices for the learning rate that assure convergence are α_k ∝ 1/√k and α_k = a/(b + k) for some a, b > 0.

In general, the projection step is the major drawback, which often limits the applicability of the projected subgradient method to practical problems. Indeed, projections like Eq. (54) require another full optimization at each iteration that might be computationally intensive. Nonetheless, we show in the following theorem that this issue does not occur in learning quantum states, because the resulting optimization can be solved analytically.

Theorem 3 Let X be a Hermitian operator in a d-dimensional Hilbert space with spectral decomposition X = UxU†, where the eigenvalues x_j are ordered in decreasing order. Then P_S(X) of Eq. (54) is given by

P_S(X) = UλU†,  λ_i = max{x_i − θ, 0},   (57)

where θ = (1/s)(Σ_{j=1}^s x_j − 1) and

s = max{k ∈ [1, …, d] : x_k > (1/k)(Σ_{j=1}^k x_j − 1)}.   (58)

Proof Any quantum (program) state can be written in the diagonal form π = VλV†, where V is a unitary matrix and λ is the vector of eigenvalues in decreasing order, with λ_j ≥ 0 and Σ_j λ_j = 1. To find the optimal state, it is required to find both the optimal unitary V and the optimal eigenvalues λ with the above property, i.e.,

P_S(X) = argmin_{V,λ} ‖X − VλV†‖₂.   (59)

For any unitarily invariant norm, the following inequality holds (ref. 55, Eq. IV.64):

‖X − π‖₂ ≥ ‖x − λ‖₂,   (60)

with equality when U = V, where X = UxU† is a spectral decomposition of X such that the x_j's are in decreasing order. This shows that the optimal unitary in Eq. (59) is the diagonalization matrix of the operator X. The eigenvalues of any density operator form a probability simplex, so the optimal eigenvalues λ are then obtained thanks to Algorithm 1 from ref. 17. ■

In the following section, we present an alternative algorithm with faster convergence rates, but stronger requirements on the function to be optimized.

Conjugate gradient method
The conjugate gradient method15,39, sometimes called the Frank–Wolfe algorithm, has been developed to provide a better convergence speed and to avoid the projection step at each iteration. Although the latter can be explicitly computed for quantum states (thanks to our Theorem 3), having a faster convergence rate is important, especially with higher-dimensional Hilbert spaces. The downside of this method is that it necessarily requires a differentiable cost function C, with gradient ∇C.

In its standard form, the conjugate gradient method to approximate the solution of argmin_{π∈S} C(π) is defined by the following iterative rule:

1) Find σ = argmin_{σ∈S} Tr[σ ∇C(π_i)],   (61)
2) π_{i+1} = π_i + (2/(i+2))(σ − π_i) = (i/(i+2)) π_i + (2/(i+2)) σ.

The first step in the above iteration rule is solved by finding the eigenvector |σ⟩ of ∇C(π_i) with the smallest eigenvalue. Indeed, since π is an operator and C(π) a scalar, the gradient ∇C is an operator with the same dimension as π. Therefore, for learning quantum programs, we find the iteration (17), which we rewrite below for convenience:

1) Find the eigenvector |σ_i⟩ of ∇C(π_i) with the smallest eigenvalue,   (62)
2) π_{i+1} = (i/(i+2)) π_i + (2/(i+2)) |σ_i⟩⟨σ_i|.

When the gradient of C is Lipschitz continuous with constant L, the conjugate gradient method converges after O(L/ε) steps16,41. The following iteration with adaptive learning rate α_i has even faster convergence rates, provided that C is strongly convex54:

1) Find the eigenvector |σ_i⟩ of ∇C(π_i) with the smallest eigenvalue,
2) Find α_i = argmin_{α∈[0,1]} [α ⟨τ_i, ∇C(π_i)⟩ + (α²/2) β_C ‖τ_i‖_C²], for τ_i = |σ_i⟩⟨σ_i| − π_i,   (63)
3) π_{i+1} = (1 − α_i) π_i + α_i |σ_i⟩⟨σ_i|,

where the constant β_C and the norm ‖·‖_C depend on C (ref. 54).

In spite of the faster convergence rate, conjugate gradient methods require smooth cost functions (so that the gradient ∇C is well defined at every point). However, cost functions based on the trace distance (7) are not smooth. For instance, in one-dimensional spaces the trace distance reduces to the absolute value |x|, which is non-analytic at x = 0. When some eigenvalues are close to zero, conjugate gradient methods may display unexpected behaviors, though we have numerically observed that convergence is always obtained with a careful choice of the learning rate. In the next section, we show how to formally justify the applicability of the conjugate gradient method, following Nesterov's smoothing prescription41.

Smoothing: smooth trace distance
The conjugate gradient method converges to the global optimum after O(L/ε) steps, provided that the gradient of C is L-Lipschitz continuous41. However, the constant L can diverge for non-smooth functions like the trace distance (7), so the convergence of the algorithm cannot be formally stated, although it may still be observed in numerical simulations. To solidify the convergence proof (see also Supplementary Note 5.2), we introduce a smooth approximation to the trace distance. This is defined by the following cost function, which is differentiable at every point:

C_μ(π) = Tr[h_μ(χ_π − χ_E)] = Σ_j h_μ(λ_j),   (64)

where λ_j are the eigenvalues of χ_π − χ_E and h_μ is the so-called Huber
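The smooth cost C_μ of Eq. (64) and its sandwich bound, Eq. (66), can be checked numerically. A short numpy sketch (illustrative: the processor map is again taken to be the identity, so χ_π = π):

```python
import numpy as np

def huber(x, mu):
    """Huber penalty h_mu, Eq. (65): quadratic below mu, linear above."""
    return np.where(np.abs(x) < mu, x**2 / (2 * mu), np.abs(x) - mu / 2)

def smooth_trace_cost(pi, chi_E, mu):
    """C_mu(pi) = sum_j h_mu(lambda_j), Eq. (64)."""
    lam = np.linalg.eigvalsh(pi - chi_E)
    return np.sum(huber(lam, mu))

def trace_cost(pi, chi_E):
    return np.sum(np.abs(np.linalg.eigvalsh(pi - chi_E)))

rng = np.random.default_rng(4)
d, mu = 4, 0.01
A = rng.normal(size=(d, d))
chi_E = A @ A.T
chi_E /= np.trace(chi_E)
pi = np.eye(d) / d

# Sandwich bound, Eq. (66): C_mu <= C_1 <= C_mu + mu*d/2
c_mu = smooth_trace_cost(pi, chi_E, mu)
c_1 = trace_cost(pi, chi_E)
assert c_mu <= c_1 <= c_mu + mu * d / 2 + 1e-12
```

Since h_μ is differentiable everywhere, C_μ admits the Lipschitz gradient needed for the formal Frank–Wolfe convergence guarantee, at the cost of the controlled μd/2 bias visible in the bound.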

penalty function:

h_μ(x) := x²/(2μ)  if |x| < μ,  |x| − μ/2  if |x| ≥ μ.   (65)

The original definition of the trace distance, C1 in Eq. (7), is recovered for μ → 0 and, for any non-zero μ, C_μ bounds C1 as follows:

C_μ(π) ≤ C1(π) ≤ C_μ(π) + μd/2,   (66)

where d is the dimension of the program state π. In Supplementary Note 5.2, we then prove the following result.

Theorem 4 The smooth cost function C_μ(π) is a convex function over program states and its gradient is given by

∇C_μ(π) = Λ*[h′_μ(χ_π − χ_E)],   (67)

where h′_μ is the derivative of h_μ. Moreover, the gradient is L-Lipschitz continuous with

L = d/μ,   (68)

where d is the dimension of the program state.

Being Lipschitz continuous, the conjugate gradient algorithm and its variants41,54 converge up to an accuracy ε after O(L/ε) steps. In some applications, it is desirable to analyze the convergence in trace distance in the limit of large program states, namely for d → ∞. The parameter μ can be chosen such that the smooth trace distance converges to the trace distance, namely C_μ → C1 for d → ∞. Indeed, given the inequality (66), a possibility is to set μ = O(d^(−(1+η))) for some η > 0 so that, from Eq. (68), convergence to the trace norm is achieved after O(d^(2+η)) steps.

DATA AVAILABILITY
The datasets generated and analyzed during the current study are available from the corresponding author on reasonable request.

CODE AVAILABILITY
The codes used for this study are available from the corresponding author on reasonable request.

Received: 21 November 2019; Accepted: 25 March 2020

REFERENCES
1. Nielsen, M. A. & Chuang, I. L. Programmable quantum gate arrays. Phys. Rev. Lett. 79, 321 (1997).
2. Nielsen, M. A. & Chuang, I. L. Quantum Computation and Quantum Information (Cambridge University Press, Cambridge, 2000).
3. Watrous, J. The Theory of Quantum Information (Cambridge Univ. Press, 2018).
4. Knill, E., Laflamme, R. & Milburn, G. J. A scheme for efficient quantum computation with linear optics. Nature 409, 46 (2001).
5. Bishop, C. M. Pattern Recognition and Machine Learning (Springer, 2006).
6. Wittek, P. Quantum Machine Learning: What Quantum Computing Means to Data Mining (Academic Press, Elsevier, 2014).
7. Biamonte, J. et al. Quantum machine learning. Nature 549, 195 (2017).
8. Dunjko, V. & Briegel, H. J. Machine learning & artificial intelligence in the quantum domain: a review of recent progress. Rep. Prog. Phys. 81, 074001 (2018).
9. Schuld, M., Sinayskiy, I. & Petruccione, F. An introduction to quantum machine learning. Contemp. Phys. 56, 172–185 (2015).
10. Ciliberto, C. et al. Quantum machine learning: a classical perspective. Proc. R. Soc. A 474, 20170551 (2018).
11. Tang, E. A quantum-inspired classical algorithm for recommendation systems. In Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing (STOC 2019), 217–228 (ACM, New York, 2019). https://doi.org/10.1145/3313276.3316310
12. Tang, E. Quantum-inspired classical algorithms for principal component analysis and supervised clustering. Preprint at https://arxiv.org/abs/1811.00414 (2018).
13. Kitaev, A. Y., Shen, A. & Vyalyi, M. N. Classical and Quantum Computation, Vol 47 (American Mathematical Society, Providence, Rhode Island, 2002).
14. Boyd, S., Xiao, L. & Mutapcic, A. Subgradient methods. Lecture notes of EE392o, Stanford University, Autumn Quarter 2003–2004.
15. Jaggi, M. Convex optimization without projection steps. Preprint at https://arxiv.org/abs/1108.1170 (2011).
16. Jaggi, M. Revisiting Frank–Wolfe: projection-free sparse convex optimization. In Proceedings of the 30th International Conference on Machine Learning, Vol 28, 427 (2013).
17. Duchi, J., Shalev-Shwartz, S., Singer, Y. & Chandra, T. Efficient projections onto the l1-ball for learning in high dimensions. In Proceedings of the 25th International Conference on Machine Learning, 272–279 (ACM, 2008).
18. Liu, J., Musialski, P., Wonka, P. & Ye, J. Tensor completion for estimating missing values in visual data. IEEE Trans. Pattern Anal. Mach. Intell. 35, 208–220 (2013).
19. Ishizaka, S. & Hiroshima, T. Asymptotic teleportation scheme as a universal programmable quantum processor. Phys. Rev. Lett. 101, 240501 (2008).
20. Ishizaka, S. & Hiroshima, T. Quantum teleportation scheme by selecting one of multiple output ports. Phys. Rev. A 79, 042306 (2009).
21. Ishizaka, S. Some remarks on port-based teleportation. Preprint at https://arxiv.org/abs/1506.01555 (2015).
22. Lloyd, S. Universal quantum simulators. Science 273, 1073–1078 (1996).
23. Pirandola, S., Laurenza, R., Ottaviani, C. & Banchi, L. Fundamental limits of repeaterless quantum communications. Nat. Commun. 8, 15043 (2017).
24. Pirandola, S. et al. Theory of channel simulation and bounds for private communication. Quant. Sci. Tech. 3, 035009 (2018).
25. Nechita, I., Puchała, Z., Pawela, Ł. & Życzkowski, K. Almost all quantum channels are equidistant. J. Math. Phys. 59, 052201 (2018).
26. Fuchs, C. A. & van de Graaf, J. Cryptographic distinguishability measures for quantum-mechanical states. IEEE Trans. Info. Theory 45, 1216–1227 (1999).
27. Pinsker, M. S. Information and Information Stability of Random Variables and Processes (Holden-Day, San Francisco, 1964).
28. Carlen, E. A. & Lieb, E. H. Bounds for entanglement via an extension of strong subadditivity of entropy. Lett. Math. Phys. 101, 1–11 (2012).
29. Watrous, J. Semidefinite programs for completely bounded norms. Theory Comput. 5, 217–238 (2009).
30. Watrous, J. Simpler semidefinite programs for completely bounded norms. Chicago J. Theor. Comput. Sci. 8, 1–19 (2013).
31. Vandenberghe, L. & Boyd, S. Semidefinite programming. SIAM Rev. 38, 49–95 (1996).
32. Chao, H.-H. First-Order Methods for Trace Norm Minimization (University of California, Los Angeles, 2013).
33. Monteiro, R. D. C. First- and second-order methods for semidefinite programming. Math. Program. 97, 209–244 (2003).
34. Spall, J. C. Adaptive stochastic approximation by the simultaneous perturbation method. IEEE Trans. Automat. Contr. 45, 1839–1853 (2000).
35. Zhuang, Q. & Zhang, Z. Physical-layer supervised learning assisted by an entangled sensor network. Phys. Rev. X 9, 041023 (2019).
36. Harrow, A. & Napp, J. Low-depth gradient measurements can improve convergence in variational hybrid quantum-classical algorithms. Preprint at https://arxiv.org/abs/1901.05374 (2019).
37. Cai, J.-F., Candès, E. J. & Shen, Z. A singular value thresholding algorithm for matrix completion. SIAM J. Optimiz. 20, 1956–1982 (2010).
38. Recht, B., Fazel, M. & Parrilo, P. A. Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM Rev. 52, 471–501 (2010).
39. Nesterov, Y. Introductory Lectures on Convex Optimization: A Basic Course, Vol 87 (Springer Science & Business Media, New York, 2013).
40. Coutts, B., Girard, M. & Watrous, J. Certifying optimality for convex quantum channel optimization problems. Preprint at https://arxiv.org/abs/1810.13295 (2018).
41. Nesterov, Y. Smooth minimization of non-smooth functions. Math. Program. 103, 127–152 (2005).
42. Khaneja, N., Reiss, T., Kehlet, C., Schulte-Herbrüggen, T. & Glaser, S. J. Optimal control of coupled spin dynamics: design of NMR pulse sequences by gradient ascent algorithms. J. Magn. Reson. 172, 296–305 (2005).
43. Banchi, L., Pancotti, N. & Bose, S. Quantum gate learning in qubit networks: Toffoli gate without time-dependent control. npj Quantum Info. 2, 16019 (2016).
44. Innocenti, L., Banchi, L., Ferraro, A., Bose, S. & Paternostro, M. Supervised learning of time-independent Hamiltonians for gate design. New J. Phys. (in press) https://doi.org/10.1088/1367-2630/ab8aaf (2020).
45. Mitarai, K., Negoro, M., Kitagawa, M. & Fujii, K. Quantum circuit learning. Phys. Rev. A 98, 032309 (2018).
46. Bennett, C. H. et al. Teleporting an unknown quantum state via dual classical and Einstein-Podolsky-Rosen channels. Phys. Rev. Lett. 70, 1895 (1993).
47. Pirandola, S., Eisert, J., Weedbrook, C., Furusawa, A. & Braunstein, S. L. Advances in quantum teleportation. Nat. Photon. 9, 641–652 (2015).
48. Pirandola, S., Laurenza, R., Lupo, C. & Pereira, J. L. Fundamental limits to quantum channel discrimination. npj Quantum Info. 5, 50 (2019).
49. Lloyd, S. Almost any quantum logic gate is universal. Phys. Rev. Lett. 75, 346 (1995).
50. D'Ariano, G. M. & Perinotti, P. Efficient universal programmable quantum measurements. Phys. Rev. Lett. 94, 090401 (2005).
51. Pirandola, S., Bardhan, B. R., Gehring, T., Weedbrook, C. & Lloyd, S. Advances in photonic quantum sensing. Nat. Photon. 12, 724–733 (2018).
52. Uhlmann, A. The transition probability. Rep. Math. Phys. 9, 273–279 (1976).
53. Duchi, J., Hazan, E. & Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res. 12, 2121–2159 (2011).
54. Garber, D. & Hazan, E. Faster rates for the Frank–Wolfe method over strongly-convex sets. In Proceedings of the 32nd International Conference on Machine Learning, Vol 37, 541–549 (2015).
55. Bhatia, R. Matrix Analysis, Vol 169 (Springer Science & Business Media, New York, 2013).

ADDITIONAL INFORMATION
Supplementary information is available for this paper at https://doi.org/10.1038/s41534-020-0268-2.

Correspondence and requests for materials should be addressed to L.B.

Reprints and permission information is available at http://www.nature.com/reprints

Publisher's note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

ACKNOWLEDGEMENTS
L.B. acknowledges support by the program "Rita Levi Montalcini" for young researchers. S.P. and J.P. acknowledge support by the EPSRC via the 'UK Quantum Communications Hub' (Grants EP/M013472/1 and EP/T001011/1), and S.P. acknowledges support by the European Union via the project 'Continuous Variable Quantum Communications' (CiViQ, no 820466).

AUTHOR CONTRIBUTIONS
All authors contributed to prove the main theoretical results and to the writing of the manuscript.

COMPETING INTERESTS
The authors declare no competing interests.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

© The Author(s) 2020
