This is a repository copy of Convex optimization of programmable quantum computers.

White Rose Research Online URL for this paper: https://eprints.whiterose.ac.uk/160963/

Version: Published Version

Article: Banchi, Leonardo, Pereira, Jason, Lloyd, Seth and Pirandola, Stefano (2020) Convex optimization of programmable quantum computers. npj Quantum Information 6:42 (2020). ISSN 2056-6387. https://doi.org/10.1038/s41534-020-0268-2

Reuse This article is distributed under the terms of the Creative Commons Attribution (CC BY) licence. This licence allows you to distribute, remix, tweak, and build upon the work, even commercially, as long as you credit the authors for the original work. More information and the full terms of the licence here: https://creativecommons.org/licenses/

Takedown If you consider content in White Rose Research Online to be in breach of UK law, please notify us by emailing [email protected] including the URL of the record and the reason for the withdrawal request.

[email protected] https://eprints.whiterose.ac.uk/ www.nature.com/npjqi

ARTICLE OPEN
Convex optimization of programmable quantum computers
Leonardo Banchi 1,2 ✉, Jason Pereira 3, Seth Lloyd 4,5 and Stefano Pirandola 3,5

A fundamental model of quantum computation is the programmable quantum gate array. This is a quantum processor that is fed by a program state that induces a corresponding quantum operation on input states. While being programmable, any finite-dimensional design of this model is known to be nonuniversal, meaning that the processor cannot perfectly simulate an arbitrary quantum channel over the input. Characterizing how close the simulation is and finding the optimal program state have been open questions for the past 20 years. Here, we answer these questions by showing that the search for the optimal program state is a convex optimization problem that can be solved via semidefinite programming and gradient-based methods commonly employed for machine learning. We apply this general result to different types of processors, from a shallow design based on quantum teleportation, to deeper schemes relying on port-based teleportation and parametric quantum circuits.

npj Quantum Information (2020) 6:42; https://doi.org/10.1038/s41534-020-0268-2

INTRODUCTION
Back in 1997 a seminal work by Nielsen and Chuang1 proposed a quantum version of the programmable gate array that has become a fundamental model for quantum computation2. This is a quantum processor where a fixed quantum operation is applied to an input state together with a program state. The aim of the program state is to induce the processor to apply some desired quantum gate or channel3 to the input state. Such a feature of quantum programmability comes with a cost: the model cannot be universal, unless the program state is allowed to have an infinite dimension, i.e., infinite qubits1,4. Even though this limitation has been known for many years, there is still no exact characterization of how well a finite-dimensional programmable quantum processor can generate or simulate an arbitrary quantum channel. Also, there is no literature on how to find the corresponding optimal program state, or even to show that this state can indeed be found by some optimization procedure. Here, we show the solutions to these long-standing open problems.

Here, we show that the optimization of programmable quantum computers is a convex problem for which the solution can always be found by means of classical semidefinite programming (SDP), and classical gradient-based methods that are commonly employed for machine learning (ML) applications. ML methods have found wide applicability across many disciplines5, and we are currently witnessing the development of new hybrid areas of investigation, where ML methods are interconnected with quantum information theory, such as quantum-enhanced ML (refs. 6–10; e.g., quantum neural networks, quantum annealing, etc.), protocols of quantum-inspired ML (e.g., for recommendation systems11 or component analysis and supervised clustering12), and classical learning methods applied to quantum computers, as explored here in this manuscript.

In our work, we quantify the error between an arbitrary target channel and its programmable simulation in terms of the diamond distance3,13, and other suitable cost functions, including the trace distance and the quantum fidelity. For all the considered cost functions, we are able to show that the minimization of the simulation error is a convex optimization problem in the space of the program states. This already solves an outstanding problem that affects various models of quantum computers (e.g., variational quantum circuits), where the optimization over classical parameters is non-convex and therefore not guaranteed to converge to a global optimum. By contrast, because our problem is proven to be convex, we can use SDP to minimize the diamond distance and always find the optimal program state for the simulation of a target channel, therefore optimizing the programmable quantum processor. Similarly, we may find suboptimal solutions by minimizing the trace distance or the quantum fidelity by means of gradient-based techniques adapted from the ML literature, such as the projected subgradient method14 and the conjugate gradient method15,16. We note indeed that the minimization of the ℓ1-norm, mathematically related to the quantum trace distance, is widely employed in many ML tasks17,18, so many of those techniques can be adapted for learning program states.

With these general results in our hands, we first discuss the optimal learning of arbitrary unitaries with a generic programmable quantum processor. Then, we consider specific designs of the processor, from a shallow scheme based on the teleportation protocol, to higher-depth designs based on port-based teleportation (PBT)19–21 and parametric quantum circuits (PQCs)22, introducing a suitable convex reformulation of the latter. In the various cases, we benchmark the processors for the simulation of basic unitary gates (qubit rotations) and various basic channels, including the amplitude damping channel that is known to be the most difficult to simulate23,24. For the deeper designs, we find that the optimal program states do not correspond to the Choi matrices of the target channels, which is rather counterintuitive and unexpected.

RESULTS
We first present our main theoretical results on how to train the program state of programmable quantum processors, either via convex optimization or first-order gradient-based algorithms. We then apply our general methods to study the learning of arbitrary unitaries, and the simulation of different channels via processors

1Department of Physics and Astronomy, University of Florence, via G. Sansone 1, I-50019 Sesto Fiorentino (FI), Italy. 2INFN Sezione di Firenze, via G. Sansone 1, I-50019 Sesto Fiorentino (FI), Italy. 3Department of Computer Science, University of York, York YO10 5GH, UK. 4Department of Mechanical Engineering, Massachusetts Institute of Technology (MIT), Cambridge, MA 02139, USA. 5Research Laboratory of Electronics, Massachusetts Institute of Technology (MIT), Cambridge, MA 02139, USA. ✉email: leonardo.banchi@unifi.it

Published in partnership with The University of New South Wales

built either from quantum teleportation and its generalization, or from PQCs.

Fig. 1: Quantum processor Q with program state π that simulates a quantum channel E_π from input to output. We also show the CPTP map Λ of the processor, from the program state π to the output Choi matrix χ_π (generated by partial transmission of the maximally entangled state Φ).

Programmable quantum processors
Let us consider an arbitrary mapping from d-dimensional input states into d′-dimensional output states, where d′ ≠ d in the general case. This is described by a quantum channel E that may represent the overall action of a quantum computation and does not need to be a unitary transformation. Any channel E can be simulated by means of a programmable quantum processor, which is modeled in general by a fixed completely positive trace-preserving (CPTP) map Q that is applied to both the input state and a variable program state π. In this way, the processor transforms the input state by means of an approximate channel E_π as

  E_π(ρ) = Tr₂[Q(ρ ⊗ π)],  (1)

where Tr₂ is the partial trace over the program state. A fundamental result is that there is no fixed quantum processor Q that is able to exactly simulate any quantum channel1. In other terms, given E, we cannot find the corresponding program π such that E_π = E. Yet simulation can be achieved in an approximate sense, where the quality of the simulation may increase for larger program dimension. In general, the open problem is to determine the optimal program state π̃ that minimizes the simulation error, which can be quantified by the cost function

  C◇(π) := ||E − E_π||◇,  (2)

namely the diamond distance3,13 between the target channel E and its simulation E_π. In other words,

  Find π̃ such that C◇(π̃) = min_π C◇(π).  (3)

From theory1,4, we know that we cannot achieve C◇ = 0 for arbitrary E unless π and Q have infinite dimensions. As a result, for any finite-dimensional realistic design of the quantum processor, finding the optimal program state π̃ is an open problem. Recall that the diamond distance is defined by ||E − E_π||◇ := max_φ ||I ⊗ E(φ) − I ⊗ E_π(φ)||₁, where I is the identity map and ||O||₁ := Tr√(O†O) is the trace norm2.

It is important to note that this problem can be reduced to a simpler one by introducing the channel's Choi matrix

  χ_{E_π} = I ⊗ E_π(Φ) = d⁻¹ Σ_{ij} |i⟩⟨j| ⊗ Tr₂[Q(|i⟩⟨j| ⊗ π)],  (4)

where Φ := |Φ⟩⟨Φ| is a d-dimensional maximally entangled state. From this expression, it is clear that the Choi matrix χ_{E_π} is linear in the program state π. More precisely, the Choi matrix χ_π at the output of the processor Q can be directly written as a CPTP linear map Λ acting on the space of the program states π, i.e.,

  χ_π := χ_{E_π} = Λ(π).  (5)

This map is also depicted in Fig. 1, and fully describes the action of the processor Q. Then, using results from refs. 3,25,26, we may write

  C◇(π) ≤ d C₁(π) ≤ 2d √(C_F(π)),  (6)

where

  C₁(π) := ||χ_E − χ_π||₁  (7)

is the trace distance2 between target and simulated Choi matrices, and

  C_F(π) := 1 − F(π)²,  (8)

where F(π) is Bures' fidelity between the two Choi matrices χ_E and χ_π, i.e.,

  F(π) := ||√χ_E √χ_π||₁ = Tr√(√χ_E χ_π √χ_E).  (9)

Another possible upper bound can be written using the quantum Pinsker's inequality27,28. In fact, we may write C₁(π) ≤ √(2 ln 2) √(C_R(π)), where

  C_R(π) := min{S(χ_E||χ_π), S(χ_π||χ_E)},  (10)

and S(ρ||σ) := Tr[ρ(log₂ρ − log₂σ)] is the quantum relative entropy between ρ and σ. In Supplementary Note 1.3, we also introduce a cost function C_p(π) based on the Schatten p-norm.

Convex optimization
One of the main problems in the optimization of reconfigurable quantum chips is that the relevant cost functions are not convex in the set of classical parameters. This problem is completely solved here thanks to the fact that the optimization of a programmable quantum processor is done with respect to a quantum state. In fact, in the methods section, we prove the following

Theorem 1: Consider the simulation of a target quantum channel E by means of a programmable quantum processor Q. The optimization of the cost functions C◇, C₁, C_F, C_R, or C_p is a convex problem in the space of program states π. In particular, the global minimum π̃ for C◇ can always be found as a local minimum.

This convexity result is generally valid for any cost function that is convex in π. This is the case for any desired norm, not only the trace norm, but also the Frobenius norm, or any Schatten p-norm. It also applies to the relative entropy. Furthermore, the result can also be extended to any convex parametrization of the program states.

When dealing with convex optimization with respect to positive operators, the standard approach is to map the problem to a form that is solvable via SDP (refs. 29,30). Since the optimal program is the one minimizing the cost function, it is important to write the computation of the cost function itself as a minimization. For the case of the diamond distance, this can be achieved by using the dual formulation29. More precisely, consider the linear map Ω_π := E − E_π with Choi matrix χ_{Ω_π} = χ_E − χ_π = χ_E − Λ(π), and the spectral norm ||O||∞ := max{||Ou|| : u ∈ C^d, ||u|| ≤ 1}, which is the maximum eigenvalue of √(O†O). Then, by the strong duality of the diamond norm, C◇(π) = ||Ω_π||◇ is given by the SDP (ref. 30)

  Minimize 2||Tr₂ Z||∞,
  Subject to Z ≥ 0 and Z ≥ d(χ_E − Λ(π)).  (11)

The importance of the above dual formulation is that the diamond distance is a minimization, rather than a maximization over a set of matrices. In order to find the optimal program π̃, we apply the unique minimization of Eq. (11), where π is variable and satisfies the additional constraints π ≥ 0 and Tr(π) = 1.

In the methods section, we show that other cost functions, such as C₁ and C_F, can also be written as SDPs. Correspondingly, the optimal programs π̃ can be obtained by numerical SDP solvers. Most numerical packages implement second-order algorithms, such as the interior point method31. However, second-order

methods tend to be computationally heavy for large problem sizes32,33, namely when π̃ contains many qudits. In the following section, we introduce first-order methods, which are better suited for larger program states. It is important to remark that there also exist zeroth-order (derivative-free) methods, such as the simultaneous perturbation stochastic approximation method34, which was utilized for a quantum problem in ref. 35. However, it is known that zeroth-order methods normally have slower convergence times compared to first-order methods36.

Gradient-based optimization
In ML applications, where a large amount of data is commonly available, there have been several works that study the minimization of suitable matrix norms for different purposes17,18,37,38. First-order methods are preferred for large-dimensional problems, as they are less computationally intensive and require less memory. Here, we show how to apply first-order (gradient-based) algorithms, which are widely employed in ML applications, to find the optimal quantum program. For this purpose, we need to introduce the subgradient of the cost function C at any point π ∈ S, which is the set

  ∂C(π) = {Z : C(σ) − C(π) ≥ Tr[Z(σ − π)], ∀σ ∈ S},  (12)

where Z is Hermitian39,40. If C is differentiable, then ∂C(π) contains a single element: its gradient ∇C(π). We explicitly compute this gradient for an arbitrary programmable quantum processor (1), whose Choi matrix χ_{E_π} ≡ χ_π = Λ(π) can be written as a quantum channel Λ that maps a generic program state to the processor's Choi matrix. This map can be defined by its Kraus decomposition Λ(π) = Σ_k A_k π A_k† for some operators A_k. In fact, let us call Λ*(ρ) = Σ_k A_k† ρ A_k the dual map; then in the methods section we prove the following

Theorem 2: Consider an arbitrary quantum channel E with Choi matrix χ_E that is simulated by a quantum processor Q with map Λ(π) = χ_π (and dual map Λ*). Then, we may write the following gradients for the trace distance cost C₁(π) and the infidelity cost C_F(π):

  ∇C₁(π) = Σ_k sign(λ_k) Λ*(P_k),  (13)

  ∇C_F(π) = −2 √(1 − C_F(π)) ∇F(π),  (14)

  ∇F(π) = (1/2) Λ*[√χ_E (√χ_E Λ(π) √χ_E)^(−1/2) √χ_E],  (15)

where λ_k (P_k) are the eigenvalues (eigenprojectors) of the Hermitian operator χ_π − χ_E. When C₁(π) or C_F(π) are not differentiable in π, the above expressions provide an element of the subgradient ∂C(π).

Once we have the (sub)gradient of the cost function C, we can solve the optimization min_{π∈S} C(π) using the projected subgradient method14,39. Let P_S be the projection onto the set of program states S, namely P_S(X) = argmin_{π∈S} ||X − π||₂, which we show to be computable from the spectral decomposition of any Hermitian X (see Theorem 3 in the "Methods" section). Then, we iteratively apply the steps

  1) Select an operator g_i from ∂C(π_i);
  2) π_{i+1} = P_S(π_i − α_i g_i),  (16)

where i is the iteration index, α_i is the so-called "learning rate", and Theorem 2 can be employed to find g_i at each step. It is simple to show that π_i converges to the optimal program state π̃ in O(ϵ⁻²) steps, for any desired precision ϵ such that |C(π) − C(π̃)| ≤ ϵ. Another approach is the conjugate gradient method15,39, sometimes called the Frank–Wolfe algorithm. Here, we apply

  1) Find the eigenvector |σ_i⟩ of ∇C(π_i) with the smallest eigenvalue;
  2) π_{i+1} = [i/(i + 2)] π_i + [2/(i + 2)] |σ_i⟩⟨σ_i|.  (17)

When the gradient of the cost function is Lipschitz continuous with constant L, the method converges after O(L/ϵ) steps16,41. To justify the applicability of this method, a suitable smoothening of the cost function must be employed. The downside of the conjugate gradient method is that it necessarily requires a differentiable cost function C, with gradient ∇C. Specifically, this may create problems for the trace distance cost C₁, which is generally non-smooth. A solution to this problem is to define the cost function in terms of the smooth trace distance C_μ(π) = Tr[h_μ(χ_π − χ_E)], where h_μ is the so-called Huber penalty function h_μ(x) := x²/(2μ) if |x| < μ and |x| − μ/2 if |x| ≥ μ. This quantity satisfies C_μ(π) ≤ C₁(π) ≤ C_μ(π) + μd/2 and is a convex function over program states, with gradient ∇C_μ(π) = Λ*[h_μ′(χ_π − χ_E)].

Learning of arbitrary unitaries
One specific application is the simulation of quantum gates or, more generally, unitary transformations22,42–45. Here, the infidelity provides the most convenient cost function, as the optimal program can be found analytically. In fact, suppose we use a quantum processor with map Λ to simulate a target unitary U. Because the Choi matrix of U is pure, χ_U = |χ_U⟩⟨χ_U|, we first note that F(π)² = ⟨χ_U|Λ(π)|χ_U⟩, and then we see that Eq. (15) drastically simplifies to ∇F(π) = Λ*(|χ_U⟩⟨χ_U|)/[2F(π)]. As a result, we find

  ∇C_F(π) = −Λ*[|χ_U⟩⟨χ_U|],  (18)

where there is no dependence on π. Therefore, using the conjugate gradient method in Eq. (17), we see that the optimal program state π̃ for the infidelity cost function C_F is a fixed point of the iteration and is equal to the maximum eigenvector of Λ*[|χ_U⟩⟨χ_U|].

Teleportation processor
Once we have shown how to optimize a generic programmable quantum processor, we discuss some specific designs, over which we will test the optimization procedure. One possible (shallow) design for the quantum processor Q is a generalized teleportation protocol46 over an arbitrary program state π. In dimension d, the protocol involves a basis of d² maximally entangled states |Φ_i⟩ and a basis {U_i} of teleportation unitaries such that Tr(U_i†U_j) = d δ_ij (ref. 47). An input d-dimensional state ρ and the A part of the program π_AB are subject to the projector |Φ_i⟩⟨Φ_i|. The classical outcome i is communicated to the B part of π_AB, where the correction U_i⁻¹ is applied. The above procedure defines the teleportation channel E_π over ρ:

  E_π^tele(ρ) = Σ_i U_i^B ⟨Φ_i^SA| (ρ_S ⊗ π_AB) |Φ_i^SA⟩ (U_i^B)†.  (19)

Its Choi matrix can be written as χ_π = Λ_tele(π), where the map of the teleportation processor is equal to

  Λ_tele(π) = d⁻² Σ_i (U_i* ⊗ U_i) π (U_i* ⊗ U_i)†,  (20)

which is clearly self-dual, Λ* = Λ. Given a target quantum channel E which is teleportation covariant23,24, namely when its Choi matrix satisfies [χ_E, U_i* ⊗ U_i] = 0 for all i, then we know that its simulation is perfect and the optimal program π̃ is the channel's Choi matrix, i.e., one of the fixed points of the map Λ_tele. For a general channel, the optimal program π̃ can be approximated by using the cost functions in our Theorem 2 with Λ given in Eq. (20), or directly found by optimizing C◇(π).

Port-based teleportation
A deeper design is provided by a PBT processor, whose overall protocol is illustrated in Fig. 2. Here, we consider a more general formulation of the original PBT protocol19,20, where the resource entangled pairs are replaced by an arbitrary program state π. In a PBT processor, each party has N systems (or "ports"), A = {A₁, …, A_N} for Alice and B = {B₁, …, B_N} for Bob. These are prepared in a program state π_AB. To teleport an input state ρ_C, Alice performs a joint positive operator-valued measurement (POVM) {Π_i} (ref. 19) on system C and the A-ports. She then communicates the outcome i to Bob, who discards all ports except B_i, which is the output B_out. The resulting PBT channel P_π : H_C → H_{B_out} is then

  P_π(ρ_C) = Σ_{i=1}^N Tr_{A B̄_i C}[√Π_i (π_AB ⊗ ρ_C) √Π_i]_{B_i → B_out}
           = Σ_{i=1}^N Tr_{A B̄_i C}[Π_i (π_AB ⊗ ρ_C)]_{B_i → B_out},  (21)

where B̄_i := B\B_i = {B_k}_{k≠i}.

Fig. 2: PBT scheme. Two distant parties, Alice and Bob, share N maximally entangled pairs {A_k, B_k}_{k=1}^N. Alice also has another system C in the state |ψ⟩. To teleport C, Alice performs the POVM {Π_i^AC} on all her local systems A = {A_k}_{k=1}^N and C. She then communicates the outcome i to Bob. Bob discards all his systems B = {B_k}_{k=1}^N with the exception of B_i. After these steps, the state |ψ⟩ is approximately teleported to B_i. Similarly, an arbitrary channel E is simulated with N copies of the Choi matrix χ_E^{A_k B_k}. The figure shows an example with N = 5, where i = 4 is selected.

In the standard PBT protocol19,20, the program state is fixed as π_AB = ⊗_{k=1}^N |Φ⟩_{A_k B_k}, where the |Φ⟩_{A_k B_k} are Bell states, and the following POVM is used:

  Π̃_i = Π_i + N⁻¹ (1 − Σ_k Π_k),  (22)

where

  Π_i := σ_AC^(−1/2) Φ_{A_i C} σ_AC^(−1/2),  (23)

  σ_AC := Σ_{i=1}^N Φ_{A_i C},  (24)

and σ^(−1/2) is an operator defined only on the support of σ. The PBT protocol is formulated for N ≥ 2 ports. However, we also include here the trivial case N = 1, corresponding to the process where Alice's input is traced out and the output is the reduced state of Bob's port, i.e., a maximally mixed state. In the limit N → ∞, the standard PBT protocol approximates an identity channel P_π(ρ) = ρ, with fidelity F_π ≥ 1 − O(1/N) (refs. 19,21), so perfect simulation is possible only in the limit N → ∞. Since the standard PBT protocol provides an approximation to the identity channel, we call it I_N.

From the PBT simulation of the identity channel, it is possible to approximate any general channel E by noting that E can be written as a composition E = E ∘ I, where I is the identity channel. This is done by replacing the identity channel I with its PBT simulation I_N, and then applying E to B_i. However, since Bob does not perform any post-processing on his systems B, aside from discarding all ports B_k with k ≠ i, he can also apply first the channel E to all his ports and then discard all the ports B_k with k ≠ i. In doing so, he changes the program state to

  π_AB = ⊗_{k=1}^N (I_{A_k} ⊗ E_{B_k})[Φ_{A_k B_k}] = χ_E^⊗N.  (25)

In other terms, any channel E can be PBT approximated by N copies of its Choi matrix χ_E as program state. Since the PBT simulation E_N can be decomposed as E_N = E ∘ I_N, the error C◇ = ||E − E_N||◇ in simulating the channel E satisfies

  C◇ = ||E ∘ I − E ∘ I_N||◇ ≤ ||I − I_N||◇ ≤ 2d(d − 1)N⁻¹,  (26)

where we used the data processing inequality and an upper bound from ref. 48. While the channel's Choi matrix assures that C◇ → 0 for large N, for any finite N it does not represent the optimal program state. In general, for any finite N, finding the optimal program state π simulating a channel E with PBT is an open problem, and no explicit solutions or procedures are known.

We employ our convex optimization procedures to find the optimal program state. This can be done either exactly, by minimizing the diamond distance cost function C◇ via SDP, or approximately, by determining the optimal program state via the minimization of the trace distance cost function C₁ via either SDP or the gradient-based techniques discussed above. For this second approach, we need to derive the map Λ of the PBT processor, from the program state π to the output Choi matrix as in Eq. (5). To compute the Choi matrix and CP-map Λ, we consider an input maximally entangled state Φ_DC and a basis {|e_j⟩} of A{B\B_i}C. Then, by using Eq. (21) and the definition Λ(π) = χ_π = (1_D ⊗ P_π)[Φ_DC], we find the map Λ : H_AB → H_{D B_out} of a PBT processor

  Λ(π) = Σ_{ij} K_ij π K_ij†,  K_ij = ⟨e_j| (√Π_i ⊗ 1_BD) |Φ_DC⟩.  (27)

Note that a general program state for PBT consists of 2N qudits, and hence the parameter space has exponential size d^(4N). However, because the PBT protocol is symmetric under permutation of port labels, we show in Supplementary Note 6 that one can exploit this symmetry and reduce the number of free parameters to the binomial coefficient C(N + d⁴ − 1, d⁴ − 1), which is polynomial in the number of ports N. Despite this exponential reduction, the scaling in the number of parameters still represents a practical limiting factor, even for qubits, for which it is O(N¹⁵). A suboptimal strategy consists in reducing the space of program states to a convex set that we call the "Choi space" C. Consider an arbitrary probability distribution {p_k} and then define

  C := {π = Σ_k p_k (ρ_AB^k)^⊗N, Tr_B(ρ_AB^k) = d⁻¹ 1}.  (28)

One can show (see Supplementary Note 6) that a global minimum in C is a global minimum in the extremal (non-convex) subspace with p_k = δ_{k,1}, consisting of tensor products of Choi matrices ρ_AB^⊗N. Among these states, there is the N-copy Choi matrix of the target channel, χ_E^⊗N = {[I ⊗ E](|Φ⟩⟨Φ|)}^⊗N, which is not necessarily the optimal program, as we show below.

Parametric quantum circuits
Another deep design of quantum processor is based on PQCs (refs. 22,49). A PQC is a sequence of unitary matrices

U(t) = U_N(t_N) ⋯ U₂(t₂)U₁(t₁), where U_j(t_j) = exp(−i t_j H_j) for some Hamiltonian H_j and time interval t_j. The problem with PQCs is that the cost functions in the classical parameters42 are not convex, so that numerical algorithms are not guaranteed to converge to the global optimum. Here, we fix this issue by introducing a convex formulation of PQCs, where classical parameters are replaced by a quantum program. This results in a programmable PQC processor that is optimizable by our methods.

The universality of PQCs can be employed for universal channel simulation. Indeed, thanks to Stinespring's dilation theorem, any channel can be written as a unitary evolution on a bigger space, E(ρ_A) = Tr_{R₀}[U(ρ_A ⊗ θ₀)U†], where the system is paired to an extra register R₀ and θ₀ belongs to R₀. In the Stinespring representation, U acts on system A and register R₀. In ref. 49, it has been shown that sequences of two unitaries, U₀ and U₁, are almost universal for simulation, i.e., any target unitary U can be approximated as U ≈ ⋯ U₁^(m₄) U₀^(m₃) U₁^(m₂) U₀^(m₁) for some integers m_j. Under suitable conditions, it takes O(d² ϵ^(−d)) steps to approximate U up to precision ϵ. The choice between U₀ and U₁ is done by measuring a classical bit. We may introduce a quantum version, where the two different unitaries U₀ = e^(iH₀) or U₁ = e^(iH₁) are chosen depending on the state of qubit R_j. This results in the conditional gate

  Û_j = exp[iH₀ ⊗ |0⟩_j⟨0| + iH₁ ⊗ |1⟩_j⟨1|].  (29)

Channel simulation is then obtained by replacing the unitary evolution U in the Stinespring dilation via its simulation. The result is illustrated in Fig. 3, where the program state π is defined over R = (R₀, …, R_N) and each Û_j acts on the input system A and two ancillary qubits R₀ and R_j. Following the universality construction of ref. 49, we show in the Supplementary Note 3.4 that the channel shown in Fig. 3 provides a universal processor. Moreover, the channel Λ that maps any program π to the processor's Choi matrix is obtained as

  Λ(π) = Tr_R[Û_AR (Φ_BA ⊗ π_R) Û_AR†],  (30)

where Û_AR = ∏_{j=1}^N (1_B ⊗ Û_j)_{A,R₀,R_j}, from which we can identify the optimal program π̃ via our methods.

PQCs are not inherently monotonic. A deeper (higher N) design may simulate a given channel worse than a more shallow design. We can design a modified PQC that is monotonic by design, which we designate a "monotonic PQC" (mPQC), by replacing the qubits in our program state with qutrits, and modifying Eq. (29) to read

  Û_j = exp[iH₀ ⊗ |0⟩_j⟨0| + iH₁ ⊗ |1⟩_j⟨1| + 0 ⊗ |2⟩_j⟨2|],  (31)

where 0 is a zero operator, so that gate j enacts the identity channel if program qutrit j is in the state |2⟩⟨2|. Then, if it were the case that a PQC with N program qubits could simulate a given channel better than one with N + m, an mPQC with N + m qutrits in the program state could perform at least as well as the PQC with N program qubits by setting the first m qutrits to |2⟩⟨2|. This processor design is both universal and monotonic. More precisely, let C(PQC_N) denote the value of a cost function C for simulating a channel E with an N-gate PQC, using the optimal program state, and let C(mPQC_N) denote the value of C for simulating E with an N-gate mPQC, again using the optimal program state. We are then guaranteed that

  C(mPQC_N) ≤ min_{M ≤ N} C(PQC_M).  (32)

Processor benchmarking
In order to show the performance of the various architectures, we consider the simulation of an amplitude damping channel with probability p. The reason is that this is the most difficult channel to simulate, with a perfect simulation only known for infinite dimension, e.g., using continuous-variable quantum operations23. In Figs. 4 and 5, we compare teleportation-based, PBT, PQC, and mPQC programmable processors whose program states have been optimized according to the cost functions C◇ and C₁. For the PBT processor, the trace distance cost C₁ is remarkably close to C◇ and allows us to easily explore high depths. Note that the optimal program states differ from the naive choice of the Choi matrix of the target channel. Note too that PQC processors display non-monotonic behavior when simulating amplitude damping channels, meaning that shallow PQC processors (e.g., for N = 4) may perform better than deeper processors. For the PQC processor, we use the universal Hamiltonians H₀ = √2 (X⊗Y − Y⊗X) and H₁ = (√2 Z + √3 Y + √5 X) ⊗ (Y + √2 Z), where X, Y, and Z are Pauli operators. mPQC processors guarantee that deeper designs always perform at least as well as any shallower design. In Fig. 5, perfect simulation is achievable at specific values of p because of our choice of the universal gates U₀ and U₁. More details are provided in the Supplementary Note 3.

Many other numerical simulations are performed in the Supplementary Note 3, where we study the convergence rate in learning a unitary operation, the exact simulation of Pauli

Fig. 3: PQC scheme. Thanks to the Stinespring decomposition, a quantum channel is written as a unitary evolution in an extended space. The corresponding unitary is then simulated via a parametric quantum circuit that, at each time, applies a certain unitary depending on the state of the quantum register.

Fig. 4: Diamond distance error C◇ in simulating an amplitude damping channel E_p at various damping rates p. We compare the performance of different designs for the programmable quantum processor: standard teleportation and PBT with N ports (PBT_N). The optimal program π̃ is obtained by either minimizing directly the diamond distance C◇ (solid lines), or the trace distance C₁ (dashed lines) via the projected subgradient iteration. In both cases, from π̃ we then compute C◇(π̃). The lowest curves are obtained by optimizing π over the Choi space in Eq. (28). For comparison, we also show the (non-optimal) performance when the program is the channel's Choi matrix (dotted lines).
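The cost functions compared in Fig. 4 are related by the chain of Eq. (6): dividing its second inequality by d gives the Fuchs–van de Graaf-type bound C₁(π) ≤ 2√(C_F(π)) at the level of Choi matrices. This bound is easy to verify numerically; a minimal sketch (our own illustration, not the authors' code; random two-qubit density matrices stand in for Choi matrices):

```python
import numpy as np

rng = np.random.default_rng(1)

def rand_state(d):
    """Random density matrix from a complex Ginibre matrix."""
    G = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
    rho = G @ G.conj().T
    return rho / np.trace(rho).real

def sqrtm_psd(rho):
    """Matrix square root of a positive semidefinite matrix via eigendecomposition."""
    lam, V = np.linalg.eigh(rho)
    return V @ np.diag(np.sqrt(np.clip(lam, 0, None))) @ V.conj().T

def trace_norm(M):
    return np.sum(np.linalg.svd(M, compute_uv=False))

def fidelity(rho, sigma):
    """Bures fidelity, Eq. (9): || sqrt(rho) sqrt(sigma) ||_1."""
    return trace_norm(sqrtm_psd(rho) @ sqrtm_psd(sigma))

ok = True
for _ in range(100):
    chi1, chi2 = rand_state(4), rand_state(4)
    C1 = trace_norm(chi1 - chi2)             # Eq. (7)
    CF = 1 - fidelity(chi1, chi2) ** 2       # Eq. (8)
    ok = ok and (C1 <= 2 * np.sqrt(CF) + 1e-9)  # Eq. (6), second inequality / d
print(ok)
```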

The program state generated by the server closely approximates the overall quantum channel to be applied to the input. The server then accepts the input from the client, processes it, and returns the output together with the value of a cost function quantifying how close the computation was with respect to the client's request.

Our results may also be useful in areas beyond quantum computing, wherever channel simulation is a basic problem. For instance, this is the case when we investigate the ultimate limits of quantum communications24, design optimal Hamiltonians for one-way quantum repeaters, and for all those areas of quantum sensing, hypothesis testing and metrology that are based on quantum channel simulations51. Indeed, the study of adaptive protocols of quantum channel discrimination (or estimation) is notoriously difficult, and their optimal performance is not completely understood. Nonetheless, these protocols can be analyzed by using simulation techniques48,51 where the channel, encoding the unknown parameter, is replaced by an approximate simulating channel, and its parameter is mapped into the label of a program state (therefore reducing the problem from channel to state discrimination/estimation). In this regard, our theory provides the optimal solution to this basic problem, by determining the best simulating channel and the corresponding program state.

Fig. 5 Diamond distance error C◇ in simulating an amplitude damping channel E_p at various damping rates p. We compare the performance of two different designs for the programmable quantum processor: PQCs with N + 1 registers (PQC_N) and monotonic parametric quantum circuits with N + 1 registers (mPQC_N). In both cases, the optimal program |π̃⟩ is obtained by minimizing the diamond distance C◇.
In particular, we study the performance of the approximate solution when optimizing over larger, but easier-to-compute, cost functions, such as the trace distance or the infidelity.

DISCUSSION
In this work, we have considered a general finite-dimensional model of a programmable quantum processor, which is a fundamental scheme for quantum computing and also a primitive tool for other areas of quantum information. By introducing suitable cost functions, based on the diamond distance, trace distance and quantum fidelity, we have shown how to characterize the optimal performance of this processor in the simulation of an arbitrary quantum gate or channel. In fact, we have shown that the minimization of these cost functions is a convex optimization problem that can always be solved. In particular, by minimizing the diamond distance via SDP, we can always determine the optimal program state for the simulation of an arbitrary channel.

Alternatively, we may minimize the simpler but larger cost functions in terms of trace distance and quantum fidelity via gradient-based methods adapted from ML, so as to provide a very good approximation of the optimal program state. This other approach can also provide closed analytical solutions, as is the case for the simulation of arbitrary unitaries, for which the minimization of the fidelity cost function corresponds to computing an eigenvector.

We have then applied our results to various designs of programmable quantum processor, from a shallow teleportation-based scheme to deeper and asymptotically universal designs that are based on PBT and PQCs. We have explicitly benchmarked the performances of these quantum processors by considering the simulation of unitary gates, depolarizing and amplitude damping channels, showing that the optimal program states may differ from the naive choice based on the Choi matrix of the target channel. Moreover, our results can also be applied to universal quantum measurements50.

A potential application of our work may be the development of a "programmable" model of cloud-based quantum computation, where a client has an input state to be processed by an online quantum server that is equipped with a programmable quantum processor. The client classically informs the server about what type of computation it needs (e.g., some specified quantum algorithm) and the server generates an optimal program state whose induced operation closely approximates the requested computation.

METHODS
Convexity proofs
In this section, we provide a proof of Theorem 1, namely we show that the minimization of the main cost functions C◇, C_1, and C_F is a convex optimization problem in the space of the program states π. This means that we can find the optimal program state π̃ by minimizing C◇ or, alternatively, suboptimal program states can be found by minimizing either C_1 or C_F. For the sake of generality, we prove the result for all of the cost functions discussed in the previous sections. We restate Theorem 1 below for completeness:

Theorem. The minimization of the generic cost function C = C◇, C_1, C_F, C_R, or C_p for any p > 1 is a convex optimization problem in the space of program states.

Proof. Let us start by showing the result for the diamond distance C◇. In this case, we can write

  C◇[pπ + (1−p)π′] := ‖ E − E_{pπ+(1−p)π′} ‖◇
    =(1) ‖ (p + 1 − p)E − pE_π − (1−p)E_{π′} ‖◇
    ≤(2) ‖ pE − pE_π ‖◇ + ‖ (1−p)E − (1−p)E_{π′} ‖◇   (33)
    =(3) p ‖ E − E_π ‖◇ + (1−p) ‖ E − E_{π′} ‖◇
    = p C◇(π) + (1−p) C◇(π′),

where we use (1) the linearity of π ↦ E_π, (2) the triangle inequality, and (3) the property ‖xA‖ = |x|‖A‖, valid for any operator A and coefficient x.

For any Schatten p-norm cost C_p with p ≥ 1, we may prove convexity following a similar reasoning. Since, for any combination π := p_0π_0 + p_1π_1 with p_0 + p_1 = 1, we have Λ(π) = p_0Λ(π_0) + p_1Λ(π_1), then, by exploiting the triangle inequality and the property ‖xA‖_p = |x|‖A‖_p, we can show that

  C_p(p_0π_0 + p_1π_1) := ‖ χ_E − Λ(p_0π_0 + p_1π_1) ‖_p
    ≤ p_0 ‖ χ_E − Λ(π_0) ‖_p + p_1 ‖ χ_E − Λ(π_1) ‖_p   (34)
    = p_0 C_p(π_0) + p_1 C_p(π_1).

To show the convexity of C_F, defined in Eq. (8), we note that the fidelity function F(ρ, σ) satisfies the following concavity relation52:

  F( Σ_k p_k ρ_k , σ )² ≥ Σ_k p_k F(ρ_k, σ)².   (35)

Due to the linearity of χ_π = Λ(π), the fidelity in Eq. (9) satisfies F_π² ≥ Σ_k p_k F²_{π_k} for π := Σ_k p_k π_k. Accordingly, we get the following convexity result:

  C_F( Σ_k p_k π_k ) ≤ Σ_k p_k C_F(π_k).   (36)

For the cost function C_R, the result comes from the linearity of Λ(π) and the joint convexity of the relative entropy. In fact, for π := p_0π_0 + p_1π_1, we may write

  S[Λ(π) ‖ χ_E] = S[ p_0Λ(π_0) + p_1Λ(π_1) ‖ χ_E ]
    = S[ p_0Λ(π_0) + p_1Λ(π_1) ‖ p_0χ_E + p_1χ_E ]   (37)
    ≤ p_0 S[Λ(π_0) ‖ χ_E] + p_1 S[Λ(π_1) ‖ χ_E],

with a symmetric proof for S[χ_E ‖ Λ(π)]. This implies the convexity of C_R(π) in Eq. (10). ■

When C is not differentiable, the gradient formulas derived below still provide an element of the subgradient that can be used in the minimization algorithm. In order to compute the gradient ∇C, it is convenient to consider the Kraus decomposition of the processor map Λ. Let us write

  Λ(π) = Σ_k A_k π A_k†,   (41)

with Kraus operators A_k. We then define the dual map Λ* of the processor as the (generally non-trace-preserving) map given by the following decomposition:

  Λ*(ρ) = Σ_k A_k† ρ A_k.   (42)

With these definitions in hand, we can now prove Theorem 2, which we rewrite here for convenience.
Convex classical parametrizations
The result of Theorem 1 can certainly be extended to any convex parametrization of program states. For instance, assume that π = π(λ), where λ = {λ_i} is a probability distribution. This means that, for 0 ≤ p ≤ 1 and any two parametrizations λ and λ′, we may write

  π[pλ + (1−p)λ′] = pπ(λ) + (1−p)π(λ′).   (38)

Then the problem remains convex in λ and we may therefore find the global minimum in these parameters. It is clear that this global minimum λ̃ identifies a program state π(λ̃) that is not generally the optimal state π̃ in the entire program space S, even though the solution may be a convenient one for experimental applications.

Note that a possible classical parametrization consists of using classical program states, of the form

  π(λ) = Σ_i λ_i |φ_i⟩⟨φ_i|,   (39)

where {|φ_i⟩} is an orthonormal basis in the program space. Convex combinations of probability distributions therefore define a convex set of classical program states

  S_class := { π = Σ_i λ_i |φ_i⟩⟨φ_i| , ⟨φ_i|φ_j⟩ = δ_ij }.   (40)

Optimizing over this specific subspace corresponds to optimizing the programmable quantum processor over classical programs. It is clear that the global minima in S_class and S are expected to be very different. For instance, S_class certainly cannot include Choi matrices, which are usually very good quantum programs.

Gradient-based optimization
As discussed in the main text, the SDP formulation allows the use of powerful and accurate numerical methods, such as the interior point method. However, these algorithms are not suitable for high-dimensional problems, due to their higher computational and memory requirements. Therefore, an alternative approach (useful for larger program states) consists of the optimization of the larger but easier-to-compute cost function C = C_1 (trace distance) or C_F (infidelity), for which we can use first-order methods. Indeed, according to Theorem 1, all of the proposed cost functions C: S → R are convex over the program space S and, therefore, we can solve the optimization min_{π∈S} C(π) by using gradient-based algorithms.

Gradient-based convex optimization is at the heart of many popular ML techniques, such as online learning in a high-dimensional feature space17, missing value estimation problems18, text classification, image ranking, and optical character recognition53, to name a few. In all of the above applications, "learning" corresponds to the following minimization problem: min_{x∈S} f(x), where f(x) is a convex function and S is a convex set. Quantum learning falls into this category, as the space of program states is convex and the cost functions are typically convex in this space (see Theorem 1). Gradient-based approaches are among the most applied methods for convex optimization of non-linear, possibly non-smooth functions39.

When the cost function is not differentiable, we cannot formally define its gradient. Nonetheless, we can always define the subgradient ∂C of C as in Eq. (12), which in principle contains many points. When C is not only convex but also differentiable, then ∂C(π) = {∇C(π)}, i.e., the subgradient contains a single element, the gradient ∇C, which can be obtained via the Fréchet derivative of C (for more details, see Supplementary Note 4).

Theorem. Suppose we use a quantum processor Q with map Λ(π) = χ_π in order to approximate the Choi matrix χ_E of an arbitrary channel E. Then, the gradients of the trace distance C_1(π) and the infidelity C_F(π) are given by the following analytical formulas:

  ∇C_1(π) = Σ_k sign(λ_k) Λ*(P_k),   (43)

  ∇C_F(π) = −2 √(1 − C_F(π)) ∇F(π),   (44)

  ∇F(π) = (1/2) Λ*[ √χ_E ( √χ_E Λ(π) √χ_E )^{−1/2} √χ_E ],   (45)

where λ_k (P_k) are the eigenvalues (eigenprojectors) of the Hermitian operator χ_π − χ_E. When C_1(π) or C_F(π) are not differentiable at π, the above expressions provide an element of the subgradient ∂C(π).

Proof. We prove the above theorem assuming that the functions are differentiable at the program π. For non-differentiable points, the only difference is that the above analytical expressions are not unique and provide only one of the possibly infinitely many elements of the subgradient. Further details of this mathematical proof are given in Supplementary Note 4. Following matrix differentiation, for any function f(A) = Tr[g(A)] of a matrix A, we may write

  d Tr[g(A)] = Tr[g′(A) dA],   (46)

and the gradient is ∇f(A) = g′(A). Both the trace distance and fidelity cost functions can be written in this form. To find the explicit gradient of the fidelity function, we first note that, by linearity, we may write

  Λ(π + δπ) = Λ(π) + Λ(δπ),   (47)

and therefore the following expansion:

  √χ_E Λ(π + δπ) √χ_E = √χ_E Λ(π) √χ_E + √χ_E Λ(δπ) √χ_E.   (48)

From this equation and differential calculations of the fidelity (see Supplementary Note 4.2 for details), we find

  dF = (1/2) Tr[ ( √χ_E Λ(π) √χ_E )^{−1/2} √χ_E Λ(δπ) √χ_E ],   (49)

where dF = F(π + δπ) − F(π). Then, using the cyclic property of the trace, we get

  dF = (1/2) Tr[ Λ*( √χ_E ( √χ_E Λ(π) √χ_E )^{−1/2} √χ_E ) δπ ].   (50)

Exploiting this expression in Eq. (46), we get the gradient ∇F(π) as in Eq. (45). The other Eq. (44) simply follows from applying the definition in Eq. (8).

For the trace distance, let us write the eigenvalue decomposition

  χ_π − χ_E = Σ_k λ_k P_k.   (51)

Then, using the linearity of Eq. (47), the definition of the processor map in Eq. (5) and differential calculations of the trace distance (see Supplementary Note 4.3 for details), we can write

  dC_1(π) = Σ_k sign(λ_k) Tr[ P_k Λ(dπ) ]
    = Σ_k sign(λ_k) Tr[ Λ*(P_k) dπ ]   (52)
    = Tr{ Λ*[ sign(χ_π − χ_E) ] dπ }.
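The trace-distance gradient formula can be sanity-checked numerically: since C_1 is convex, the operator Λ*[sign(χ_π − χ_E)] must be a valid subgradient, i.e., it must satisfy C_1(σ) ≥ C_1(π) + Tr[g(σ − π)] for every state σ. A toy single-qubit sketch (stand-in Kraus operators and helper names are ours):

```python
import numpy as np

rng = np.random.default_rng(2)

def rand_state(d):
    A = rng.standard_normal((d, d)) + 1j * rng.standard_normal((d, d))
    rho = A @ A.conj().T               # positive semidefinite
    return rho / np.trace(rho)

# Toy processor map Λ and its dual Λ* (amplitude-damping Kraus pair)
p = 0.25
K = [np.array([[1, 0], [0, np.sqrt(1 - p)]]),
     np.array([[0, np.sqrt(p)], [0, 0]])]
Lam  = lambda X: sum(A @ X @ A.conj().T for A in K)
Lams = lambda X: sum(A.conj().T @ X @ A for A in K)

def sign_herm(X):
    """Operator sign function of a Hermitian X, via eigendecomposition."""
    w, V = np.linalg.eigh(X)
    return (V * np.sign(w)) @ V.conj().T

def C1(pi, chiE):
    """Trace-distance cost ||Λ(π) − χ_E||_1."""
    return np.abs(np.linalg.eigvalsh(Lam(pi) - chiE)).sum()

chiE, pi = rand_state(2), rand_state(2)
grad = Lams(sign_herm(Lam(pi) - chiE))      # subgradient, Eqs. (43)/(52)

# Subgradient inequality: C1(σ) ≥ C1(π) + Tr[grad (σ − π)] for any state σ
for _ in range(5):
    sigma = rand_state(2)
    gap = C1(sigma, chiE) - C1(pi, chiE) - np.trace(grad @ (sigma - pi)).real
    assert gap >= -1e-10
```

Note that the inequality holds even at non-differentiable points, where the formula returns just one element of ∂C_1(π).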

From the definition of the gradient in Eq. (46), we finally get

  ∇C_1(π) = Λ*[ sign(χ_π − χ_E) ],   (53)

which leads to the result in Eq. (43). ■

The above results in Eqs. (43) and (44) can be used together with the projected subgradient method14 or the conjugate gradient algorithm15,16 to iteratively find the optimal program state in the minimization min_{π∈S} C(π) for C = C_1 or C_F. In the following sections, we present the two algorithms and show how they can be adapted to our problem.

Projected subgradient methods have the advantage of simplicity and the ability to optimize non-smooth functions, but can be slower, with a convergence rate O(ϵ^{−2}) for a desired accuracy ϵ. Conjugate gradient methods15,16 have a faster convergence rate O(ϵ^{−1}), provided that the cost function is smooth. This convergence rate can be improved even further to O(ϵ^{−1/2}) for strongly convex functions54 or using Nesterov's accelerated gradient method39,41. The technical difficulty in adapting these methods to learning program states comes from the fact that the latter is a constrained optimization problem, namely at each iteration step the optimal program must be a proper quantum state, and the cost functions coming from quantum information theory are, in general, non-smooth.

Projected subgradient method
Given the space S of program states, let us define the projection P_S onto S as

  P_S(X) := argmin_{π∈S} ‖X − π‖_2,   (54)

where argmin is the argument of the minimum, namely the closest state π∈S to the operator X. Then, a first-order algorithm to solve min_{π∈S} C(π) is to apply the projected subgradient method14,39. For learning quantum programs, this is the iteration (17), which we rewrite below for convenience:

  1) Select an operator g_i from ∂C(π_i);
  2) Update π_{i+1} = P_S(π_i − α_i g_i),   (55)

where i is the iteration index and α_i is a learning rate. The above algorithm differs from standard gradient methods in two aspects: (i) the update rule is based on the subgradient, which is defined even for non-smooth functions; (ii) the operator π_i − α_i g_i is generally not a quantum state, so the algorithm fixes this issue by projecting that operator back to the closest quantum state, via Eq. (54). The algorithm converges to the optimal solution π* (approximating the optimal program π̃) as14

  min_{k≤i} C(π_k) − C(π*) ≤ ( e_1 + G² Σ_{k=1}^i α_k² ) / ( 2 Σ_{k=1}^i α_k ),   (56)

where e_1 = ‖π_1 − π*‖_2² is the initial error (in Frobenius norm) and G is such that ‖g‖_2 ≤ G for any g ∈ ∂C. Popular choices for the learning rate that assure convergence are α_k ∝ 1/√k and α_k = a/(b + k) for some a, b > 0.

In general, the projection step is the major drawback, which often limits the applicability of the projected subgradient method to practical problems. Indeed, projections like Eq. (54) require another full optimization at each iteration that might be computationally intensive. Nonetheless, we show in the following theorem that this issue does not occur in learning quantum states, because the resulting optimization can be solved analytically.

Theorem 3. Let X be a Hermitian operator in a d-dimensional Hilbert space with spectral decomposition X = UxU†, where the eigenvalues x_j are in decreasing order. Then P_S(X) of Eq. (54) is given by

  P_S(X) = UλU†,  λ_i = max{ x_i − θ, 0 },   (57)

where θ = (1/s)( Σ_{j=1}^s x_j − 1 ) and

  s = max{ k ∈ [1, …, d] : x_k > (1/k)( Σ_{j=1}^k x_j − 1 ) }.   (58)

Proof. Any quantum (program) state can be written in the diagonal form π = VλV†, where V is a unitary matrix and λ is the vector of eigenvalues in decreasing order, with λ_j ≥ 0 and Σ_j λ_j = 1. To find the optimal state, it is required to find both the optimal unitary V and the optimal eigenvalues λ with the above property, i.e.,

  P_S(X) = argmin_{V,λ} ‖X − VλV†‖_2.   (59)

For any unitarily invariant norm, the following inequality holds (ref. 55, Eq. IV.64):

  ‖X − π‖_2 ≥ ‖x − λ‖_2,   (60)

with equality when U = V, where X = UxU† is a spectral decomposition of X such that the x_j's are in decreasing order. This shows that the optimal unitary in Eq. (59) is the diagonalization matrix of the operator X. The eigenvalues of any density operator form a probability simplex; the optimal eigenvalues λ are then obtained thanks to Algorithm 1 from ref. 17. ■

In the following section, we present an alternative algorithm with faster convergence rates, but stronger requirements on the function to be optimized.

Conjugate gradient method
The conjugate gradient method15,16, sometimes called the Frank–Wolfe algorithm, has been developed to provide a better convergence speed and to avoid the projection step at each iteration. Although the latter can be explicitly computed for quantum states (thanks to our Theorem 3), having a faster convergence rate is important, especially with higher dimensional Hilbert spaces. The downside of this method is that it necessarily requires a differentiable cost function C, with gradient ∇C.

In its standard form, the conjugate gradient method to approximate the solution of argmin_{π∈S} C(π) is defined by the following iterative rule:

  1) Find σ_i = argmin_{σ∈S} Tr[σ ∇C(π_i)];
  2) π_{i+1} = π_i + (2/(i+2)) (σ_i − π_i).   (61)

The first step in the above iteration rule is solved by finding the eigenvector |σ_i⟩ with the smallest eigenvalue of ∇C(π_i). Indeed, since π is an operator and C(π) a scalar, the gradient ∇C is an operator with the same dimension as π. Therefore, we find the iteration (16), which we rewrite below for convenience:

  1) Find the eigenvector |σ_i⟩ with the smallest eigenvalue of ∇C(π_i);
  2) π_{i+1} = π_i + (2/(i+2)) ( |σ_i⟩⟨σ_i| − π_i ).   (62)

When the gradient of C is Lipschitz continuous with constant L, the conjugate gradient method converges after O(L/ϵ) steps16,41. The following iteration with adaptive learning rate α_i has even faster convergence rates, provided that C is strongly convex54:

  1) Find the eigenvector |σ_i⟩ with the smallest eigenvalue of ∇C(π_i);
  2) Find α_i = argmin_{α∈[0,1]} [ α ⟨τ_i, ∇C(π_i)⟩ + (α²/2) β_C ‖τ_i‖_C² ], for τ_i = |σ_i⟩⟨σ_i| − π_i;   (63)
  3) π_{i+1} = (1 − α_i) π_i + α_i |σ_i⟩⟨σ_i|,

where the constant β_C and the norm ‖·‖_C depend on C (ref. 54).

In spite of the faster convergence rate, conjugate gradient methods require smooth cost functions (so that the gradient ∇C is well defined at every point). However, cost functions based on the trace distance (7) are not smooth. For instance, the trace distance in one-dimensional spaces reduces to the absolute value function |x|, which is non-analytic at x = 0. When some eigenvalues are close to zero, conjugate gradient methods may display unexpected behaviors, though we have numerically observed that convergence is always obtained with a careful choice of the learning rate. In the next section, we show how to formally justify the applicability of the conjugate gradient method, following Nesterov's smoothing prescription41.

Smoothing: smooth trace distance
The conjugate gradient method converges to the global optimum after O(L/ϵ) steps, provided that the gradient of C is L-Lipschitz continuous41. However, the constant L can diverge for non-smooth functions like the trace distance (7), so the convergence of the algorithm cannot be formally stated, although it may still be observed in numerical simulations. To solidify the convergence proof (see also Supplementary Note 5.2), we introduce a smooth approximation to the trace distance. This is defined by the following cost function, which is differentiable at every point:

  C_μ(π) = Tr[ h_μ(χ_π − χ_E) ] = Σ_j h_μ(λ_j),   (64)

where the λ_j are the eigenvalues of χ_π − χ_E and h_μ is the so-called Huber penalty function.
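The projection of Theorem 3 is fully explicit: diagonalize X, project the spectrum onto the probability simplex (Algorithm 1 of ref. 17), and rotate back. A minimal numpy sketch (function name is ours):

```python
import numpy as np

def project_to_states(X):
    """Projection P_S(X) of a Hermitian X onto density matrices, Eqs. (57)-(58).

    The eigenvalues are shifted by a common threshold θ and clipped at zero
    (Euclidean projection onto the probability simplex); the eigenvectors of
    X are kept unchanged.
    """
    x, U = np.linalg.eigh(X)
    x, U = x[::-1], U[:, ::-1]                   # decreasing order
    csum = np.cumsum(x)
    ks = np.arange(1, len(x) + 1)
    s = np.max(ks[x > (csum - 1) / ks])          # Eq. (58)
    theta = (csum[s - 1] - 1) / s                # θ = (Σ_{j≤s} x_j − 1)/s
    lam = np.maximum(x - theta, 0)               # Eq. (57)
    return (U * lam) @ U.conj().T

rng = np.random.default_rng(3)
A = rng.standard_normal((4, 4)) + 1j * rng.standard_normal((4, 4))
X = (A + A.conj().T) / 2                         # generic Hermitian operator
pi = project_to_states(X)

# The output is a valid density matrix
w = np.linalg.eigvalsh(pi)
assert np.all(w >= -1e-12) and np.isclose(w.sum(), 1.0)
```

Since the projection is a closed-form eigendecomposition plus an O(d log d) thresholding, the update step (55) costs no more than one diagonalization per iteration.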

The Huber penalty function is defined as

  h_μ(x) := { x²/(2μ)    if |x| < μ,
              |x| − μ/2  if |x| ≥ μ.   (65)

The previous definition of the trace distance, C_1 in Eq. (7), is recovered for μ → 0 and, for any non-zero μ, C_μ bounds C_1 as follows:

  C_μ(π) ≤ C_1(π) ≤ C_μ(π) + μd/2,   (66)

where d is the dimension of the program state π. In Supplementary Note 5.2, we then prove the following result:

Theorem 4. The smooth cost function C_μ(π) is a convex function over program states and its gradient is given by

  ∇C_μ(π) = Λ*[ h′_μ(χ_π − χ_E) ],   (67)

where h′_μ is the derivative of h_μ. Moreover, the gradient is L-Lipschitz continuous with

  L = d/μ,   (68)

where d is the dimension of the program state.

Being Lipschitz continuous, the conjugate gradient algorithm and its variants41,54 converge up to an accuracy ϵ after O(L/ϵ) steps. In some applications, it is desirable to analyze the convergence in trace distance in the limit of large program states, namely for d → ∞. The parameter μ can be chosen such that the smooth trace distance converges to the trace distance, namely C_μ → C_1 for d → ∞. Indeed, given the inequality (66), a possibility is to set μ = O(d^{−(1+η)}) for some η > 0 so that, from Eq. (68), the convergence to the trace norm is achieved after O(d^{2+η}) steps.

DATA AVAILABILITY
The datasets generated and analyzed during the current study are available from the corresponding author on reasonable request.

CODE AVAILABILITY
The codes used for this study are available from the corresponding author on reasonable request.

Received: 21 November 2019; Accepted: 25 March 2020;

REFERENCES
1. Nielsen, M. A. & Chuang, I. L. Programmable quantum gate arrays. Phys. Rev. Lett. 79, 321 (1997).
2. Nielsen, M. A. & Chuang, I. L. Quantum Computation and Quantum Information (Cambridge University Press, Cambridge, 2000).
3. Watrous, J. The Theory of Quantum Information (Cambridge Univ. Press, 2018).
4. Knill, E., Laflamme, R. & Milburn, G. J. A scheme for efficient quantum computation with linear optics. Nature 409, 46 (2001).
5. Bishop, C. M. Pattern Recognition and Machine Learning (Springer, 2006).
6. Wittek, P. Quantum Machine Learning: What Quantum Computing Means to Data Mining (Academic Press, 2014).
7. Biamonte, J. et al. Quantum machine learning. Nature 549, 195 (2017).
8. Dunjko, V. & Briegel, H. J. Machine learning & artificial intelligence in the quantum domain: a review of recent progress. Rep. Prog. Phys. 81, 074001 (2018).
9. Schuld, M., Sinayskiy, I. & Petruccione, F. An introduction to quantum machine learning. Contemp. Phys. 56, 172–185 (2015).
10. Ciliberto, C. et al. Quantum machine learning: a classical perspective. Proc. R. Soc. A 474, 20170551 (2018).
11. Tang, E. A quantum-inspired classical algorithm for recommendation systems. In Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing (STOC 2019), 217–228 (ACM, New York, 2019). https://doi.org/10.1145/3313276.3316310
12. Tang, E. Quantum-inspired classical algorithms for principal component analysis and supervised clustering. Preprint at https://arxiv.org/abs/1811.00414 (2018).
13. Kitaev, A. Y., Shen, A. & Vyalyi, M. N. Classical and Quantum Computation, Vol 47 (American Mathematical Society, Providence, Rhode Island, 2002).
14. Boyd, S., Xiao, L. & Mutapcic, A. Subgradient methods. Lecture notes of EE392o, Stanford University, Autumn Quarter 2003–2004.
15. Jaggi, M. Convex optimization without projection steps. Preprint at https://arxiv.org/abs/1108.1170 (2011).
16. Jaggi, M. Revisiting Frank–Wolfe: projection-free sparse convex optimization. In Proceedings of the 30th International Conference on Machine Learning, Vol 28, 427 (2013).
17. Duchi, J., Shalev-Shwartz, S., Singer, Y. & Chandra, T. Efficient projections onto the l1-ball for learning in high dimensions. In Proceedings of the 25th International Conference on Machine Learning, 272–279 (ACM, 2008).
18. Liu, J., Musialski, P., Wonka, P. & Ye, J. Tensor completion for estimating missing values in visual data. IEEE Trans. Pattern Anal. Mach. Intell. 35, 208–220 (2013).
19. Ishizaka, S. & Hiroshima, T. Asymptotic teleportation scheme as a universal programmable quantum processor. Phys. Rev. Lett. 101, 240501 (2008).
20. Ishizaka, S. & Hiroshima, T. Quantum teleportation scheme by selecting one of multiple output ports. Phys. Rev. A 79, 042306 (2009).
21. Ishizaka, S. Some remarks on port-based teleportation. Preprint at https://arxiv.org/abs/1506.01555 (2015).
22. Lloyd, S. Universal quantum simulators. Science 273, 1073–1078 (1996).
23. Pirandola, S., Laurenza, R., Ottaviani, C. & Banchi, L. Fundamental limits of repeaterless quantum communications. Nat. Commun. 8, 15043 (2017).
24. Pirandola, S. et al. Theory of channel simulation and bounds for private communication. Quant. Sci. Tech. 3, 035009 (2018).
25. Nechita, I., Puchała, Z., Pawela, Ł. & Życzkowski, K. Almost all quantum channels are equidistant. J. Math. Phys. 59, 052201 (2018).
26. Fuchs, C. A. & van de Graaf, J. Cryptographic distinguishability measures for quantum-mechanical states. IEEE Trans. Info. Theory 45, 1216–1227 (1999).
27. Pinsker, M. S. Information and Information Stability of Random Variables and Processes (Holden-Day, San Francisco, 1964).
28. Carlen, E. A. & Lieb, E. H. Bounds for entanglement via an extension of strong subadditivity of entropy. Lett. Math. Phys. 101, 1–11 (2012).
29. Watrous, J. Semidefinite programs for completely bounded norms. Theory Comput. 5, 217–238 (2009).
30. Watrous, J. Simpler semidefinite programs for completely bounded norms. Chicago J. Theor. Comput. Sci. 8, 1–19 (2013).
31. Vandenberghe, L. & Boyd, S. Semidefinite programming. SIAM Rev. 38, 49–95 (1996).
32. Chao, H.-H. First-Order Methods for Trace Norm Minimization (University of California, Los Angeles, 2013).
33. Monteiro, R. D. C. First- and second-order methods for semidefinite programming. Math. Program. 97, 209–244 (2003).
34. Spall, J. C. Adaptive stochastic approximation by the simultaneous perturbation method. IEEE Trans. Automat. Contr. 45, 1839–1853 (2000).
35. Zhuang, Q. & Zhang, Z. Physical-layer supervised learning assisted by an entangled sensor network. Phys. Rev. X 9, 041023 (2019).
36. Harrow, A. & Napp, J. Low-depth gradient measurements can improve convergence in variational hybrid quantum-classical algorithms. Preprint at https://arxiv.org/abs/1901.05374 (2019).
37. Cai, J.-F., Candès, E. J. & Shen, Z. A singular value thresholding algorithm for matrix completion. SIAM J. Optimiz. 20, 1956–1982 (2010).
38. Recht, B., Fazel, M. & Parrilo, P. A. Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM Rev. 52, 471–501 (2010).
39. Nesterov, Y. Introductory Lectures on Convex Optimization: A Basic Course, Vol 87 (Springer Science & Business Media, New York, 2013).
40. Coutts, B., Girard, M. & Watrous, J. Certifying optimality for convex quantum channel optimization problems. Preprint at https://arxiv.org/abs/1810.13295 (2018).
41. Nesterov, Y. Smooth minimization of non-smooth functions. Math. Program. 103, 127–152 (2005).
42. Khaneja, N., Reiss, T., Kehlet, C., Schulte-Herbrüggen, T. & Glaser, S. J. Optimal control of coupled spin dynamics: design of NMR pulse sequences by gradient ascent algorithms. J. Magn. Reson. 172, 296–305 (2005).
43. Banchi, L., Pancotti, N. & Bose, S. Quantum gate learning in qubit networks: Toffoli gate without time-dependent control. npj Quantum Info. 2, 16019 (2016).
44. Innocenti, L., Banchi, L., Ferraro, A., Bose, S. & Paternostro, M. Supervised learning of time-independent Hamiltonians for gate design. New J. Phys. (in press); https://doi.org/10.1088/1367-2630/ab8aaf (2020).
45. Mitarai, K., Negoro, M., Kitagawa, M. & Fujii, K. Quantum circuit learning. Phys. Rev. A 98, 032309 (2018).
46. Bennett, C. H. et al. Teleporting an unknown quantum state via dual classical and Einstein–Podolsky–Rosen channels. Phys. Rev. Lett. 70, 1895 (1993).
47. Pirandola, S., Eisert, J., Weedbrook, C., Furusawa, A. & Braunstein, S. L. Advances in quantum teleportation. Nat. Photon. 9, 641–652 (2015).
48. Pirandola, S., Laurenza, R., Lupo, C. & Pereira, J. L. Fundamental limits to quantum channel discrimination. npj Quantum Info. 5, 50 (2019).
49. Lloyd, S. Almost any quantum logic gate is universal. Phys. Rev. Lett. 75, 346 (1995).

50. D'Ariano, G. M. & Perinotti, P. Efficient universal programmable quantum measurements. Phys. Rev. Lett. 94, 090401 (2005).
51. Pirandola, S., Bardhan, B. R., Gehring, T., Weedbrook, C. & Lloyd, S. Advances in photonic quantum sensing. Nat. Photon. 12, 724–733 (2018).
52. Uhlmann, A. The transition probability. Rep. Math. Phys. 9, 273–279 (1976).
53. Duchi, J., Hazan, E. & Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res. 12, 2121–2159 (2011).
54. Garber, D. & Hazan, E. Faster rates for the Frank–Wolfe method over strongly-convex sets. In Proceedings of the 32nd International Conference on Machine Learning, Vol 37, 541–549 (2015).
55. Bhatia, R. Matrix Analysis, Vol 169 (Springer Science & Business Media, New York, 2013).

ADDITIONAL INFORMATION
Supplementary information is available for this paper at https://doi.org/10.1038/s41534-020-0268-2.

Correspondence and requests for materials should be addressed to L.B.

Reprints and permission information is available at http://www.nature.com/reprints

Publisher's note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

ACKNOWLEDGEMENTS
L.B. acknowledges support by the program "Rita Levi Montalcini" for young researchers. S.P. and J.P. acknowledge support by the EPSRC via the 'UK Quantum Communications Hub' (Grants EP/M013472/1 and EP/T001011/1), and S.P. acknowledges support by the European Union via the project 'Continuous Variable Quantum Communications' (CiViQ, no 820466).

AUTHOR CONTRIBUTIONS
All authors contributed to proving the main theoretical results and to the writing of the manuscript.

COMPETING INTERESTS
The authors declare no competing interests.

This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

© The Author(s) 2020
