
ARTICLE

https://doi.org/10.1038/s41467-021-21728-w OPEN

Cost function dependent barren plateaus in shallow parametrized quantum circuits

M. Cerezo 1,2 ✉, Akira Sone 1,2, Tyler Volkoff 1, Lukasz Cincio 1 & Patrick J. Coles 1 ✉

Variational quantum algorithms (VQAs) optimize the parameters θ of a parametrized quantum circuit V(θ) to minimize a cost function C. While VQAs may enable practical applications of noisy quantum computers, they are nevertheless heuristic methods with unproven scaling. Here, we rigorously prove two results, assuming V(θ) is an alternating layered ansatz composed of blocks forming local 2-designs. Our first result states that defining C in terms of global observables leads to exponentially vanishing gradients (i.e., barren plateaus) even when V(θ) is shallow. Hence, several VQAs in the literature must revise their proposed costs. On the other hand, our second result states that defining C with local observables leads to at worst a polynomially vanishing gradient, so long as the depth of V(θ) is O(log n). Our results establish a connection between locality and trainability. We illustrate these ideas with large-scale simulations, up to 100 qubits, of a quantum autoencoder implementation.

1 Theoretical Division, Los Alamos National Laboratory, Los Alamos, NM, USA. 2 Center for Nonlinear Studies, Los Alamos National Laboratory, Los Alamos, NM, USA. ✉email: [email protected]; [email protected]


One of the most important technological questions is whether Noisy Intermediate-Scale Quantum (NISQ) computers will have practical applications1. NISQ devices are limited both in qubit count and in gate fidelity, hence preventing the use of quantum error correction.

The leading strategy to make use of these devices is variational quantum algorithms (VQAs)2. VQAs employ a quantum computer to efficiently evaluate a cost function C, while a classical optimizer trains the parameters θ of a Parametrized Quantum Circuit (PQC) V(θ). The benefits of VQAs are three-fold. First, VQAs allow for task-oriented programming of quantum computers, which is important since designing quantum algorithms is non-intuitive. Second, VQAs make up for small qubit counts by leveraging classical computational power. Third, pushing complexity onto classical computers, while only running short-depth quantum circuits, is an effective strategy for error mitigation on NISQ devices.

There are very few rigorous scaling results for VQAs (with the exception of one-layer approximate optimization3–5). Ideally, in order to reduce the gate overhead that arises when implementing on quantum hardware, one would like to employ a hardware-efficient ansatz6 for V(θ). As recent large-scale implementations for chemistry7 and optimization8 applications have shown, this ansatz leads to smaller errors due to hardware noise. However, one of the few known scaling results is that deep versions of randomly initialized hardware-efficient ansatzes lead to exponentially vanishing gradients9. Very little is known about the scaling of the gradient in such ansatzes for shallow depths, and it would be especially useful to have a converse bound that guarantees non-exponentially vanishing gradients for certain depths. This motivates our work, where we rigorously investigate the gradient scaling of VQAs as a function of the circuit depth.

The other motivation for our work is the recent explosion in the number of proposed VQAs. The Variational Quantum Eigensolver (VQE) is the most famous VQA. It aims to prepare the ground state of a given Hamiltonian H = ∑_α c_α σ_α, with H expanded as a sum of local Pauli operators10. In VQE, the cost function is obviously the energy C = ⟨ψ|H|ψ⟩ of the trial state |ψ⟩. However, VQAs have been proposed for other applications, like quantum data compression11, quantum error correction12, quantum metrology13, quantum compiling14–17, quantum state diagonalization18,19, quantum simulation20–23, fidelity estimation24, unsampling25, consistent histories26, and linear systems27–29. For these applications, the choice of C is less obvious. Put another way, if one reformulates these VQAs as ground-state problems (which can be done in many cases), the choice of Hamiltonian H is less intuitive. This is because many of these applications are abstract, rather than associated with a physical Hamiltonian.

We remark that polynomially vanishing gradients imply that the number of shots needed to estimate the gradient should grow as O(poly(n)). In contrast, exponentially vanishing gradients (i.e., barren plateaus) imply that derivative-based optimization will have exponential scaling30, and this scaling can also apply to derivative-free optimization31. Assuming a polynomial number of shots per optimization step, one will be able to resolve against finite sampling noise and train the parameters if the gradients vanish polynomially. Hence, we employ the term "trainable" for polynomially vanishing gradients.

In this work, we connect the trainability of VQAs to the choice of C. For the abstract applications in refs. 11–29, it is important for C to be operational, so that small values of C imply that the task is almost accomplished. Consider an example of state preparation, where the goal is to find a gate sequence that prepares a target state |ψ₀⟩. A natural cost function is the square of the trace distance D_T between |ψ₀⟩ and |ψ⟩ = V(θ)|0⟩, given by C_G = D_T(|ψ₀⟩, |ψ⟩)², which is equivalent to

$$C_G = \mathrm{Tr}\big[\,O_G\, V(\theta)\,|\mathbf{0}\rangle\langle\mathbf{0}|\,V(\theta)^{\dagger}\big]\,, \qquad (1)$$

with $O_G = \mathbb{1} - |\psi_0\rangle\langle\psi_0|$. Note that $\sqrt{C_G} \geq |\langle\psi|M|\psi\rangle - \langle\psi_0|M|\psi_0\rangle|$ has a nice operational meaning as a bound on the expectation-value difference for a POVM element M.

However, here we argue that this cost function and others like it exhibit exponentially vanishing gradients. Namely, we consider global cost functions, where one directly compares states or operators living in exponentially large Hilbert spaces (e.g., |ψ⟩ and |ψ₀⟩). These are precisely the cost functions that have operational meanings for tasks of interest, including all tasks in refs. 11–29. Hence, our results imply that a non-trivial subset of these references will need to revise their choice of C.

Interestingly, we demonstrate vanishing gradients for shallow PQCs. This is in contrast to McClean et al.9, who showed vanishing gradients for deep PQCs. They noted that randomly initializing θ for a V(θ) that forms a 2-design leads to a barren plateau, i.e., with the gradient vanishing exponentially in the number of qubits, n. Their work implied that researchers must develop either clever parameter initialization strategies32,33 or clever PQC ansatzes4,34,35. Similarly, our work implies that researchers must carefully weigh the balance between trainability and operational relevance when choosing C.

While our work is for general VQAs, barren plateaus for global cost functions were noted for specific VQAs and for a very specific tensor-product example by our research group14,18, and more recently in ref. 29. This motivated the proposal of local cost functions14,16,18,22,25–27, where one compares objects (states or operators) with respect to each individual qubit, rather than in a global sense, and therein it was shown that these local cost functions have indirect operational meaning. Our second result is that these local cost functions have gradients that vanish polynomially rather than exponentially in n, and hence have the potential to be trained. This holds for V(θ) with depth O(log n). Figure 1 summarizes our two main results.

Finally, we illustrate our main results for an important example: quantum autoencoders11. Our large-scale numerics show that the global cost function proposed in ref. 11 has a barren plateau. On the other hand, we propose a novel local cost function that is trainable, hence making quantum autoencoders a scalable application.

Results
Warm-up example. To illustrate cost-function-dependent barren plateaus, we first consider a toy problem corresponding to the state preparation problem in the Introduction with the target state |ψ₀⟩ being $|\mathbf{0}\rangle$. We assume a tensor-product ansatz of the form $V(\theta) = \bigotimes_{j=1}^{n} e^{-i\theta_j \sigma_x^{(j)}/2}$, with the goal of finding the angles θ_j such that $V(\theta)|\mathbf{0}\rangle = |\mathbf{0}\rangle$. Employing the global cost of (1) results in $C_G = 1 - \prod_{j=1}^{n}\cos^2\frac{\theta_j}{2}$. The barren plateau can be detected via the variance of its gradient: $\mathrm{Var}\big[\tfrac{\partial C_G}{\partial\theta_j}\big] = \tfrac{1}{8}\big(\tfrac{3}{8}\big)^{n-1}$, which is exponentially vanishing in n. Since the mean value is $\big\langle\tfrac{\partial C_G}{\partial\theta_j}\big\rangle = 0$, the gradient concentrates exponentially around zero. On the other hand, consider a local cost function:

$$C_L = \mathrm{Tr}\big[\,O_L\, V(\theta)\,|\mathbf{0}\rangle\langle\mathbf{0}|\,V(\theta)^{\dagger}\big]\,, \qquad (2)$$

with

$$O_L = \mathbb{1} - \frac{1}{n}\sum_{j=1}^{n} |0\rangle\langle 0|_j \otimes \mathbb{1}_{\bar{j}}\,, \qquad (3)$$

where $\mathbb{1}_{\bar{j}}$ is the identity on all qubits except qubit j. Note that C_L vanishes under the same conditions as C_G.
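The closed-form variances quoted above follow directly from the product and sum structure of C_G and C_L, so they can be checked without any statevector simulation. The following minimal numpy sketch (not part of the original article) samples random angles and compares the empirical gradient variances against $\tfrac{1}{8}(\tfrac{3}{8})^{n-1}$ and $\tfrac{1}{8n^2}$:

```python
import numpy as np

rng = np.random.default_rng(0)

def grad_samples(n, n_samples=200_000):
    """Sample dC/dtheta_1 for the tensor-product ansatz V = prod_j exp(-i theta_j X_j / 2).
    C_G = 1 - prod_j cos^2(theta_j/2) and C_L = 1 - (1/n) sum_j cos^2(theta_j/2)."""
    theta = rng.uniform(-np.pi, np.pi, size=(n_samples, n))
    c2 = np.cos(theta / 2) ** 2
    # d/dtheta_1 [-prod_j cos^2(theta_j/2)] = (1/2) sin(theta_1) * prod_{j>1} cos^2(theta_j/2)
    dCG = 0.5 * np.sin(theta[:, 0]) * np.prod(c2[:, 1:], axis=1)
    # d/dtheta_1 [-(1/n) sum_j cos^2(theta_j/2)] = (1/(2n)) sin(theta_1)
    dCL = 0.5 * np.sin(theta[:, 0]) / n
    return dCG, dCL

for n in (4, 8, 12, 16):
    dCG, dCL = grad_samples(n)
    print(f"n={n:2d}  Var[dC_G] ~ {dCG.var():.2e} (theory {(1/8)*(3/8)**(n-1):.2e})"
          f"   Var[dC_L] ~ {dCL.var():.2e} (theory {1/(8*n**2):.2e})")
```

The global-cost variance shrinks by roughly a factor of 3/8 for every added qubit, while the local-cost variance only decays as 1/n².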


Proposition 1. Let θ_j be uniformly distributed on [−π, π] ∀j. For any δ ∈ (0, 1), the probability that C_G ≤ δ satisfies

$$\Pr\{C_G \leq \delta\} \;\leq\; \frac{1}{1-\delta}\left(\frac{1}{2}\right)^{n}\,. \qquad (4)$$

For any δ ∈ [1/2, 1], the probability that C_L ≤ δ satisfies

$$\Pr\{C_L \leq \delta\} \;\geq\; \frac{(2\delta-1)^2}{\frac{1}{2n} + (2\delta-1)^2} \;\xrightarrow[\;n\to\infty\;]{}\; 1\,. \qquad (5)$$
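Both probabilities in Proposition 1 are easy to estimate by Monte Carlo for the toy costs above. The short sketch below (ours, not the article's, using the same uniform-angle assumption) illustrates how quickly the narrow gorge of C_G empties out with n, while low-cost regions of C_L remain typical:

```python
import numpy as np

rng = np.random.default_rng(1)
delta, n_samples = 0.9, 500_000

for n in (4, 10, 20, 40):
    theta = rng.uniform(-np.pi, np.pi, size=(n_samples, n))
    c2 = np.cos(theta / 2) ** 2
    CG = 1.0 - np.prod(c2, axis=1)   # global cost of Eq. (1) for the toy problem
    CL = 1.0 - np.mean(c2, axis=1)   # local cost of Eqs. (2)-(3)
    # Eq. (4): Pr{C_G <= delta} is exponentially small; Eq. (5): Pr{C_L <= delta} -> 1
    print(f"n={n:2d}  Pr[C_G<={delta}] ~ {np.mean(CG <= delta):.4f}"
          f"  (bound {(0.5 ** n) / (1 - delta):.1e})"
          f"   Pr[C_L<={delta}] ~ {np.mean(CL <= delta):.4f}")
```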

General framework. For our general results, we consider a family of cost functions that can be expressed as the expectation value of an operator O as follows Âà C ¼ Tr OVðθÞρVyðθÞ ; ð6Þ where ρ is an arbitrary quantum state on n qubits. Note that this framework includes the special case where ρ could be a pure state, as well as the more special case where ρ ¼ ji0 hj0 , which is the input state for many VQAs such as VQE. Moreover, in VQE, one chooses O = H, where H is the physical Hamiltonian. In general, the choice of O and ρ essentially defines the application of interest of the particular VQA. It is typical to express O as a linear combination of the form 1 ∑N ≠ 1 R O ¼ c0 þ i¼1 ciOi. Here Oi , ci 2 , and we assume that at ≠ least one ci 0. Note that CG and CL in (1) and (2) fall under this framework. In our main results below, we will consider two Fig. 1 Summary of our main results. McClean et al.9 proved that a barren different choices of O that respectively capture our general plateau can occur when the depth D of a hardware-efficient ansatz is notions of global and local cost functions and also generalize the D 2 ðpoly ðnÞÞ. Here we extend these results by providing bounds for the O aforementioned CG and CL. variance of the gradient of global and local cost functions as a function of D. As shown in Fig. 3a, V(θ) consists of L layers of m-qubit In particular, we find that the barren plateau phenomenon is cost-function θ unitaries Wkl( kl), or blocks, acting on alternating groups of m dependent. a For global cost functions (e.g., Eq. (1)), the landscape will neighboring qubits. We refer to this as an Alternating Layered exhibit a barren plateau essentially for all depths D. b For local cost Ansatz. We remark that the Alternating Layered Ansatz will be a functions (e.g., Eq. (2)), the gradient vanishes at worst polynomially and hardware-efficient ansatz so long as the gates that compose each hence is trainable when D 2 Oðlog ðnÞÞ, while barren plateaus occur for block are taken from a set of gates native to a specific device. As D 2 Oðpoly ðnÞÞ, and between these two regions the gradient transitions depicted in Fig. 3c, the one dimensional Alternating Layered from polynomial to exponential decay. Ansatz can be readily implemented in devices with one- dimensional connectivity, as well as in devices with two- dimensional connectivity (such as that of IBM’s36 and Google’s37 14,16 = ⇔ = quantum devices). That is, with both one- and two-dimensional vanishes under the same conditions as CG , CL 0 CG 0. fi 1 ∑n 2 θj hardware connectivity one can group qubits to form an We nd CL ¼ 1 À n j¼1 cos 2 , and the variance of its gradient is ∂ Alternating Layered Ansatz as in Fig. 3a. CL 1 = … θ Var½ ∂θj Š¼8n2, which vanishes polynomially with n and hence The index l 1, , L in Wkl( kl) indicates the layer that exhibits no barren plateau. Figure 2 depicts the cost landscapes of contains the block, while k = 1, …, ξ indicates the qubits it acts = ξ CG and CL for two values of n and shows that the barren plateau upon. We assume n is a multiple of m, with n m , and that m fi can be avoided here via a local cost function. does not scale with n. As depicted in Fig. 3a, we de ne Sk as the fi Moreover, this example allows us to delve deeper into the m-qubit subsystem on which WkL acts, and we de ne S¼fSkg as cost landscape to see a phenomenon that we refer to as a the set of all such subsystems. Let us now consider a block θ narrow gorge. 
While a barren plateau is associated with a flat Wkl( kl) in the lth layer of the ansatz. For simplicity we landscape, a narrow gorge refers to the steepness of the valley henceforth use W to refer to a given W (θ ). As shown in the ν kl kl that contains the global minimum. This phenomenon is Methods section, given a θ ∈ θ that parametrizes a rotation ν kl Àiθ σν =2 illustrated in Fig. 2, where each dot corresponds to cost values e (with σν a Pauli operator) inside a given block W, one θ obtained from randomly selected parameters .ForCG we see can always express that very few dots fall inside the narrow gorge, while for CL the ∂W Ài narrow gorge is not present. Note that the narrow gorge ν :¼ ∂νW ¼ W σνW ; ð7Þ ∂θ 2 A B makesithardertotrainCG since the learning rate of descent- based optimization algorithms must be exponentially small in where WA and WB contain all remaining gates in W, and are order not to overstep the narrow gorge. The following properly defined in the Methods section. ν proposition (proved in the Supplementary Note 2) formalizes The contribution to the gradient ∇C from a parameter θ in the ∂ the narrow gorge for CG and its absence for CL by block W is given by the partial derivative νC. While the value of characterizing the dependence on n of the probability C ⩽ δ. ∂νC depends on the specific parameters θ, it is useful to compute ∂ θ This probability is associated with the parameter space volume h νCiV , i.e., the average gradient over all possible unitaries V( ) that leads to C ⩽ δ. within the ansatz. Such an average may not be representative near


Q C n 2 θj= n = n = Fig. 2 Cost function landscapes. a Two-dimensional cross-section through the landscape of G ¼ 1 À j¼1 cos ð 2Þ for 4 (blue) and 24 C 1 ∑n 2 θj= n (orange). b The same cross-section through the landscape of L ¼ 1 À n j¼1 cos ð 2Þ is independent of . In both cases, 200 Haar distributed points are shown, with very few (most) of these points lying in the valley containing the global minimum of CG (CL). the minimum of C, although it does provide a good estimate of the expected gradient when randomly initializing the angles in V (θ). In the Methods Section we explicitly show how to compute 〈…〉 averages of the form V, and in the Supplementary Note 3 we provide a proof for the following Proposition.

Proposition 2. The average of the partial derivative of any cost function of the form (6) with respect to a parameter θ_ν in a block W of the ansatz in Fig. 3 is

$$\langle \partial_\nu C\rangle_V = 0\,, \qquad (8)$$

provided that either W_A or W_B of (7) form a 1-design. Here we recall that a t-design is an ensemble of unitaries, such that sampling over their distribution yields the same properties as sampling random unitaries from the unitary group with respect to the Haar measure, up to the first t moments38. The Methods section provides a formal definition of a t-design. Proposition 2 states that the gradient is not biased in any particular direction. To analyze the trainability of C, we consider the second moment of its partial derivatives:

∂ ∂ 2 ; Var½ νCŠ¼ ðÞνC V ð9Þ where we used the fact that h∂νCi ¼ 0. The magnitude of Var V Fig. 3 Alternating Layered Ansatz. a Each block Wkl acts on m qubits and [∂νC] quantifies how much the partial derivative concentrates is parametrized via (27). As shown, we define Sk as the m-qubit subsystem around zero, and hence small values in (9) imply that the slope of on which WkL acts, where L is the last layer of V(θ). Given some block W, the landscape will typically be insufficient to provide a cost- fi ’ it is useful for our proofs (outlined in the Methods) to write minimizing direction. Speci cally, from Chebyshev s inequality, V θ V 1 W V V ð Þ¼ Rð w Þ L, where R contains all gates in the forward light- Var[∂νC] bounds the probability that the cost-function partial cone L of W. The forward light-cone L is defined as all gates with at least derivative deviates from its mean value (of zero) as W fi 2 one input qubit causally connected to the output qubits of .Wede ne L PrðÞj∂νCj ≥ c ≤ Var½∂νCŠ=c for all c >0. as the compliment of L, Sw as the m-qubit subsystem on which W acts, and Sw as the n − m qubit subsystem on which W acts trivially. b The operator Main results. Here we present our main theorems and corollaries, O S O i acts nontrivially only in subsystem kÀ1 2 S, while i0 acts nontrivially on with the proofs sketched in the Methods and detailed in the the first m/2 qubits of Sk+1, and on the second m/2 qubits of Sk. c Depiction Supplementary Information. In addition, in the Methods section of the Alternating Layered Ansatz with one- and two-dimensional we provide some intuition behind our main results by analyzing a connectivity. Each circle represents a physical qubit. generalization of the warm-up example where V(θ) is composed of a single layer of the ansatz in Fig. 3. This case bridges the gap acting on subsystem S , or (ii) When N is arbitrary and Ob is between the warm-up example and our main theorems and also k N ik b2 ≤ m b m σμ showcases the tools used to derive our main result. traceless with Tr½OikŠ 2 (for example, when Oik¼ j¼1 j is a σμ 1 ; σx; σy; σz The following theorem provides an upper bound on the tensor product of Pauli operators j 2f j j j j g, with at μ variance of the partial derivative of a global cost function which least one σ ≠ 1). Note that case (i) includes C of (1)asa can be expressed as the expectation value of an operator of the j G special case. form N 1 ∑ b b b : O ¼ c0 þ ciOi1 Oi2 ÁÁÁ Oiξ ð10Þ Theorem 1 i¼1 θν fi = Consider a trainable parameter in a block W of the ansatz in Speci cally, we consider two cases of interest: (i) When N 1 and Fig. 3. Let Var[∂νC] be the variance of the partial derivative of a b b2 b ≠1 θν each O1k is a non-trivial projector (O1k ¼ O1k ) of rank rk global cost function C (with O given by (10)) with respect to .If


W , W of (7), and each block in V(θ) form a local 2-design, then Theorem 2 A B ν Var[∂νC] is upper bounded by Consider a trainable parameter θ in a block W of the ansatz in Fig. 3. Let Var[∂νC] be the variance of the partial derivative of an Var½∂νCŠ ⩽ F ðL; lÞ : ð11Þ n m-local cost function C (with O given by (16)) with respect to θν. θ = b WA, WB of (7), and each block in V( ) form a local 2-design, then (i) For N 1 and whenQ each O1k is a non-trivial projector, ∂ fi ξ 2 Var[ νC] is lower bounded by then de ning R ¼ k¼1 rk, we have G ðL; lÞ ⩽ Var½∂νCŠ ; ð17Þ 22mþð2mÀ1ÞðLÀlÞ n F ðL; lÞ¼ c2R : ð12Þ n 2m n ð2À 3 Þn 1 ð2 À 1ÞÁ3m Á 2 m with b fi b 2mðlþ1ÞÀ1 (ii) For arbitrary N and when each Oik satis es Tr½OikŠ¼0 and ; 2 GnðL lÞ¼ 2 Lþl Tr½Ob Š ⩽ 2m, then ð22m À 1Þ ð2m þ 1Þ ik ð18Þ 2mðLÀlþ1Þþ1 ´ ∑ ∑ 2ϵ ρ ϵ b ; 2 N c ð ; 0 Þ ðOiÞ ; 0 i k k ; ∑ : i2iL ðk k Þ2kL FnðL lÞ¼ 4 cicj ð13Þ B 2n 3À n ; 0⩾ 3m Á 2ðÞm i j¼1 k k b where iL is the set of i indices whose associated operators Oi act on qubits in the forward light-cone L of W, and k is the From Theorem 1 we derive the following corollary. LB set of k indices whose associated subsystems Sk are in the back- fi ϵ ward light-cone LB of W. Here we de ned the function ðMÞ¼ Corollary 1 ; 1= – DHSðÞM TrðMÞ dM where DHS is the Hilbert Schmidt distance Consider the function Fn(L, l). ρ and dM is the dimension of the matrix M. In addition, k;k0 is the = b partial trace of the input state ρ down to the subsystems (i) Let N 1 and let each O1k be a non-trivial projector, as in 2 n ::: case (i) of Theorem 1. If c R 2Oð2 Þ and if the number of SkSkþ1 Sk0 . 1 ϵ b layers L 2Oðpoly ðlog ðnÞÞÞ, then Let us make a few remarks. First, note that the ðOiÞ in the  θ b À 1À 1 log 3 n lower bound indicates that training V( ) is easier when Oi is far F ðÞ2OL; l 2 ðÞm 2 ; ð14Þ ϵ ρ n from the identity. Second, the presence of ð k;k0 Þ in Gn(L, l) ∂ implies that we have no guarantee on the trainability of a which implies that Var[ νC] is exponentially vanishing in n θν ρ if m ⩾ 2. parameter in W if is maximally mixed on the qubits in the (ii) Let N be arbitrary, and let each Ob satisfy Tr½Ob Š¼0 and backwards light-cone. ik ik From Theorem 2 we derive the following corollary for m-local b2 ⩽ m n Tr½OikŠ 2 , as in case (ii) of Theorem 1. If N 2Oð2 Þ, cost functions, which guarantees the trainability of the ansatz for ci 2Oð1Þ, and if the number of layers shallow circuits. L 2Oðpoly ðlog ðnÞÞÞ, then  ; 1 ; Corollary 2 FnðÞ2OL l ð15Þ ðÞ1À 1 n Consider the function Fn(L, l). Let O be an operator of the form 2 m 2 b (16), as in Theorem 2. If at least one term c ϵðρ 0 ÞϵðO Þ in the ∂ i k;k i which implies that Var[ νC] is exponentially vanishing in n sum in (18) vanishes no faster than Ω(1/poly(n)), and if the ⩾ if m 2. number of layers L is Oðlog ðnÞÞ, then  Let us now make several important remarks. First, note that 1 part (i) of Corollary 1 includes as a particular example the cost G ðL; lÞ2Ω : ð19Þ n poly ðnÞ function CG of (1). Second, part (ii) of this corollary also includes as particular examples operators with N 2Oð1Þ, as well as 2 b On the other hand, if at least one term c ϵðρ ; 0 ÞϵðO Þ in the sum N 2Oðpoly ðnÞÞ. Finally, we remark that F (L, l) becomes trivial ÀÁi k k i n in (18) vanishes no faster than Ω 1=2poly ðlog ðnÞÞ , and if the when the number of layers L is Ω(poly(n)), however, as we fi ∂ number of layers is Oðpoly ðlog ðnÞÞÞ, then discuss below, we can still nd that Var[ νCG] vanishes  exponentially in this case. 
1 G ðL; lÞ2Ω : ð20Þ Our second main theorem shows that barren plateaus can be n 2poly ðlog ðnÞÞ avoided for shallow circuits by employing local cost functions. b Hence, when L is Oðpoly ðlog ðnÞÞÞ there is a transition region Here we consider m-local cost functions where each Oi acts nontrivially on at most m qubits and (on these qubits) can be where the lower bound vanishes faster than polynomially, but μ μ0 slower than exponentially. b b i b expressed as Oi ¼ Oi Oi : We finally justify the assumption of each block being a local N μ μ0 2-design from the fact that shallow circuit depths lead to such 1 ∑ b i b ; O ¼ c0 þ ciOi Oi ð16Þ local 2-designs. Namely, it has been shown that one-dimensional i¼1 fi μ 2-designs have ef cient quantum circuit descriptions, requiring b i 2 38 where Oi are operators acting on m/2 qubits which can be Oðm Þ gates to be exactly implemented ,orOðmÞ to be written as a tensor product of Pauli operators. Here, we assume approximately implemented39,40. Hence, an L-layered ansatz in the summation in Eq. (16) includes two possible cases as which each block forms a 2-design can be exactly implemented μ μ0 2 b i b with a depth D 2Oðm LÞ, and approximately implemented with schematically shown in Fig. 3b: First, when Oi (Oi ) acts on the μ μ0 D 2OðmLÞ. For the case of two-dimensional connectivity, it has first (last) m/2 qubits of a given S , and second, when Ob i (Ob ) k i i beenpffiffiffiffi shown that approximate 2-designs require a circuit depth of acts on the last (first) m/2 qubits of a given S (S + ). This type of 40 k k 1 Oð mÞ to be implemented . Therefore,pffiffiffiffi in this case the depth of cost function includes any ultralocal cost function (i.e., where the the layered ansatz is D 2Oð mLÞ. The latter shows that b Oi are one-body) as in (2), and also VQE Hamiltonians with up to increasing the dimensionality of the circuit reduces the circuit m/2 neighbor interactions. Then, the following theorem holds. depth needed to make each block a 2-design.
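To make the ansatz structure behind Theorems 1 and 2 concrete, here is a small, self-contained sketch (not from the paper; the block contents and parameter counts are an illustrative choice) that assembles the Alternating Layered Ansatz of Fig. 3 for 1D connectivity: each layer applies parametrized m = 2 qubit blocks to alternating pairs of neighboring qubits, and the full V(θ) is built as a dense unitary so that small-n experiments are easy to run.

```python
import numpy as np
from functools import reduce

rng = np.random.default_rng(2)
I2 = np.eye(2, dtype=complex)
CZ = np.diag([1, 1, 1, -1]).astype(complex)

def ry(t):
    return np.array([[np.cos(t / 2), -np.sin(t / 2)],
                     [np.sin(t / 2),  np.cos(t / 2)]], dtype=complex)

def block(p):
    """Illustrative parametrized m = 2 block: Ry x Ry, then CZ, then Ry x Ry (4 angles)."""
    return np.kron(ry(p[2]), ry(p[3])) @ CZ @ np.kron(ry(p[0]), ry(p[1]))

def embed(gate4, q, n):
    """Embed a 2-qubit gate acting on neighboring qubits (q, q+1) into the n-qubit space."""
    factors = [I2] * q + [gate4] + [I2] * (n - q - 2)
    return reduce(np.kron, factors)

def alternating_layered_ansatz(params, n, L):
    """V(theta) of Fig. 3a with 1D connectivity: layer l pairs qubits starting at qubit (l mod 2)."""
    V = np.eye(2 ** n, dtype=complex)
    for l in range(L):
        # alternating pairing; with open boundaries, odd layers have one fewer block
        for k, q in enumerate(range(l % 2, n - 1, 2)):
            V = embed(block(params[l, k]), q, n) @ V
    return V

n, L = 6, 4
params = rng.uniform(-np.pi, np.pi, size=(L, n // 2, 4))
V = alternating_layered_ansatz(params, n, L)
print("V is unitary:", np.allclose(V.conj().T @ V, np.eye(2 ** n)))
```

A Ry/CZ block of this kind only generates real-valued unitaries (as in the numerical ansatz of Fig. 4), so it will not form an exact local 2-design; richer blocks would be needed for that.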


Moreover, it has been shown that the Alternating Layered Ansatz of Fig. 3 will form an approximate one-dimensional 2-design on n qubits if the number of layers is OðnÞ40. Hence, for deep circuits, our ansatz behaves like a random circuit and we recover the barren plateau result of9 for both local and global cost functions.
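Numerically, the 2-design property invoked throughout can be probed through the second frame potential $F^{(2)} = \mathbb{E}_{U,V}|\mathrm{Tr}(U^{\dagger}V)|^{4}$, which equals 2 for an exact 2-design on the unitary group and exceeds 2 otherwise. The sketch below (not part of the article) estimates $F^{(2)}$ for Haar-random m = 2 blocks, which reproduce the 2-design value, and for a restricted ensemble of single-qubit tensor products, which does not.

```python
import numpy as np

rng = np.random.default_rng(3)

def haar_unitary(d):
    """Haar-random d x d unitary via QR decomposition of a Ginibre matrix."""
    z = (rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))) / np.sqrt(2)
    q, r = np.linalg.qr(z)
    return q * (np.diag(r) / np.abs(np.diag(r)))   # fix column phases so the distribution is Haar

def frame_potential(sampler, n_pairs=20_000):
    """Estimate F^(2) = E|Tr(U^dag V)|^4; equals t! = 2 for an exact 2-design."""
    vals = np.empty(n_pairs)
    for i in range(n_pairs):
        U, V = sampler(), sampler()
        vals[i] = np.abs(np.trace(U.conj().T @ V)) ** 4
    return vals.mean()

haar_block = lambda: haar_unitary(4)                                 # generic m = 2 block
product_block = lambda: np.kron(haar_unitary(2), haar_unitary(2))    # restricted (non-entangling) block

print("Haar 2-qubit blocks:   F^(2) ~", round(frame_potential(haar_block), 3), "(2-design value: 2)")
print("single-qubit products: F^(2) ~", round(frame_potential(product_block), 3), "(> 2, not a 2-design)")
```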

Numerical simulations. As an important example to illustrate the cost-function-dependent barren plateau phenomenon, we consider quantum autoencoders11,41–44. In particular, the pio- neering VQA proposed in ref. 11 has received significant literature attention, due to its importance to learning and quantum data compression. Let us briefly explain the algorithm in ref. 11. Consider a bipartite quantum system AB composed of nA and nB qubits, respectively, and let fpμ; jψμig be an ensemble of pure states on AB. The goal of the quantum autoencoder is to train a gate sequence V(θ) to compress this ensemble into the A subsystem, such that one can recover each state jψμi with high fidelity from the information in subsystem A. One can think of B as the “trash” since it is discarded after the action of V(θ). 11 To quantify the degree of data compression, ref. proposed a Fig. 4 Alternating Layered Ansatz for V(θ) employed in our numerical cost function of the form: simulations. Each layer is composed of control-Z gates acting on 0 0 0 ρout alternating pairs of neighboring qubits which are preceded and followed by CG ¼ 1 À Tr½jihjB Šð21Þ Àiθ σ =2 single qubit rotations around the y-axis, RyðθiÞ¼e i y . Shown is the case 0 θ ρin θ y ; of two layers, n = 1, and n = 10 qubits. The number of variational ¼ Tr½OGVð Þ ABVð Þ Š ð22Þ A B parameters and gates scales linearly with n : for the case shown there are ρin ∑ ψ ψ B where AB ¼ μpμj μih μj is the ensemble-average input state, 71 gates and 51 parameters. While each block in this ansatz will not form an ρout ∑ ψ0 ψ0 B ¼ μpμTrA½j ih jŠ is the ensemble-average trash state, exact local 2-design, and hence does not fall under our theorems, one can ψ0 θ ψ 0 still obtain a cost-function-dependent barren plateau. and ji¼Vð Þj μi. Equation (22) makes it clear that CG has the 0 1 1 0 0 form in (6), and OG ¼ AB À A jihjis a global fi of the form in (10). Hence, according to Corollary 1, C0 exhibits a this ansatz is a simpli ed version of the ansatz in Fig. 3, as we can G only generate unitaries with real coefficients. All parameters in V barren plateau for large n . (Specifically, Corollary 1 applies in B (θ) were randomly initialized and as detailed in the Methods this context when n < n ). As a result, large-scale data A B section, we employ a gradient-free training algorithm that gra- compression, where one is interested in discarding large numbers dually increases the number of shots per cost-function evaluation. of qubits, will not be possible with C0 . G Analysis of the n-dependence. Figure 5 shows representative To address this issue, we propose the following local cost results of our numerical implementations of the quantum function hi autoencoder in ref. 11 obtained by training V(θ) with the global nB 0 1 ∑ 1 ρout and local cost functions respectively given by (22) and (23). CL ¼ 1 À Trji 0 hj0 j j B ð23Þ fi fi fi nB j¼1 Speci cally, while we train with nite sampling, in the gures we show the exact cost-function values versus the number of ¼ Tr½O0 VðθÞρin VðθÞyŠ ; ð24Þ iterations. Here, the top (bottom) axis corresponds to the number L AB 0 0 = n of iterations performed while training with CG (CL). For nB 10 where O0 ¼ 1 À 1 ∑ B 1 ji0 hj0 1 , and 1 is the θ L AB nB j¼1 A j j j and 15, Fig. 5 shows that we are able to train V( ) for both cost = identity on all qubits in B except the jth qubit. As shown in the functions. 
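To make the two autoencoder costs concrete, the following sketch (ours, not the authors' code; the compressing circuit here is just a placeholder Haar-random unitary rather than a trained V(θ)) evaluates the global cost C'_G of Eqs. (21)–(22) and the local cost C'_L of Eq. (23) from the ensemble-average trash state, for the two-state ensemble of Eqs. (25)–(26) with a small n_B.

```python
import numpy as np

rng = np.random.default_rng(4)
nA, nB = 1, 4                       # small example; the paper scales nB up to 100
n, dA, dB = nA + nB, 2 ** nA, 2 ** nB

def basis_state(bits):
    """Computational-basis state |bits> as a statevector (first bit = subsystem A)."""
    v = np.zeros(2 ** len(bits), dtype=complex)
    v[int("".join(map(str, bits)), 2)] = 1.0
    return v

# Input ensemble of Eqs. (25)-(26): |0>_A|0,...,0>_B with p=2/3 and |1>_A|1,1,0,...,0>_B with p=1/3
ensemble = [(2 / 3, basis_state([0] + [0] * nB)),
            (1 / 3, basis_state([1, 1, 1] + [0] * (nB - 2)))]

# Placeholder "trained" circuit on AB (illustration only)
Z = rng.normal(size=(2 ** n, 2 ** n)) + 1j * rng.normal(size=(2 ** n, 2 ** n))
V, _ = np.linalg.qr(Z)

# Ensemble-average trash state rho_B^out = sum_mu p_mu Tr_A[ V |psi_mu><psi_mu| V^dag ]
rho_B = np.zeros((dB, dB), dtype=complex)
for p, psi in ensemble:
    out = (V @ psi).reshape(dA, dB)                        # indices (A, B)
    rho_B += p * np.einsum('ab,ac->bc', out, out.conj())   # partial trace over A

# Global cost, Eq. (21): C'_G = 1 - <0...0| rho_B^out |0...0>
CG = 1.0 - np.real(rho_B[0, 0])

# Local cost, Eq. (23): C'_L = 1 - (1/nB) sum_j Tr[ (|0><0|_j x 1) rho_B^out ]
diag = np.real(np.diag(rho_B))
CL = 0.0
for j in range(nB):
    mask = np.array([((b >> (nB - 1 - j)) & 1) == 0 for b in range(dB)])
    CL += 1.0 - diag[mask].sum()
CL /= nB

print(f"C'_G = {CG:.4f},  C'_L = {CL:.4f},  "
      f"C'_L <= C'_G <= nB*C'_L holds: {CL - 1e-12 <= CG <= nB * CL + 1e-12}")
```

The final check illustrates the faithfulness relation C'_L ≤ C'_G ≤ n_B C'_L quoted from Supplementary Note 9.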
For nB 20, the global cost function initially presents a 0 fi 0 ⩽ 0 ⩽ 0 Supplementary Note 9, CL satis es CL CG nBCL, which plateau in which the optimizing algorithm is not able to 0 implies that CL is faithful (vanishing under the same conditions determine a minimizing direction. However, as the number of 0 0 as CG). Furthermore, note that OL has the form in (16). Hence shots per function evaluation increases, one can eventually 0 0 Corollary 2 implies that CL does not exhibit a barren plateau for minimize CG. Such result indicates the presence of a barren shallow ansatzes. plateau where the gradient takes small values which can only be Here we simulate the autoencoder algorithm to solve a simple detected when the number of shots becomes sufficiently large. In problem where nA = 1, and where the input state ensemble this particular example, one is able to start training at around 140 fp ; jψ ig is given by iterations. μ μ When n > 20 we are unable to train the global cost function, ψ ¼ ji0 ji0; 0; 0; ¼ ; 0 ; with p ¼ 2=3 ; ð25Þ B 1 A B 1 while always being able to train our proposed local cost function.

0 0 ψ ¼ ji1 ji1; 1; 0; ¼ ; 0 ; with p ¼ 1=3 : ð26Þ Note that the number of iterations is different for CG and CL,as 2 A B 2 for the global cost function case we reach the maximum number In order to analyze the cost-function-dependent barren plateau of shots in fewer iterations. These results indicate that the global phenomenon, the dimension of subsystem B is gradually cost function of (22) exhibits a barren plateau where the gradient increased as nB = 10, 15, …, 100. of the cost function vanishes exponentially with the number of qubits, and which arises even for constant depth ansatzes. We Numerical results. In our heuristics, the gate sequence V(θ)is remark that in principle one can always find a minimizing 0 given by two layers of the ansatz in Fig. 4, so that the number of direction when training CG, although this would require a θ gates and parameters in V( ) increases linearly with nB. Note that number of shots that increases exponentially with nB. Moreover,


Fig. 5 Cost versus number of iterations for the quantum autoencoder problem defined by Eqs. (25)–(26). In all cases we employed two layers of the ansatz shown in Fig. 4, and we set n_A = 1 while increasing n_B = 10, 15, …, 100. The top (bottom) axis corresponds to the global cost function C'_G of Eq. (22) (local cost function C'_L of (23)). As can be seen, C'_G can be trained up to n_B = 20 qubits, while C'_L can be trained in all cases. These results indicate that the global cost function presents a barren plateau even for a shallow-depth ansatz, and that this can be avoided by employing a local cost function.
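Why the global-cost training stalls until the shot budget grows can be seen with a short back-of-the-envelope simulation (ours, not the paper's): a probability-valued cost estimated from N shots carries statistical noise of order $1/\sqrt{N}$, so a parameter update only produces a resolvable cost change once N exceeds roughly the inverse square of the gradient, and on a barren plateau that threshold grows exponentially with n.

```python
import numpy as np

rng = np.random.default_rng(5)

def estimate_cost(true_cost, shots):
    """Simulate estimating a probability-valued cost from a finite number of single-shot measurements."""
    return rng.binomial(shots, true_cost) / shots

def shots_to_resolve(gradient, step=0.1, trials=200, max_shots=2 ** 26):
    """Smallest shot budget (powers of two) for which a cost change of size gradient*step
    is detected with the right sign in >95% of trials; None if max_shots is exceeded."""
    c0, c1 = 0.5, 0.5 - gradient * step
    shots = 16
    while shots <= max_shots:
        diffs = np.array([estimate_cost(c0, shots) - estimate_cost(c1, shots) for _ in range(trials)])
        if np.mean(diffs > 0) > 0.95:
            return shots
        shots *= 2
    return None

for n in (4, 8, 12, 16, 20):
    grad = (3 / 8) ** (n / 2)   # typical gradient magnitude ~ sqrt(Var), cf. the warm-up example
    print(f"n={n:2d}  |grad| ~ {grad:.1e}  shots needed ~ {shots_to_resolve(grad)}")
```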

On the other hand, Fig. 5 shows that the barren plateau is avoided when employing a local cost function since we can train 0 0 CL for all considered values of nB. Moreover, as seen in Fig. 5, CL can be trained with a small number of shots per cost-function evaluation (as small as 10 shots per evaluation). Analysis of the L-dependence. The power of Theorem 2 is that it gives the scaling in terms of L. While one can substitute a function of n for L as we did in Corollary 2, one can also directly study the scaling with L (for fixed n). Figure 6 shows the 0 dependence on L when training CL for the autoencoder example = = with nA 1 and nB 10. As one can see, the training becomes more difficult as L increases. Specifically, as shown in the inset it appears to become exponentially more difficult, as the number of shots needed to achieve a fixed cost value grows exponentially with L. This is consistent with (and hence verifies) our bound on the variance in Theorem 2, which vanishes exponentially in Fig. 6 Local cost C0 versus number of iterations for the quantum L, although we remark that this behavior can saturate for very L large L9. autoencoder problem in Eqs. (25–26) with nB = 10. Each curve corresponds to a different number of layers L in the ansatz of Fig. 4 with L In summary, even though the ansatz employed in our fi = 2,…, 20. Curves were averaged over 9 instances of the autoencoder. As heuristics is beyond the scope of our theorems, we still nd the number of layers increases, the optimization becomes harder. Inset: cost-function-dependent barren plateaus, indicating that the cost- C0 : C0 : function dependent barren plateau phenomenon might be more Number of shots needed to reach cost values of L ¼ 0 02 and L ¼ 0 05 versus number of layers L.AsL increases the number of shots needed to general and go beyond our analytical results. reach the indicated cost values appears to increase exponentially. Discussion one can see in Fig. 5 that randomly initializing the parameters While scaling results have been obtained for classical neural 0 always leads to CG  1 due to the narrow gorge phenomenon (see networks45, very few such results exist for the trainability of Proposition 1), i.e., where the probability of being near the global parametrized quantum circuits, and more generally for quantum minimum vanishes exponentially with nB. neural networks. Hence, rigorous scaling results are urgently

NATURE COMMUNICATIONS | (2021) 12:1791 | https://doi.org/10.1038/s41467-021-21728-w | www.nature.com/naturecommunications 7 ARTICLE NATURE COMMUNICATIONS | https://doi.org/10.1038/s41467-021-21728-w needed for VQAs, which many researchers believe will provide Moreover, our numerics suggest that our theorems (which are the path to quantum advantage with near-term quantum com- stated for exact 2-designs) might be extendable in some form to puters. One of the few such results is the barren plateau theorem ansatzes composed of simpler blocks, like approximate 2- of ref. 9, which holds for VQAs with deep, hardware-efficient designs39. ansatzes. We emphasize that while our theorems are stated for a In this work, we proved that the barren plateau phenomenon hardware-efficient ansatz and for costs that are of the form (6), it extends to VQAs with randomly initialized shallow Alternating remains an interesting open question as to whether other ansatzes, Layered Ansatzes. The key to extending this phenomenon to cost function, and architectures exhibit similar scaling behavior as shallow circuits was to consider the locality of the operator O that that stated in our theorems. For instance, we have recently defines the cost function C. Theorem 1 presented a universal shown50 that our results can be extended to a more general type of upper bound on the variance of the gradient for global cost Quantum Neural Network called dissipative quantum neural functions, i.e., when O is a global operator. Corollary 1 stated the networks51. Another potential example of interest could be the asymptotic scaling of this upper bound for shallow ansatzes as unitary-coupled cluster (UCC) ansatz in chemistry52, which is being exponentially decaying in n, indicating a barren plateau. intended for use in the Oðpoly ðnÞÞ depth regime34. Therefore it is Conversely, Theorem 2 presented a universal lower bound on the important to study the key mathematical features of an ansatz that variance of the gradient for local cost functions, i.e., when O is a might allow one to go from trainability for Oðlog nÞ depth (which sum of local operators. Corollary 2 notes that for shallow ansatzes we guarantee here for local cost functions) to trainability for this lower bound decays polynomially in n. Taken together, these Oðpoly nÞ depth. two results show that barren plateaus are cost-function-depen- Finally, we remark that some strategies have been developed to dent, and they establish a connection between locality and mitigate the effects of barren plateaus32,33,53,54. While these trainability. methods are promising and have been shown to work in certain In the context of chemistry or materials science, our present cases, they are still heuristic methods with no provable guarantees work can inform researchers about which transformation to use that they can work in generic scenarios. Hence, we believe that when mapping a fermionic Hamiltonian to a Hamiltonian46, more work needs to be done to better understand how to prevent, i.e., Jordan-Wigner versus Bravyi–Kitaev47. Namely, the avoid, or mitigate the effects of barren plateaus. Bravyi–Kitaev transformation often leads to more local Pauli terms, and hence (from Corollary 2) to a more trainable cost Methods function. This fact was recently numerically confirmed48. In this section, we provide additional details for the results in the main text, as well Moreover, the fact that Corollary 2 is valid for arbitrary input as a sketch of the proofs for our main theorems. 
We note that the proof of Theorem 2 comes before that of Theorem 1 since the latter builds on the former. quantum states may be useful when constructing variational More detailed proofs of our theorems are given in the Supplementary Information. ansatzes. For example, one could propose a growing ansatz method where one appends log ðnÞ layers of the hardware- fi fi fi Variance of the cost function partial derivative.Letus rst discuss the formulas ef cient ansatz to a previously trained (hence xed) circuit. This we employed to compute Var[∂νC]. Let us first note that without loss of generality, θ could then lead to a layer-by-layer training strategy where the any block Wkl( kl) in the Alternating Layered Ansatz can be written as a product of ζ θ previously trained circuit can correspond to multiple layers of the kl independent gates from a gate alphabet A ¼fGμð Þg as fi ζ ν same hardware-ef cient ansatz. kl 1 W ðθ Þ¼Gζ ðθ Þ ¼ Gν ðθ Þ ¼ G ðθ Þ ; ð27Þ We remark that our definition of a global operator (local kl kl kl kl kl 1 kl θν θν θν where each kl is a continuous parameter. Here, Gν ð klÞ¼Rνð klÞQν where Qν is operator) is one that is both non-local (local) and many body (few ν θν σ = θ Ài kl ν 2 σ an unparametrized gate and Rν ð klÞ¼e with ν a Pauli operator. Note that body). Therefore, the barren plateau phenomenon could be due θ WkL denotes a block in the last layer of V( ). to the many-bodiness of the operator rather than the non-locality For the proofs of our results, it is helpful to conceptually break up the ansatz as θ of the operator; we leave the resolution of this question to future follows. Consider a block Wkl( kl) in the lth layer of the ansatz. For simplicity, θ work. On the other hand, our Theorem 1 rules out the possibility we henceforth use W to refer to a given Wkl( kl). Let Sw denote the m-qubit − that barren plateaus could be due to cardinality, i.e., the number subsystem that contains the qubits W acts on, and let Sw be the (n m) subsystem 49 on which W acts trivially. Similarly, let Hw and Hw denote the Hilbert spaces of terms in O when decomposed as a sum of Pauli products . θ associated with Sw and Sw, respectively. Then, as shown in Fig. 3a, V( ) can be Namely, case (ii) of this theorem implies barren plateaus for O of expressed as essentially arbitrary cardinality, and hence cardinality is not the θ 1 : key variable at work here. Vð Þ¼VRð w WÞVL ð28Þ 1 We illustrated these ideas for two examples VQAs. In Fig. 2,we Here, w is the identity on Hw, and VR contains the gates in the (forward) light- considered a simple state-preparation example, which allowed us cone L of W, i.e., all gates with at least one input qubit causally connected to the fi to delve deeper into the cost landscape and uncover another output qubits of W. The latter allows us to de ne SL as the subsystem of all qubits in L. phenomenon that we called a narrow gorge, stated precisely in Let us here recall that the Alternating Layered Ansatz can be implemented with Proposition 1. In Fig. 5, we studied the more important example either a 1D or 2D square connectivity as schematically depicted in Fig. 3c. We of quantum autoencoders, which have generated significant remark that the following results are valid for both cases as the light-cone structure interest in the quantum community. Our will be the same. Moreover, the notation employed in our proofs applies to both the 1D and 2D cases. 
Hence, there is no need to refer to the connectivity dimension in numerics showed the effects of barren plateaus: for more than 20 what follows. qubits we were unable to minimize the global cost function Let us now assume that θν is a parameter inside a given block W, we obtain introduced in11. To address this, we introduced a local cost from (6), (27), and (28) function for quantum autoencoders, which we were able to h ∂ i 1 ρ y 1 y ν C ¼ Tr ð w WBÞVL VLð w W B Þ minimize for system sizes of up to 100 qubits. 2 i ð29Þ There are several directions in which our results could be ´ 1 σ ; 1 y y 1 ; ½ w ν ð w W ÞV ROVRð w WAފ generalized in future work. Naturally, we hope to extend the A narrow gorge phenomenon in Proposition 1 to more general with VQAs. In addition, we hope in the future to unify our theorems 1 YνÀ1 Yζ θμ ; θμ : and 2 into a single result that bounds the variance as a function of WB ¼ Gμð Þ and WA ¼ Gμð Þ ð30Þ a parameter that quantifies the locality of O. This would further μ¼1 μ¼ν solidify the connection between locality and trainability.
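The structure of Eq. (7) (used again in Eqs. (27)–(30)) can be sanity-checked numerically. The sketch below (not from the paper; the surrounding gates are arbitrary stand-ins, and the exact grouping of the unparametrized part of G_ν into W_A or W_B follows the Methods) compares the analytic derivative $-\tfrac{i}{2}\,A\,\sigma_\nu\,e^{-i\theta\sigma_\nu/2}\,B$ of a block $W(\theta) = A\, e^{-i\theta\sigma_\nu/2}\, B$ against a finite-difference derivative:

```python
import numpy as np

rng = np.random.default_rng(6)

X = np.array([[0, 1], [1, 0]], dtype=complex)
I2 = np.eye(2, dtype=complex)
sigma_nu = np.kron(X, I2)     # illustrative Pauli generator: X on the first qubit of an m = 2 block

def haar(d):
    z = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
    q, r = np.linalg.qr(z)
    return q * (np.diag(r) / np.abs(np.diag(r)))

A, B = haar(4), haar(4)       # stand-ins for the gates after / before the trainable rotation

def rot(theta):
    # exp(-i theta sigma / 2) for a Pauli (sigma^2 = 1)
    return np.cos(theta / 2) * np.eye(4) - 1j * np.sin(theta / 2) * sigma_nu

def W(theta):
    return A @ rot(theta) @ B

theta0, eps = 0.7, 1e-6
finite_diff = (W(theta0 + eps) - W(theta0 - eps)) / (2 * eps)
analytic = A @ (-0.5j * sigma_nu) @ rot(theta0) @ B

print("max deviation:", np.abs(finite_diff - analytic).max())   # ~1e-10: the -(i/2) sigma_nu structure of Eq. (7)
```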


in those of W†. Then, this finite set is a t-design if38

$$\big\langle P_{(t,t)}(W)\big\rangle_{W} = \frac{1}{|Y|}\sum_{y\in Y} P_{(t,t)}(W_y) = \int d\mu(W)\, P_{(t,t)}(W)\,. \qquad (37)$$

From the general form of C in Eq. (6) we can see that the cost function is a polynomial of degree at most 2 in the matrix elements of each block W_kl in V(θ), and of degree at most 2 in those of (W_kl)†. Then, if a given block W forms a 2-design, one can employ the following elementwise formulas of the Weingarten calculus55,56 to explicitly evaluate averages over W up to the second moment:

$$\int d\mu(W)\, w_{ij}\, w^{*}_{i'j'} = \frac{\delta_{ii'}\delta_{jj'}}{2^{m}}\,, \qquad
\int d\mu(W)\, w_{i_1 j_1} w_{i_2 j_2}\, w^{*}_{i_1' j_1'} w^{*}_{i_2' j_2'} = \frac{\Delta_1}{2^{2m}-1} - \frac{\Delta_2}{\left(2^{2m}-1\right)2^{m}}\,, \qquad (38)$$

where w_{ij} are the matrix elements of W, and

$$\Delta_1 = \delta_{i_1 i_1'}\delta_{i_2 i_2'}\delta_{j_1 j_1'}\delta_{j_2 j_2'} + \delta_{i_1 i_2'}\delta_{i_2 i_1'}\delta_{j_1 j_2'}\delta_{j_2 j_1'}\,, \qquad
\Delta_2 = \delta_{i_1 i_1'}\delta_{i_2 i_2'}\delta_{j_1 j_2'}\delta_{j_2 j_1'} + \delta_{i_1 i_2'}\delta_{i_2 i_1'}\delta_{j_1 j_1'}\delta_{j_2 j_2'}\,. \qquad (39)$$
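The elementwise formulas (38)–(39) can be verified by direct Haar sampling. The following sketch (not part of the article) checks both the first- and second-moment expressions for m = 1 (so d = 2^m = 2, kept small for speed) against Monte-Carlo estimates; residual errors of order 10⁻³ come from the finite sample size.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(7)
d = 2   # 2^m with m = 1

def haar(d):
    z = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
    q, r = np.linalg.qr(z)
    return q * (np.diag(r) / np.abs(np.diag(r)))

samples = np.array([haar(d) for _ in range(100_000)])

# First moment, Eq. (38): <w_ij w*_i'j'> = delta_ii' delta_jj' / d
est1 = np.einsum('sij,skl->ijkl', samples, samples.conj()) / len(samples)
th1 = np.einsum('ik,jl->ijkl', np.eye(d), np.eye(d)) / d
print("first-moment max error :", np.abs(est1 - th1).max())

# Second moment, Eqs. (38)-(39): Monte Carlo vs Delta_1/(d^2-1) - Delta_2/((d^2-1) d)
def delta(a, b):
    return 1.0 if a == b else 0.0

max_err = 0.0
for i1, j1, i2, j2, k1, l1, k2, l2 in product(range(d), repeat=8):
    est = np.mean(samples[:, i1, j1] * samples[:, i2, j2]
                  * samples[:, k1, l1].conj() * samples[:, k2, l2].conj())
    D1 = delta(i1, k1) * delta(i2, k2) * delta(j1, l1) * delta(j2, l2) \
       + delta(i1, k2) * delta(i2, k1) * delta(j1, l2) * delta(j2, l1)
    D2 = delta(i1, k1) * delta(i2, k2) * delta(j1, l2) * delta(j2, l1) \
       + delta(i1, k2) * delta(i2, k1) * delta(j1, l1) * delta(j2, l2)
    th = D1 / (d ** 2 - 1) - D2 / ((d ** 2 - 1) * d)
    max_err = max(max_err, abs(est - th))
print("second-moment max error:", max_err)
```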

Intuition behind the main results. The goal of this section is to provide some Fig. 7 The block W is in the first layer of V(θ), and the operator Oi acts on intuition for our main results. Specifically, we show here how the scaling of the cost the topmost m qubits in the forward light-cone L of W. Dashed thick lines function variance can be related to the number of blocks we have to integrate to compute hÁ Á Ái ; , the locality of the cost functions, and with the number of layers indicate the backward light-cone of Oi. All but L − 1 blocks simplify to VR VL in the ansatz. identity in Ωqp of Eq. (34). First, we recall from Eq. (38) that integrating over a block leads to a coefficient of the order 1/22m. Hence, we see that the more blocks one integrates over, the Finally, from (29) we can derive a general formula for the variance: worse the scaling can be. We now generalize the warm-up example. Let V(θ) be a single layer of the mÀ1 2 θ = 2 Tr½σνŠ q0 p0 p0 q0 alternating ansatz of Fig. 3, i.e., V( ) is a tensor product of m-qubit blocks Wk: Var½∂νCŠ¼ ∑ hΔΩqp i hΔΨpq i ν 2 pq ð31Þ W , with k = 1, …, ξ (and with ξ = n/m), so that θ is in the block W 0 . In the ð22m À 1Þ VR VL k1 k p0q0 Supplementary Note 5 we generalize this scenario to the when the input state is not ji0 , but instead is an arbitrary state ρ. which holds if WA and WB form independent 2-designs. Here, the summation runs From (31), the partial derivative of the global cost function in (1) can be p q p0 q0 n−m fi over all bitstrings , , , of length 2 . In addition, we de ned expressed as hi Ω Ω Y 2 q0p0 Tr½ qpŠTr½ q0 p0 Š m m y ΔΩ Ω Ω ; ð32Þ Var½∂νC Š¼υ Trji 0 hj0 W ji0 hj0 W qp ¼ Tr½ qp q0 p0 ŠÀ m G k k ð40Þ 2 k≠k0 Wk

m 2 σ2 υ ð2 À1Þ Tr½ ν Š where ¼ 2 . From (40) we have that in order to compute (40) one needs Ψ Ψ 22m ð2mþ1À1Þ p0 q0 Tr½ pqŠTr½ p0q0 Š ξ − fi ΔΨpq ¼ Tr½ΨpqΨp0 q0 ŠÀ ; ð33Þ to integrate over 1 blocks. Then, since each integration leads to a coef cient 1/ 2m ξ 22m the variance will scale as Oð1=ð22mÞ À1 ¼Oð1=22nÞ. Hence, the scaling of the Ω Ψ variance gets worse for each block we integrate (such that the block acts on qubits where Trw indicates the trace over subsystem Sw, and qp and qp are operators on fi we are measuring). Hw de ned as hi On the other hand, for a local cost let us consider a single term in (3) where ~ y j 2 Sk, so that Ωqp ¼ Tr ðjip hj q 1 ÞV OV ; ð34Þ w w R R hi υ 2 m y Var½∂νC Š/ Tr ðji0 hj0 1 ÞW~ji0 hj0 W~ : L 2 j j k k ð41Þ n ~ hi Wk y Ψpq ¼ Tr ðjiq hjp 1 ÞV ρV : ð35Þ w w L L Here, in contrast to the global case, we only have to integrate over a single block irrespective of the total number of qubits. Hence, we now find that the variance We derive Eq. (31) in the Supplementary Note 4. scales as Oð1=n2Þ, where we remark that the scaling is essentially given by the prefactor 1/n2 in (3). fl Computing averages over V. Here we introduce the main tools employed to Let us now brie y provide some intuition as to why the scaling of local cost compute quantities of the form 〈…〉 . These tools are used throughout the proofs gradients becomes exponentially vanishing with the number of layers as in V θ of our main results. Theorem 2. Consider the case when V( ) contains L layers of the ansatz in Fig. 3. fi Let us first remark that if the blocks in V(θ) are independent, then any average Moreover, as shown in Fig. 7, let W be in the rst layer, and let Oi act on the m over V can be computed by averaging over the individual blocks, i.e., topmost qubits of L. As schematically depicted in Fig. 7, we now have to − h ¼ i ¼h¼ i ; ¼ ; ; ¼ ¼h¼ i ; ; . For simplicity let us first consider the integrating over L 1 blocks. Then, as proved in the Supplementary Note 5, V W11 Wkl VL W VR integrating over a block leads to a coefficient 2m/2/(2m + 1). Hence, after expectation value over a single block W in the ansatz. In principle 〈…〉 can be W − fi mðLÀ1Þ=2= m LÀ1 approximated by varying the parameters in W and sampling over the resulting integrating L 1 times, we obtain a coef cient 2 ð2 þ 1Þ which Ω = 2m ×2m unitaries. However, if W forms a t-design, this procedure can be simplified vanishes no faster than ðÞ1 poly ðnÞ if mL 2Oðlog ðnÞÞ. ∂ as it is known that sampling over its distribution yields the same properties as As we discuss below, for more general scenarios the computation of Var[ νC] sampling random unitaries from the unitary group with respect to the unique becomes more complex. normalized Haar measure. Explicitly, the Haar measure is a uniquely defined left and right-invariant Sketch of the proof of the main theorems. Here we present a sketch of the proof measure over the unitary group dμ(W), such that for any unitary matrix A ∈ U(2m) of Theorems 1 and 2. We refer the reader to the Supplementary Information for a and for any function f(W) we have detailed version of the proofs. Z Z As mentioned in the previous subsection, if each block in V(θ) forms a local 2- 〈…〉 μ μ design, then we can explicitly calculate expectation values W via (38). Hence, to d ðWÞf ðWÞ¼ d ðWÞf ðAWÞ p0q0 p0 q0 Uð2mÞ compute hΔΩqp i , and hΔΨpq i in (31), one needs to algorithmically integrate Z ð36Þ VR VL ¼ dμðWÞf ðWAÞ ; over each block using the Weingarten calculus. 
In order to make such computation tractable, we employ the tensor network representation of quantum circuits. For the sake of clarity, we recall that any two-qubit gate can be expressed as m ∑ where the integration domain is assumed to be U(2 ) throughout this work. U ¼ ijklUijkljiij hjkl , where Uijkl is a 2 × 2 × 2 × 2 tensor. Similarly, any block in the Consider a finite set fW g (of size ∣Y∣) of unitaries W , and let P (W)bean m m m m y y2Y y (t, t) ansatz can be considered as a 2 2 ´ 2 2 ´ 2 2 ´ 2 2 tensor. As schematically shown in i arbitrary polynomial of degree at most t in the matrix elements of W and at most t Fig. 8a, one can use the circuit description of Ωqp and Ψpq to derive the tensor


i Fig. 8 Tensor-network representations of the terms relevant to Var[∂νC]. a Representation of Ωqp of Eq. (34) (left), where the superscript indicates that i j O is replaced by Oi. In this illustration, we show the case of n = 2m qubits, and we denote Ppq ¼ jiq hjp . We also show the representation of Tr½ΩqpΩq0p0 Š Ωi Ωj (middle) and of Tr½ qpŠTr½ q0p0 Š (right). b By means of the Weingarten calculus we can algorithmically integrate over each blockR in the ansatz. After each dμ W Ωi Ωj Rintegration one obtains four new tensors according to Eq. (38). Here we show the tensor obtained after the integrations ð ÞTr½ qp q0p0 Š and dμ W Ωi Ωj ΔΩq0p0 ð ÞTr½ qpŠTr½ q0p0 Š, which are needed to compute h qp iV as in Eq. (31). c As shown in the Supplementary Note 1, the result of the integration is a 0 0 R sum of the form (49), where the deltas over p, q, p , and q arise from the contractions between Ppq and Pp0q0 .

i j i network representation of terms such as Tr½Ω Ω 0 0 Š. Here, Ω is obtained from show that qp q p qp DE (34) by simply replacing O with Oi. ∑ δ δ δ δ ΔΨp0 q0 ðp;qÞ ðp0 ;q0 Þ ðp;q0 Þ À ðp0;qÞ À pq In Fig. 8b we depict an example where we employ the tensor network pq Sþ Sþ S S V p0 q0 w w w w L Ωi Ωi Ωj Ωi Ωj  representation of qp to compute the average of Tr½ qp q0p0 Š, and Tr½ qpŠTr½ q0p0 Š. 1 ð44Þ As expected, after each integration one obtains a sum of four new tensor networks ¼ D ~ρÀ; Tr ½~ρÀŠ ; HS w 2m according to Eq. (38). VL fi ~ρÀ ~ρ ρ y where we de ne as the reduced states of ¼ VL VL in the Hilbert spaces Proof of Theorem 2 fi ∪ À – . Let us rst consider an m-local cost function C where O is associated with subsystems Sw Sw . Here we recall that DHS is the Hilbert Schmidt b ρ; σ ρ σ 2 given by (16), and where Oi acts nontrivially in a given subsystem Sk of S.In distance DHSðÞ¼Tr½ð À Þ Š. b fi particular, when Oi is of this form the proof is simpli ed, although the more By employing properties of DHS one can show (see Supplementary Note 6) 6 fi ÀÁ general proof is presented in the Supplementary Note 6. If Sk SL we nd 1 eρ ; 1 Ωi 1 À À DHS w 2m qp / w, and hence D ~ρ ; Tr ½~ρ Š ≥ ; ð45Þ HS w 2m 2mðLÀlþ2Þ=2 i j ½Ω Š ½Ω 0 0 Š ~ρ ~ρÀ i j Tr qp Tr q p where ¼ TrSÀ ½ Š. We can then leverage the tensor network representation of Ω Ω : ð42Þ w w Tr½ qp q0 p0 ŠÀ m ¼ 0 2 quantumÀÁ circuits to algorithmically integrate over each block in VL and compute eρ ; 1 fi b hDHS w 2m iV . One nds The latter implies that we only have to consider the operators O which act on L  i 1 qubits inside of the forward light-cone L of W. ~ρ ; ∑ ϵ ρ ; DHS ¼ tk;k0 ð ; 0 Þ w m ; 0 k k ð46Þ Then, as shown in the Supplementary Information 2 ðk k Þ2kL VL B *+ k0 ⩾k Ωi Ωi ml Tr½ qpŠTr½ q0 p0 Š ⩾ 2 ; 0 ϵ ρ fi Ωi Ωi ϵ b : with tk;k0 m 2l 8k k , and ð ; 0 Þ de ned in Theorem 2. Combining these results Tr½ qp q0 p0 ŠÀ / ðO Þ ð43Þ ð2 þ1Þ k k 2m i VR leads to Theorem 2. Moreover, as detailed in the Supplementary information, Theorem 2 is also valid when O is of the form (16). Here we remark that the proportionality factor contains terms of the form þ À δ δ 0 0 δ 0 δ 0 ∪ ðp;qÞ þ ðp ;q Þ þ ðp;q Þ À ðp ;qÞ À (where S S ¼ S ), which arises from the S S S S w w w Proof of Theorem 1. Let us now provide a sketch of the proof of Theorem 1, case w w w w q p b : b different tensor contractions of Ppq ¼ jihjin Fig. 8c. It is then straightforward to (i). Here we denote for simplicity Ok ¼ O1k. We leave the proof of case (ii) for the


Supplementary Note 7. In this case there are now operators Oi which act outside of Data availability the forward light-cone L of W. Hence, it is convenient to include in VR not only all Data generated and analyzed during the current study are available from the fi θ the gates in L but also all the blocks in the nal layer of V( ) (i.e., all blocks WkL, corresponding author upon reasonable request. = …ξ fi with k 1, ). We can de ne SL as the compliment of SL, i.e., as the subsystem of all qubits which are not in L (with associated Hilbert-space H ). Then, we L Received: 5 February 2020; Accepted: 5 February 2021; ¼ q p ¼ q p q p fi :¼ haveN VR V L VL andN L L, whereN we de ne VL q p : q p q p : q p 2 WkL, ¼ 2 , and ¼ 0 2 0 . Here, we k kL L k kL k L k kL k fi : : : : de ne kL ¼fk Sk  SLg and kL ¼fk Sk  SLg, which are the set of indices whose associated qubits are inside and outside LN, respectively. We alsoN write 1 ^ ^ fi ^ : b ^ : b O ¼ c0 þ c1OL O , where we de ne OL ¼ 2 Ok and O ¼ 0 2 Ok0 . L k kL L k kL Using the fact that the blocks in V(θ) are independent we can now compute References q0p0 q0p0 1. Preskill, J. in the NISQ era and beyond. Quantum 2,79 ΔΩqp ΔΩqp fi Ωpq h iV ¼h iV ;V . Then, from the de nition of in Eq. (34) and the R L L (2018). fact that one can always express *+2. McClean, J. R., Romero, J., Babbush, R. & Aspuru-Guzik, A. The theory of DE L L variational hybrid quantum-classical algorithms. New J. Phys. 18, 023023 Tr½ΩqpŠTr½Ωq0 p0 Š ΔΩq0 p0 ΩL ΩL (2016). qp ¼ Tr½ qp q0 p0 ŠÀ m VR 2 3. Farhi, E., Goldstone, J. & Gutmann, S. A quantum approximate optimization VL 0 1 ð47Þ Y algorithm. Preprint at https://arxiv.org/abs/1411.4028 (2014). ´ @ Ω A ; 4. Hadfield, S. et al. From the quantum approximate optimization algorithm to a hik WkL quantum alternating operator ansatz. Algorithms 12, 34 (2019). k2k L 5. Hastings, M. B. Classical and quantum bounded depth approximation with algorithms. Preprint at https://arxiv.org/abs/1905.07047 (2019). Âà 6. Kandala, A. et al. Hardware-efficient variational quantum eigensolver for ΩL ¼ Tr ðjip hjq 1 ÞV OLV qp hiL\w w L hiL small molecules and quantum magnets. Nature 549, 242 (2017). Ω ¼ Tr jip hjq Wy Ob W Tr jip0 hjq0 Wy Ob W 7. Arute, F. et al. Hartree-fock on a superconducting qubit quantum computer. k k kL k kL k kL k kL Science 369, 1084–1089 (2020a). 8. Harrigan, Matthew P. et al. Quantum approximate optimization of non-planar and where TrL\w indicates the partial trace over the Hilbert-space associated with graph problems on a planar superconducting processor. Nature Physics 1–5 the qubits in SL \ Sw. As detailed in the Supplementary Information we can use Eq. (38) to show that (2021). 9. McClean, J. R., Boixo, S., Smelyanskiy, V. N., Babbush, R. & Neven, H. Barren 2 δ δ δ δ 9 rk ðp;qÞ ðp0 ;q0Þ þ ðp;q0 Þ ðp0 ;qÞ plateaus in quantum neural network training landscapes. Nat. Commun. , Ω ⩽ Sk Sk Sk Sk : ð48Þ hik 4812 (2018). WkL 22m À 1 10. Peruzzo, A. et al. A variational eigenvalue solver on a photonic quantum On the other hand, as shown in the Supplementary Note 7 (and as processor. Nat. Commun. 5, 4213 (2014). schematically depicted in Fig. 8c), when computing the expectation value h ¼ i VL 11. Romero, J., Olson, J. P. & Aspuru-Guzik, A. Quantum autoencoders in (47), one obtains fi 2 *+ for ef cient compression of quantum data. Quant. Sci. Technol. , 045001 L L (2017). Tr½Ω ŠTr½Ω 0 0 Š ΩL ΩL qp q p ∑ ijΔ Lδ ; Tr½ qp q0 p0 ŠÀ ¼ tτ Oτ τ ð49Þ 12. Johnson, P. D., Romero, J., Olson, J., Cao, Y. & Aspuru-Guzik, A. 
QVECTOR: 2m τ VL an algorithm for device-tailored quantum error correction. Preprint at https:// fi δ δ δ δ δ R ∪ arxiv.org/abs/1711.02249 (2017). where we de ned τ ¼ ðp;qÞ ðp0;q0 Þ ðp;q0 Þ ðp0;qÞ , tτ 2 , Sτ Sτ ¼ SL \ Sw Sτ Sτ Sτ Sτ 13. Koczor, B., Endo, S., Jones, T., Matsuzaki, Y. & Benjamin, S. C. Variational- (with Sτ ≠ ;), and state . New J. Phys. 22, 083038 (2020). hi 14. Khatri, S. et al. Quantum-assisted quantum compiling. Quantum 3, 140 Δ L Oτ ¼ Trxτ yτ Trzτ ½ŠOi Trzτ ½OjŠ (2019). hi ð50Þ 15. Jones, T. & Benjamin, S. C. Quantum compilation and circuit Tr Tr ½ŠO Tr ½O Š xτ yτ zτ i yτ zτ j optimisation via energy dissipation. Preprint at https://arxiv.org/abs/ À : 2m 1811.03147 (2019). Here we use the notation Tr to indicate the trace over the Hilbert space associated 16. Sharma, K., Khatri, S., Cerezo, M. & Coles, P. J. Noise resilience of variational xτ quantum compiling. New J. Phys. 22, 043006 (2020a). with subsystem S , such that S ∪ S ∪ S ¼ S . As shown in the Supplementary xτ xτ yτ zτ L DE 17. Heya, K., Suzuki, Y., Nakamura, Y. & Fujii, K. Variational quantum gate p0 q0 Note 7, combining the deltas in Eqs. (48), and (49) with ΔΨpq leads to optimization. Preprint at https://arxiv.org/abs/1810.12745 (2018). VL ’ – 18. LaRose, R., Tikku, A., O Neel-Judy, É., Cincio, L. & Coles, P. J. Variational Hilbert Schmidt distances betweenÀÁ two quantumQ states as in (44). One can then 5 – L 2 quantum state diagonalization. npj Quant. Inf. ,1 10 (2018). use the following bounds D ρ ; ρ ≤ 2, ΔOτ ≤ r , and ∑τtτ ≤ 2, along with HS 1 2 k2kL k 19. Bravo-Prieto, C., García-Martín, D. & Latorre, J. Quantum singular value some additional simple algebra explained in the Supplementary Information to decomposer. Phys. Rev. A 101, 062310 (2020). obtain the upper bound in Theorem 1. 20. Li, Y. & Benjamin, S. C. Efficient variational incorporating active error minimization. Phys. Rev. X 7, 021050 (2017). Ansatz and optimization method. Here we describe the gradient-free optimiza- 21. Heya, K., Nakanishi, K. M., Mitarai, K. & Fujii, K. Subspace variational tion method used in our heuristics. First, we note that all the parameters in the quantum simulator. Preprint at https://arxiv.org/abs/1904.08566 (2019). ansatz are randomly initialized. Then, at each iteration, one solves the following 22. Cirstoiu, C. et al. Variational fast forwarding for quantum simulation beyond sub-space search problem: min Cðθ þ A Á sÞ, where A is a randomly generated the time. npj Quant. Inf. 6,1–10 (2020). d s2R 23. Otten, M., Cortes, C. L. & Gray, S. K. Noise-resilient using isometry, and s = (s , …, s ) is a vector of coefficients to be optimized over. We 1 d symmetry-preserving ansatzes. Preprint at https://arxiv.org/abs/1910.06284 used d = 10 in our simulations. Moreover, the training algorithm gradually (2019). increases the number of shots per cost-function evaluation. Initially, C is evaluated 24. Cerezo, M., Poremba, A., Cincio, L. & Coles, P. J. Variational quantum fidelity with 10 shots, and once the optimization reaches a plateau, the number of shots is estimation. Quantum 4, 248 (2020). increased by a factor of 3/2. This process is repeated until a termination condition 5 25. Carolan, J. et al. Variational quantum unsampling on a quantum photonic on the value of C is achieved, or until we reach the maximum value of 10 shots per 16.3 function evaluation. While this is a simple variable-shot approach, we remark that processor. Nature Physics 322-327 (2020). 
Finally, let us remark that while we employ a sub-space search algorithm, in the presence of barren plateaus all optimization methods will (on average) fail unless the algorithm has a precision (i.e., a number of shots) that grows exponentially with $n$. This is because an exponentially vanishing gradient implies that, on average, the cost-function landscape is essentially flat, with slopes of order $\mathcal{O}(1/2^n)$. Hence, unless one has a precision that can detect such small changes in the cost value, one will not be able to determine a cost-minimizing direction, whether with gradient-based methods or with black-box optimizers such as the Nelder-Mead method58-61.
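As a rough illustration of this precision argument (with all constants set to one, purely for intuition and not taken from the paper), the snippet below compares a barren-plateau slope of order $2^{-n}$ with the $1/\sqrt{N}$ statistical error of an $N$-shot cost estimate.

```python
# Shots needed so that the 1/sqrt(N) statistical error of a cost estimate falls
# below a cost-landscape slope of order 2^{-n}; constants set to 1 for illustration.
import numpy as np

for n in (10, 20, 30, 40):
    slope = 2.0 ** (-n)
    shots_needed = 1.0 / slope ** 2   # require 1/sqrt(N) < slope  =>  N > slope^{-2}
    print(f"n = {n:2d}:  slope ~ {slope:.1e},  shots needed ~ {shots_needed:.1e}")
```

Already at $n = 40$ one would need on the order of $10^{24}$ shots per evaluation to resolve such a slope, which is why both gradient-based and black-box optimizers become ineffective on a barren plateau.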


Data availability
Data generated and analyzed during the current study are available from the corresponding author upon reasonable request.

Received: 5 February 2020; Accepted: 5 February 2021

References
1. Preskill, J. Quantum computing in the NISQ era and beyond. Quantum 2, 79 (2018).
2. McClean, J. R., Romero, J., Babbush, R. & Aspuru-Guzik, A. The theory of variational hybrid quantum-classical algorithms. New J. Phys. 18, 023023 (2016).
3. Farhi, E., Goldstone, J. & Gutmann, S. A quantum approximate optimization algorithm. Preprint at https://arxiv.org/abs/1411.4028 (2014).
4. Hadfield, S. et al. From the quantum approximate optimization algorithm to a quantum alternating operator ansatz. Algorithms 12, 34 (2019).
5. Hastings, M. B. Classical and quantum bounded depth approximation algorithms. Preprint at https://arxiv.org/abs/1905.07047 (2019).
6. Kandala, A. et al. Hardware-efficient variational quantum eigensolver for small molecules and quantum magnets. Nature 549, 242 (2017).
7. Arute, F. et al. Hartree-Fock on a superconducting qubit quantum computer. Science 369, 1084-1089 (2020).
8. Harrigan, M. P. et al. Quantum approximate optimization of non-planar graph problems on a planar superconducting processor. Nat. Phys. 1-5 (2021).
9. McClean, J. R., Boixo, S., Smelyanskiy, V. N., Babbush, R. & Neven, H. Barren plateaus in quantum neural network training landscapes. Nat. Commun. 9, 4812 (2018).
10. Peruzzo, A. et al. A variational eigenvalue solver on a photonic quantum processor. Nat. Commun. 5, 4213 (2014).
11. Romero, J., Olson, J. P. & Aspuru-Guzik, A. Quantum autoencoders for efficient compression of quantum data. Quant. Sci. Technol. 2, 045001 (2017).
12. Johnson, P. D., Romero, J., Olson, J., Cao, Y. & Aspuru-Guzik, A. QVECTOR: an algorithm for device-tailored quantum error correction. Preprint at https://arxiv.org/abs/1711.02249 (2017).
13. Koczor, B., Endo, S., Jones, T., Matsuzaki, Y. & Benjamin, S. C. Variational-state quantum metrology. New J. Phys. 22, 083038 (2020).
14. Khatri, S. et al. Quantum-assisted quantum compiling. Quantum 3, 140 (2019).
15. Jones, T. & Benjamin, S. C. Quantum compilation and circuit optimisation via energy dissipation. Preprint at https://arxiv.org/abs/1811.03147 (2019).
16. Sharma, K., Khatri, S., Cerezo, M. & Coles, P. J. Noise resilience of variational quantum compiling. New J. Phys. 22, 043006 (2020).
17. Heya, K., Suzuki, Y., Nakamura, Y. & Fujii, K. Variational quantum gate optimization. Preprint at https://arxiv.org/abs/1810.12745 (2018).
18. LaRose, R., Tikku, A., O'Neel-Judy, É., Cincio, L. & Coles, P. J. Variational quantum state diagonalization. npj Quant. Inf. 5, 1-10 (2018).
19. Bravo-Prieto, C., García-Martín, D. & Latorre, J. Quantum singular value decomposer. Phys. Rev. A 101, 062310 (2020).
20. Li, Y. & Benjamin, S. C. Efficient variational quantum simulation incorporating active error minimization. Phys. Rev. X 7, 021050 (2017).
21. Heya, K., Nakanishi, K. M., Mitarai, K. & Fujii, K. Subspace variational quantum simulator. Preprint at https://arxiv.org/abs/1904.08566 (2019).
22. Cirstoiu, C. et al. Variational fast forwarding for quantum simulation beyond the coherence time. npj Quant. Inf. 6, 1-10 (2020).
23. Otten, M., Cortes, C. L. & Gray, S. K. Noise-resilient quantum dynamics using symmetry-preserving ansatzes. Preprint at https://arxiv.org/abs/1910.06284 (2019).
24. Cerezo, M., Poremba, A., Cincio, L. & Coles, P. J. Variational quantum fidelity estimation. Quantum 4, 248 (2020).
25. Carolan, J. et al. Variational quantum unsampling on a quantum photonic processor. Nat. Phys. 16, 322-327 (2020).
26. Arrasmith, A., Cincio, L., Sornborger, A. T., Zurek, W. H. & Coles, P. J. Variational consistent histories as a hybrid algorithm for quantum foundations. Nat. Commun. 10, 3438 (2019).
27. Bravo-Prieto, C. et al. Variational quantum linear solver. Preprint at https://arxiv.org/abs/1909.05820 (2019).
28. Xu, X. et al. Variational algorithms for linear algebra. Preprint at https://arxiv.org/abs/1909.03898 (2019).
29. Huang, H.-Y., Bharti, K. & Rebentrost, P. Near-term quantum algorithms for linear systems of equations. Preprint at https://arxiv.org/abs/1909.07344 (2019).
30. Cerezo, M. & Coles, P. J. Impact of barren plateaus on the Hessian and higher order derivatives. Preprint at https://arxiv.org/abs/2008.07454 (2020).
31. Arrasmith, A., Cerezo, M., Czarnik, P., Cincio, L. & Coles, P. J. Effect of barren plateaus on gradient-free optimization. Preprint at https://arxiv.org/abs/2011.12245 (2020).
32. Grant, E., Wossnig, L., Ostaszewski, M. & Benedetti, M. An initialization strategy for addressing barren plateaus in parametrized quantum circuits. Quantum 3, 214 (2019).
33. Verdon, G. et al. Learning to learn with quantum neural networks via classical neural networks. Preprint at https://arxiv.org/abs/1907.05415 (2019).
34. Lee, J., Huggins, W. J., Head-Gordon, M. & Whaley, K. B. Generalized unitary coupled cluster wave functions for quantum computation. J. Chem. Theory Comput. 15, 311-324 (2018).
35. Verdon, G. et al. Quantum graph neural networks. Preprint at https://arxiv.org/abs/1909.12264 (2019).
36. IBM Q: quantum devices and simulators. https://www.research.ibm.com/ibm-q/technology/devices/.
37. Arute, F. et al. Quantum supremacy using a programmable superconducting processor. Nature 574, 505-510 (2019).
38. Dankert, C., Cleve, R., Emerson, J. & Livine, E. Exact and approximate unitary 2-designs and their application to fidelity estimation. Phys. Rev. A 80, 012304 (2009).
39. Brandao, F. G. S. L., Harrow, A. W. & Horodecki, M. Local random quantum circuits are approximate polynomial-designs. Commun. Math. Phys. 346, 397-434 (2016).
40. Harrow, A. & Mehraban, S. Approximate unitary t-designs by short random quantum circuits using nearest-neighbor and long-range gates. Preprint at https://arxiv.org/abs/1809.06957 (2018).
41. Wan, K. H., Dahlsten, O., Kristjánsson, H., Gardner, R. & Kim, M. S. Quantum generalisation of feedforward neural networks. npj Quant. Inf. 3, 36 (2017).
42. Lamata, L. et al. Quantum autoencoders via quantum adders with genetic algorithms. Quant. Sci. Technol. 4, 014007 (2018).
43. Pepper, A., Tischler, N. & Pryde, G. J. Experimental realization of a quantum autoencoder: the compression of qutrits via machine learning. Phys. Rev. Lett. 122, 060501 (2019).
44. Verdon, G., Pye, J. & Broughton, M. A universal training algorithm for quantum deep learning. Preprint at https://arxiv.org/abs/1806.09729 (2018).
45. Pennington, J. & Bahri, Y. Geometry of neural network loss surfaces via random matrix theory. In Proceedings of the 34th International Conference on Machine Learning, Vol. 70, 2798-2806 (JMLR.org, 2017). http://proceedings.mlr.press/v70/pennington17a.html.
46. Tranter, A., Love, P. J., Mintert, F. & Coveney, P. V. A comparison of the Bravyi-Kitaev and Jordan-Wigner transformations for the quantum simulation of quantum chemistry. J. Chem. Theory Comput. 14, 5617-5630 (2018).
47. Bravyi, S. B. & Kitaev, A. Y. Fermionic quantum computation. Ann. Phys. 298, 210-226 (2002).
48. Uvarov, A., Biamonte, J. D. & Yudin, D. Variational quantum eigensolver for frustrated quantum systems. Phys. Rev. B 102, 075104 (2020).
49. Biamonte, J. Universal variational quantum computation. Preprint at https://arxiv.org/abs/1903.04500 (2019).
50. Sharma, K., Cerezo, M., Cincio, L. & Coles, P. J. Trainability of dissipative perceptron-based quantum neural networks. Preprint at https://arxiv.org/abs/2005.12458 (2020).
51. Beer, K. et al. Training deep quantum neural networks. Nat. Commun. 11, 1-6 (2020).
52. Bartlett, R. J. & Musiał, M. Coupled-cluster theory in quantum chemistry. Rev. Mod. Phys. 79, 291 (2007).
53. Volkoff, T. & Coles, P. J. Large gradients via correlation in random parameterized quantum circuits. Quant. Sci. Technol. 6, 025008 (2021).
54. Skolik, A. et al. Layerwise learning for quantum neural networks. Quantum Mach. Intell. 3, 1-11 (2021).
55. Collins, B. & Śniady, P. Integration with respect to the Haar measure on unitary, orthogonal and symplectic group. Commun. Math. Phys. 264, 773-795 (2006).
56. Puchała, Z. & Miszczak, J. A. Symbolic integration with respect to the Haar measure on the unitary groups. Bull. Pol. Acad. Sci. Tech. Sci. 65, 21-27 (2017).
57. Kübler, J. M., Arrasmith, A., Cincio, L. & Coles, P. J. An adaptive optimizer for measurement-frugal variational algorithms. Quantum 4, 263 (2020).
58. Nelder, J. A. & Mead, R. A simplex method for function minimization. Comput. J. 7, 308-313 (1965).
59. Paley, R. E. A. C. & Zygmund, A. A note on analytic functions in the unit circle. Math. Proc. Camb. Phil. Soc. 28, 266 (1932).
60. Fukuda, M., König, R. & Nechita, I. RTNI–a symbolic integrator for Haar-random tensor networks. J. Phys. A 52, 425303 (2019).
61. Nielsen, M. A. & Chuang, I. L. Quantum Computation and Quantum Information: 10th Anniversary Edition (Cambridge University Press, New York, NY, 2011).

Acknowledgements
We thank Jacob Biamonte, Elizabeth Crosson, Burak Sahinoglu, Rolando Somma, Guillaume Verdon, and Kunal Sharma for helpful conversations. All authors were supported by the Laboratory Directed Research and Development (LDRD) program of Los Alamos National Laboratory (LANL) under project numbers 20180628ECR (for M.C.), 20190065DR (for A.S., L.C., and P.J.C.), and 20200677PRD1 (for T.V.). M.C. and A.S. were also supported by the Center for Nonlinear Studies at LANL. P.J.C. acknowledges initial support from the LANL ASC Beyond Moore's Law project. This work was also supported by the U.S. Department of Energy (DOE), Office of Science, Office of Advanced Scientific Computing Research, under the Quantum Computing Application Teams program.

Author contributions
The project was conceived by M.C., L.C., and P.J.C. The manuscript was written by M.C., A.S., T.V., L.C., and P.J.C. T.V. proved Proposition 1. M.C. and A.S. proved Proposition 2 and Theorems 1-2. M.C., A.S., T.V., and P.J.C. proved Corollaries 1-2. M.C., A.S., T.V., L.C., and P.J.C. analyzed the quantum autoencoder. For the numerical results, T.V. performed the simulation in Fig. 2, and L.C. performed the simulation in Fig. 5.

Competing interests
The authors declare no competing interests.

Additional information
Supplementary information The online version contains supplementary material available at https://doi.org/10.1038/s41467-021-21728-w.

Correspondence and requests for materials should be addressed to M.C. or P.J.C.

Peer review information Nature Communications thanks the anonymous reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available.

Reprints and permission information is available at http://www.nature.com/reprints

Publisher's note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

© The Author(s) 2021
