
Learning quantum many-body systems from a few copies

Cambyse Rouzé¹,* and Daniel Stilck França²,†
¹ Department of Mathematics, Technische Universität München, 85748 Garching, Germany
² QMATH, Department of Mathematical Sciences, University of Copenhagen, Denmark
* [email protected]  † [email protected]
(Dated: August 25, 2021)

Estimating physical properties of quantum states from measurements is one of the most fundamental tasks in quantum science. In this work, we identify conditions on states under which it is possible to infer the expectation values of all quasi-local observables of a given locality up to a relative error from a number of samples that grows polylogarithmically with the system's size and polynomially in the locality of the target observables. This constitutes an exponential improvement over known tomography methods in some regimes. We achieve our results by combining one of the most well-established techniques to learn quantum states, the maximum entropy method, with tools from the emerging fields of quantum optimal transport and classical shadows. We conjecture that our condition holds for all states exhibiting some form of decay of correlations, and we establish it for several subsets thereof. These include widely studied classes of states such as one-dimensional thermal states, high-temperature Gibbs states of local commuting Hamiltonians on arbitrary hypergraphs, and outputs of shallow circuits. Moreover, we show improvements of the maximum entropy method beyond the sample complexity that are of independent interest. These include identifying regimes in which it is possible to perform the postprocessing efficiently, as well as novel bounds on the condition number of covariance matrices of many-body states.

I. INTRODUCTION

The subject of quantum tomography has as its goal devising methods for the efficient classical description of a quantum system from access to experimental data. However, all tomographic methods for general quantum states inevitably require resources that scale exponentially in the size of the system [1, 2], be it in terms of the number of samples required or the post-processing needed to perform the task.

Fortunately, most physically relevant quantum systems can be described in terms of a (quasi-)local structure. These range from that of a local interaction Hamiltonian corresponding to a finite temperature Gibbs state to that of a shallow quantum circuit. Hence, locality is a physically motivated requirement that brings the number of parameters describing the system down to a tractable level. Effective tomographic procedures should be able to incorporate this information. And, indeed, many protocols in the literature achieve a good recovery guarantee in trace distance from a number of copies that scales polynomially with system size [3–8].

Furthermore, in many cases one is interested in learning only physical properties of the state on which tomography is being performed. These are often encoded into the expectation values of quasi-local observables that only depend on reduced density matrices of subregions of the system. By Helstrom's theorem, obtaining a good recovery guarantee in trace distance is equivalent to demanding that the expectation values of all bounded observables are close for the two states. It is in turn desirable to design tomographic procedures that exploit the fact that we only wish to approximate quasi-local observables, instead of demanding a recovery in trace distance, and some methods in the literature take advantage of that. For instance, the overlapping tomography or classical shadows methods of [9–12] allow for approximately learning all k-local reduced density matrices of an n-qubit state with failure probability δ using O(e^{ck} k log(nδ⁻¹) ε⁻²) copies without imposing any assumptions on the underlying state. This constitutes an exponential improvement in the system size compared to the previously mentioned many-body setting, at the expense of an undesirable exponential dependency on the locality of the observables. Unfortunately, without making any assumptions on the underlying states, this exponential scaling is also unavoidable [9].

In light of the previous discussion, it is natural to ask if, by combining the stronger structural assumptions required for many-body tomography with techniques like classical shadows, it is possible to obtain the best of both worlds: a sample complexity that is logarithmic in system size and polynomial in the locality of the observables we wish to estimate.

In this work, we achieve precisely this for a large class of physically motivated states by combining recent techniques from quantum optimal transport [13–18], the well-established maximum entropy method to learn quantum states [19], and the method of classical shadows. This results in an exponential improvement over known methods of many-body tomography [3–8] and recent shadow tomography or overlapping tomography techniques [9–12], as summarized in Table I.

Our revisit of the maximum entropy principle is further motivated by recent breakthroughs in Hamiltonian learning [5, 20], shadow tomography [9], the understanding of correlations and computational complexity of quantum Gibbs states [21–25], and quantum functional inequalities [18, 26] that shed new light on this seasoned technique. Examples where we obtain exponential improvements include thermal states of 1D systems, high-temperature thermal states of commuting Hamiltonians on arbitrary hypergraphs, and outputs of shallow circuits. Furthermore, based on results by [24], we conjecture that our results should hold for any high-temperature Gibbs state, even with long-range interactions, and give evidence in this direction.

The main ingredient for our improvements are so-called transportation cost (TC) inequalities [27]. They allow us to bound by how much the expectation values of two states on Lipschitz observables, a concept we will review shortly, differ in terms of their relative entropy. Such inequalities constitute a powerful tool from the theory of optimal transport [28] and are traditionally used to prove sharp concentration inequalities [29, Chapter 3]. Moreover, they have recently been extended to quantum states [14, 16, 18]. By combining such inequalities with the maximum entropy principle, we are able to easily control the relative entropy between the states and, thus, the difference of expectation values of Lipschitz observables.

Before we summarize our contributions in more detail, we first define and review the main concepts required for our results, namely Lipschitz observables, transportation cost inequalities and the maximum entropy principle.

A. Lipschitz observables

In the classical setting, given a metric d on a sample space S, the regularity of a function f : S → R can be quantified by its Lipschitz constant [29, Chapter 3]

  ‖f‖_Lip = sup_{x,y∈S} |f(x) − f(y)|/d(x, y).  (1)

For instance, if we consider functions on the n-dimensional hypercube {−1, 1}ⁿ endowed with the Hamming distance, the Lipschitz constant quantifies by how much a function can change per flipped bit. It should then be clear that physical quantities like the average magnetization have a small Lipschitz constant. Some recent works [14, 16] extended this notion to the noncommutative setting. Although the definition put forth in [14] and the approach in [16] are not equivalent, they both recover the intuition discussed previously. E.g., for the approach followed in [16], the Lipschitz constant of an observable on n qudits is defined as [30]

  ‖O‖_{Lip,□} := √n max_{1≤i≤n} max_{ρ,σ∈D_{dⁿ}: tr_i[ρ]=tr_i[σ]} tr[O(ρ − σ)],  (2)

where D_{dⁿ} denotes the set of n-qudit states. That is, ‖O‖_{Lip,□} quantifies the amount by which the expectation value of O changes for states that are equal when tracing out one site. It is clear that ‖O‖_{Lip,□} ≤ 2√n ‖O‖_∞ always holds by Hölder's inequality, but it can be the case that ‖O‖_{Lip,□} ≪ √n ‖O‖_∞. For instance, consider for some k > 0 the n-qubit observable O = Σ_{i=1}ⁿ ⊗_{j=i}^{i+k} Z_j, where for each site j, Z_j denotes the Pauli observable Z acting on site j and we take addition modulo n. It is not difficult to see that ‖O‖_{Lip,□} = 2k√n, while ‖O‖_∞ = n. We refer to the discussion in Fig. 1 for another example.

Moreover, one can show that shallow local circuits or short-time local dynamics satisfying a Lieb-Robinson bound cannot substantially increase the Lipschitz constant of an observable evolved in the Heisenberg picture. That is, if Φ_t* is the quantum channel that describes some local dynamics at time/depth t in the Heisenberg picture and it satisfies a Lieb-Robinson bound, then we have

  ‖Φ_t*(O)‖_{Lip,□} = O(e^{vt} ‖O‖_{Lip,□}),

where v denotes the Lieb-Robinson velocity. This result is discussed in more detail in Section B 1 of the supplemental material. Thus, averages over local observables and short-time evolutions thereof all belong to the class of observables that have a small Lipschitz constant when compared to generic observables.
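To make Eq. (1) concrete, the following brute-force computation (our own toy illustration, not part of the original paper) evaluates the Lipschitz constant on the hypercube for two functions with the same sup-norm: the average magnetization, which changes by at most 2/n per flipped bit, and the parity function, whose value flips sign under a single bit flip.

```python
# Brute-force evaluation of Eq. (1) on the hypercube {-1, 1}^n with the
# Hamming distance. Toy illustration only: functions and size are our choice.
import itertools
import numpy as np

def lipschitz_constant(f, n):
    """sup_{x != y} |f(x) - f(y)| / d_H(x, y) by exhaustive search."""
    cube = [np.array(x) for x in itertools.product([-1, 1], repeat=n)]
    best = 0.0
    for x in cube:
        for y in cube:
            d = np.sum(x != y)  # Hamming distance between spin configurations
            if d > 0:
                best = max(best, abs(f(x) - f(y)) / d)
    return best

n = 6
magnetization = lambda x: np.mean(x)  # (1/n) sum_i x_i, a "physical" quantity
parity = lambda x: float(np.prod(x))  # global function with the same sup-norm

print(lipschitz_constant(magnetization, n))  # 2/n ~ 0.33: small, as expected
print(lipschitz_constant(parity, n))         # 2.0: n times larger
```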

Structure                  | Assumptions on state       | Assumptions on observable | Samples
Many-body tomography [5]   | many-body                  | none                      | poly(n, ε⁻¹)
Classical shadows [9]      | none                       | k-local                   | poly(e^{ck}, log(n), ε⁻¹)
This work                  | many-body + transportation | k-local                   | poly(k, log(n), ε⁻¹)

TABLE I. Summary of the underlying assumptions and sample complexities of different approaches to performing tomography on quantum many-body states.

Another notion of a Lipschitz constant was put forth in [14]. Their definition is of a more geometric flavour and is connected to so-called differential structures related to a quantum state [13], but has the downside of being state-dependent. We will exemplify it for classical states. Let a = |1⟩⟨0| and let a_i act like a on qubit i and as the identity elsewhere. Then for classical states a differential Lipschitz constant can be defined as

  ‖O‖²_{Lip,∇} := Σ_i ‖[a_i, O]‖²_∞ + Σ_i ‖[a_i†, O]‖²_∞.

It is not difficult to see that this definition exhibits a scaling for local observables similar to that of Eq. (2). We refer to Sec. B 1 b of the supplemental material for more details.

The previous definitions allow us to quantify the regularity of all observables, not only the strictly local ones considered so far. However, it will also be fruitful to explicitly look at quasi-local observables [24]. Given a graph V of cardinality n corresponding to a system of n qudits with Hilbert space H^{⊗n}, an observable X ∈ M_{dⁿ} is said to be (k, g) quasi-local for k ∈ N and g > 0 if it can be decomposed as

  X = Σ_{|A|≤k} X_A with Σ_{A∋v} ‖X_A‖_∞ ≤ g for all v ∈ V,  (3)

i.e., each X_A acts on at most k systems and the norm of the terms acting on any given system is bounded by g.

Once we are given a Lipschitz constant on observables or a set of quasi-local observables, we can define a Wasserstein-1 distance on states by duality [14, 16]. The latter quantifies how well we can distinguish two states by their action on certain regular or local observables and is given by

  W_{1,Λ}(ρ, σ) := sup_{O∈L_Λ} tr[O(ρ − σ)].  (4)

Here we will consider L_Λ = {O = O†, ‖O‖_{Lip,Λ} ≤ 1} for ‖·‖_{Lip,Λ} ∈ {‖·‖_{Lip,□}, ‖·‖_{Lip,∇}}, and L_{Loc,(k,g)} as the set of (k, g) quasi-local observables as in Eq. (3).

Definition (4) is in direct analogy with the definition of the trace distance, where we substituted the optimization over bounded observables by one over Lipschitz ones. We refer to Sec. B 1 for a thorough discussion of all these definitions. We remark that W_{1,Loc} corresponds to the weakest of them, as it can only capture information contained in k-body reduced density matrices, while the other definitions also capture global information. On the other hand, W_{1,Loc} allows for a more straightforward analysis, which will allow us to obtain results for high-temperature Hamiltonians even with long-range interactions.

B. Transportation cost inequalities

The previous paragraphs motivated the idea that observables with a small Lipschitz constant capture most physically relevant observables and, thus, that controlling the Wasserstein distance between two states gives rise to a more physically motivated distance measure than the trace distance. However, it is a priori not clear how to effectively control the Wasserstein distance between states. In this work, we will achieve this by relating Wasserstein distances to the relative entropy between two states, D(ρ‖σ) := tr[ρ(log(ρ) − log(σ))], for σ of full rank. This can be achieved through the notion of a transportation cost inequality: an n-qudit state σ is said to satisfy a transportation cost inequality with constant α > 0 if the Wasserstein distance of σ to any other state ρ can be controlled by their relative entropy, i.e.

  W_{1,Λ}(ρ, σ) ≤ √(D(ρ‖σ)/(2α))  (5)

holds for all states ρ ∈ D_{dⁿ}. Such inequalities are particularly powerful whenever the constant α does not depend on the system size n, or does so at most inverse polylogarithmically, and they can be thought of as a strengthening of Pinsker's inequality.

Transportation cost inequalities are closely related to the notion of Gaussian concentration [14, 16, 31], i.e. that Lipschitz functions strongly concentrate around their mean. Establishing analogues of such concentration inequalities for quantum many-body systems has been a fruitful line of research in recent years, and they are related to fundamental questions in statistical physics, see e.g. [24, 32–35]. Although we are certain that inequalities like Eq. (5) also shed new light on this matter, here we will focus on their application to learning a classical description of a state through maximum entropy methods. We refer to Table II for a summary of classes of states known to satisfy them, as discussed in more detail below.

Recent works have established transportation cost inequalities with α either constant or logarithmic in system size for several classes of Gibbs states of commuting Hamiltonians [16, 18, 26]. They are essentially known to hold for local Hamiltonians on arbitrary hypergraphs at high enough temperature or in 1D. Moreover, these works obtain the inequality for the stronger ‖·‖_{Lip,□} and ‖·‖_{Lip,∇} notions of Lipschitz constant. In this work we enlarge the class of examples by establishing such inequalities for outputs of short-depth circuits. Moreover, building upon the results of [24], we show that they also hold for high-temperature Gibbs states, even in the presence of long-range interactions. We refer to Section B 1 c for a precise definition and proofs. However, we only achieve this result for the ‖·‖_{Loc,(k,g)} definition and for inverse temperatures β = O((kg)⁻¹). Although nontrivial, such an inequality still has the undesired properties of only applying to strictly local observables and of the inverse temperature depending on the locality. Despite these shortcomings, we believe that this preliminary result allows for conjecturing that a transportation cost inequality holds for such systems for the stronger definitions of the Lipschitz constant and for all Gibbs states above a threshold temperature that is independent of the locality of the observables. Establishing such an inequality would allow us to apply our results to an even broader range of physical systems.

C. Maximum-entropy methods

Let us now show how transportation cost inequalities can be combined with maximum entropy methods. Such methods depart from the assumption that we are given a set of linearly independent, self-adjoint observables on an n-qudit system, E_1, ..., E_m ∈ M_{dⁿ}^{sa}, with ‖E_i‖_∞ ≤ 1, a maximal inverse temperature β > 0 and the promise that the state we wish to learn can be expressed as

  σ ≡ σ(λ) = exp(−β Σ_{i=1}^m λ_i E_i)/Z(λ),  (6)

where λ ∈ Rᵐ with sup norm ‖λ‖_{ℓ∞} ≤ 1 and

  Z(λ) = tr[exp(−β Σ_{i=1}^m λ_i E_i)].

Denoting by e(λ) the vector with components e_i(λ) = tr[E_i σ(λ)], the crux of the maximum entropy method is that λ is the unique optimizer of

  min_{µ∈Rᵐ: ‖µ‖_{ℓ∞}≤1} log(Z(µ)) + β Σ_{i=1}^m µ_i e_i(λ),  (7)

which gives us a variational principle to learn the state given e(λ). We refer to Sec. A for a discussion of the maximum entropy principle and its properties.

Typical examples of observables E_i are, e.g., all 2-local Pauli observables corresponding to the edges of a known graph. This models the situation in which we know the state to be the thermal state of a Hamiltonian with a known locality structure. More generally, for most of the examples discussed here the E_i are given as follows: we depart from a hypergraph G = (V, E) and assume that there is a maximum radius r_0 ∈ N such that, for any hyperedge A ∈ E, there exists a vertex v ∈ V such that the ball B(v, r_0) centered at v and of radius r_0 includes A. The E_i are then given by a basis of the traceless matrices on each hyperedge A. This definition captures the notion of a local Hamiltonian w.r.t. the hypergraph.

This framework also encompasses pure states after making an appropriate approximation. For instance, in this article we will also consider the outputs of shallow quantum circuits, and we believe our framework extends to unique ground states of gapped Hamiltonians, which are pure. To see how to perform the approximation for circuits, suppose that |ψ⟩ = U|0⟩^{⊗n} is the output of an unknown shallow circuit U of depth L with respect to a known interaction graph (V, E) on n vertices. That is,

  U = ∏_{ℓ∈[L]} ⊗_{e∈E_ℓ} U_{ℓ,e},

where each E_ℓ ⊂ E is a subset of non-intersecting edges. That is, in this setting the locality of the circuit is known, but the underlying unitary is not. As we then show in Theorem C.1, the state |ψ⟩ is ε²-close in Wasserstein distance to the Gibbs state σ corresponding to the Hamiltonian with local terms U Z_i U† at inverse temperature β = log(ε⁻¹). By a simple light-cone argument we can bound the support of each U Z_i U†, since we know the underlying structure of the circuit. We then show in Thm. C.1 that it is indeed possible to efficiently learn the outputs of such circuits as long as the support of each time-evolved Z_i is at most logarithmic in system size.

We believe that our framework should also extend to ground states of gapped Hamiltonians; however, such a statement would still require us to refine our bounds. More precisely, it is known that ground states of local Hamiltonians are well approximated in trace distance by Gibbs states whose inverse temperature scales like O(log(nε⁻¹)) [36] under mild assumptions. Thus, obtaining a TC inequality for such Gibbs states would suffice for our purposes. However, as known TC inequalities have an exponential dependency on the inverse temperature, this scaling would hinder an exponential speedup in the sample complexity, and just applying Pinsker's inequality would give a bound that is just as good. Nevertheless, results like [37] assert that there exists a state that approximates all k-local reduced density matrices of the ground state up to ε and is the output of a circuit whose depth depends only on k. Based on the discussion above on shallow circuits, we believe that inverse temperatures scaling like O(log(ε⁻¹)) suffice to obtain a good approximation in Wasserstein distance, which would let us surpass the problem of the scaling of the TC constants with the temperature. Working out these and other details required for gapped systems will be the subject of future work, and we refer to Sec. C 1 for a more thorough discussion.

We see from Eq. (7) that the expectation values of the E_i completely characterize the state σ(λ). But it is possible to obtain a more quantitative version of this statement through the following identity, also observed in [5]:

  D(σ(µ)‖σ(λ)) + D(σ(λ)‖σ(µ)) = −β⟨λ − µ | e(λ) − e(µ)⟩.  (8)

In addition to showing that if e(λ) = e(µ) then σ(µ) = σ(λ), Eq. (8) implies that by controlling how well the local expectation values of one state approximate another, we can also easily control their relative entropies. In particular, if m = O(n) and ‖e(µ) − e(λ)‖_{ℓ1} = O(εn) for some ε > 0, we obtain from an application of Hölder's inequality that

  D(σ(µ)‖σ(λ)) ≤ D(σ(µ)‖σ(λ)) + D(σ(λ)‖σ(µ)) = O(βεn).  (9)

We refer to Section A for more details. Thus, if we can find a state that approximates the expectation value of each E_i up to ε, we are guaranteed to have a relative entropy density of O(βε). This observation is vital to ensure that the maximum entropy principle still yields a good estimate of the state even under some noise in the vector of expectation values e(λ). Indeed, the variational principle of Eq. (7) would allow us to recover the state exactly if we had access to the exact values of e(λ). However, it turns out that solving Eq. (7) with some estimate ê(λ) such that ‖ê(λ) − e(λ)‖_{ℓ∞} ≤ ε still yields a Gibbs state σ(µ) satisfying Eq. (9). Solving the maximum entropy problem is a strictly convex optimization problem. Thus, it can be solved efficiently with access to the gradient of the target function, which turns out to be proportional to e(λ) − e(µ), where µ is the current guess for the optimum. Although we will discuss the details of solving the problem later in Sec. A, in a nutshell the maximum entropy problem can be solved efficiently if it is possible to efficiently compute expectation values of the observables E_i on the family of Gibbs states under consideration.
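The objects entering Eqs. (6) and (7) are straightforward to instantiate numerically for systems small enough for exact matrix exponentials. The sketch below (our own toy setup; the chain length, the nearest-neighbour ZZ choice of E_i and the couplings are illustrative, not from the paper) builds σ(λ) and the data vector e(λ) that any maximum entropy learner has to match.

```python
# Toy instantiation of Eq. (6): sigma(lambda) for 2-local ZZ observables on a
# small chain, with the expectation vector e(lambda) computed exactly.
# Sizes and couplings are our own choices.
import numpy as np
from scipy.linalg import expm
from functools import reduce

I2 = np.eye(2)
Z = np.diag([1.0, -1.0])

def zz_term(n, i):
    """E_i = Z_i Z_{i+1} embedded into an n-qubit chain."""
    ops = [I2] * n
    ops[i], ops[i + 1] = Z, Z
    return reduce(np.kron, ops)

def gibbs(E, lam, beta):
    """sigma(lambda) = exp(-beta sum_i lambda_i E_i) / Z(lambda)."""
    rho = expm(-beta * sum(l * Ei for l, Ei in zip(lam, E)))
    return rho / np.trace(rho)

n, beta = 4, 1.0
E = [zz_term(n, i) for i in range(n - 1)]     # m = n - 1 known observables
lam = np.random.uniform(-1, 1, size=len(E))   # ||lambda||_inf <= 1
sigma = gibbs(E, lam, beta)
e_lam = np.array([np.trace(Ei @ sigma).real for Ei in E])
print(e_lam)  # the expectation values e(lambda) that characterize sigma(lambda)
```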

D. Combining TC with the maximum entropy principle

Suppose now that each of the E_i acts on at most l_0 qudits. Then, by using e.g. the method of classical shadows, we can estimate the expectation values of all E_i up to ε with failure probability at most δ > 0 from O(4^{l_0} ε⁻² log(mδ⁻¹)) samples. From our discussion above, we see that this is enough to obtain a state σ(µ) satisfying Eq. (9). Further assuming that we have a TC inequality with some constant α > 0 for σ(λ), we conclude that

  |tr[O(σ(λ) − σ(µ))]| ≤ ‖O‖_Lip W_1(σ(λ), σ(µ)) ≤ ‖O‖_Lip √(D(σ(µ)‖σ(λ))/(2α)) = O(√(βεn) ‖O‖_Lip).

Finally, recall that for sums of k-local operators on a 2D lattice as in Fig. 1, where we have k = L², the Lipschitz constant satisfies ‖O‖_Lip = O(√n), and we require a precision of O(εn/k) to obtain a relative error of ε. Putting all these elements together, we conclude that by setting ε = ε̃²/(βk²) for some ε̃ > 0 we arrive at

  |tr[O(σ(λ) − σ(µ))]| = O(ε̃ k⁻¹ n),  (10)

which constitutes a relative error for the expectation value. In particular, we see that the sample complexity required to obtain this error is

  O(4^{l_0} k⁴ β² ε̃⁻⁴ log(m)).  (11)

We then obtain:

Theorem I.1 (Learning Gibbs states). Let σ(λ) be a Gibbs state as defined in Eq. (6) such that each E_i acts on at most l_0 qudits. Moreover, suppose that σ(λ) satisfies a TC inequality with α depending at most inverse logarithmically on system size. Then with probability of success 1 − δ we can obtain a state σ(µ) such that for all observables O ∈ M_{dⁿ}

  |tr[O(σ(λ) − σ(µ))]| ≤ O(ε√n ‖O‖_Lip)  (12)

from O(4^{l_0} β² poly(ε⁻¹, log(mδ⁻¹))) samples of σ(λ). Moreover, if it is possible to compute the expectation values of the E_i on σ(µ), then the postprocessing can also be done in polynomial time.

This gives an exponential speedup in the sample complexity over known results for the tomography of many-body states [3–5, 7, 38], which have a polynomial sample complexity to learn all k-body density matrices. Theorem I.1 also provides an exponential improvement over shadow techniques in the locality of the observables. However, unlike our methods, shadow techniques do not need to make any assumptions on the underlying states.

We also remark that it is possible to improve the scaling in accuracy in Eq. (11) from ε⁻⁴ to the expected ε⁻². In order to do that, it is important to bound the condition number of the Hessian of the log-partition function. That is, we need a bound of the form

  LI ≤ ∇² log(Z(µ)) ≤ UI  (13)

for constants L, U > 0 and all µ ∈ B_{ℓ∞}(0, 1). We refer to Sec. A of the supplemental material for a thorough discussion. We note that in [5] the authors show such results in a more general setting, although with U, L polynomial in n, which is not sufficient for our purposes. For us it will be important to ensure that the condition number of the log-partition function is at most polylogarithmic in system size (i.e. L⁻¹U = Õ(1)). It then follows that

  ‖λ − µ‖_{ℓ2} ≤ L⁻¹β ‖e(λ) − e(µ)‖_{ℓ2}.

That is, whenever the expectation values are close, the underlying parameters must be close as well. In this case, we have from ‖e(λ) − e(µ)‖_{ℓ2} = O(ε√m) that

  −β⟨λ − µ | e(λ) − e(µ)⟩ ≤ β‖λ − µ‖_{ℓ2} ‖e(λ) − e(µ)‖_{ℓ2} = O(L⁻¹β²ε²m).  (14)

As we will see later, we have that L⁻¹ = O(e^{β}β⁻²) for high-temperature commuting Hamiltonians. The quadratic improvement in ε in Eq. (14) then yields the expected ε⁻² scaling in the accuracy. For commuting Hamiltonians we further show in Prop. F.1 and Lemma F.1 of the supplementary material that:

Proposition I.1 (Strengthened strong convexity constant for commuting Hamiltonians). For each µ ∈ B_{ℓ∞}(0, 1), let σ(µ) be a Gibbs state at inverse temperature β corresponding to the commuting Hamiltonian H(µ) = Σ_j µ_j E_j on the hypergraph G = (V, E), where tr[E_i E_j] = 0 for all i ≠ j and each local operator E_j is traceless on its support. Then for β such that the states σ(µ) satisfy exponential decay of correlations, the Hessian of the log-partition function is bounded as

  O(β²)I ≥ ∇² log Z(µ) ≥ Ω(β² e^{−cβ})I  (15)

for some constant c > 0.

After the completion of the first version of this work, Haah, Kothari and Tang proved in [20, Corollary 4.4] strong convexity bounds for high-temperature, (not necessarily geometrically) local Hamiltonians. More precisely, for β = O(k⁻⁸), where k is the maximal number of qudits each term acts on, they show that L⁻¹ ≤ 2β⁻². Although the result in Eq. (15) has the advantage of giving an estimate at any temperature, we see that strong convexity also holds for noncommuting Gibbs states at high enough temperatures.

FIG. 1. Example of an observable O = Σ_i O_i for a 2D lattice system of size n. Each O_i is supported on an L × L square (L = 3 in the figure). We have ‖O‖_Lip = O(√n) and ‖O‖ = n/L². Thus our methods require poly(L, log(n), ε⁻¹) samples to estimate the expectation value of all such observables. Shadow-like methods require poly(e^{cL²}, log(n), ε⁻¹) samples, an exponentially worse dependency in L. Even for moderate values of L, say L = 5, this can lead to a factor 10⁷ savings in sample complexity, and it gives an exponential speedup for L = poly(log(n)). Other many-body methods have a poly(L, n, ε⁻¹) scaling [3–5, 7, 38], which in turn is exponentially worse in the system size.

II. SUMMARY OF THE MAXIMUM ENTROPY PROCEDURE AND CONTRIBUTIONS

Now that we have discussed how our results yield better sample complexities for some classes of states, we discuss the maximum entropy algorithm in more detail and comment on how our results equip it with better performance guarantees. The flowchart in Figure 2 gives the general scheme behind the maximum entropy method. Besides the exponential improvements in sample complexity laid out in Table II, we also provide structural improvements, which we elaborate on while explaining the general scheme:

a. Input: The input consists of m linearly independent operators E_i of operator norm at most 1, an upper bound β > 0 on the inverse temperature, a precision parameter ε > 0 and a step size η. Moreover, we are given the promise that the state of interest satisfies Eq. (6). Although we will be mostly concerned with the case in which the observables are local, we show the convergence of the algorithm in general in Sec. A. The step size should be picked as η = O(U⁻¹) with U satisfying Eq. (13), as explained in Sec. A 1.

b. Require: We assume that we have access to copies of σ(λ) and that we can perform measurements to estimate the expectation values of the observables E_i up to precision ε > 0. For most of the examples considered here, this will only require implementing simple, few-qudit measurements.

c. Output: The output is a vector of parameters µ of a Gibbs state σ(µ) as in Eq. (6). Note that unlike [5], our goal is not to estimate the vector of parameters λ, but rather to obtain an approximation of the state satisfying σ(λ) ≃ σ(µ). Here we will focus on quantifying the output's quality in relative entropy. More precisely, the output of the algorithm is guaranteed to satisfy D(σ(µ)‖σ(λ)) = O(εn).

d. Step 1: In this step, we estimate the expectation values of each observable E_i on the state σ(λ) up to an error ε. The resources to be optimized here are the number of samples of σ(λ) we require and the complexity of implementing the measurements. Using shadow tomography or Pauli grouping methods [9, 10, 39] we can do so requiring O(4^{r_0} ε⁻² polylog(m)) samples and Pauli or 1-qubit measurements, where r_0 is the maximum number of qubits the E_i act on. This is discussed in more detail in Sec. C.
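As a concrete illustration of Step 1, the following sketch simulates the single-qubit random-Pauli ("classical shadow" [9]) estimator for k-local Pauli expectation values. The 3^k rescaling of matching snapshots is the standard unbiased estimator of the method; the measured state (|0…0⟩, so that Z-basis outcomes are deterministic and X/Y outcomes are fair coin flips) and all sizes are our own toy choices, made so the script runs without a full simulator.

```python
# Sketch of Step 1: estimate k-local Pauli expectation values from random
# single-qubit Pauli-basis measurements (classical shadows). Toy setting:
# the measured state is |0...0>, so Z outcomes are +1 and X/Y outcomes are
# uniformly random signs.
import numpy as np

rng = np.random.default_rng(0)
n, shots = 8, 20000

def sample_snapshot():
    """One round: a random basis in {X, Y, Z} and a +-1 outcome per qubit."""
    bases = rng.integers(0, 3, size=n)  # 0 = X, 1 = Y, 2 = Z
    outcomes = np.where(bases == 2, 1, rng.choice([-1, 1], size=n))
    return bases, outcomes

def estimate_pauli(support, letters, snapshots):
    """Unbiased estimate of tr[P rho] for the Pauli string P given by
    `letters` on `support`: a snapshot contributes prod_i 3 * outcome_i when
    every measured basis matches the corresponding Pauli letter, else 0."""
    total = 0.0
    for bases, outcomes in snapshots:
        if all(bases[q] == p for q, p in zip(support, letters)):
            total += np.prod([3 * outcomes[q] for q in support])
    return total / len(snapshots)

snaps = [sample_snapshot() for _ in range(shots)]
# All nearest-neighbour Z_i Z_{i+1}; the exact value is 1 for |0...0>.
print([round(estimate_pauli((i, i + 1), (2, 2), snaps), 2) for i in range(n - 1)])
```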

Structure                        | Samples                  | Postprocessing | Lipschitz constant
Lightcone-l_0 circuit            | ε⁻⁴ 4^{6l_0}             | exp(3n)        | ∇
Commuting Gibbs for β < β_c      | ε⁻²                      | poly(n)        | ∇, □
1D commuting Gibbs               | min{ε⁻² e^{O(β)}, ε⁻⁴}   | poly(n)        | ∇, □
Gibbs for β < β_c = O((kg)⁻¹)    | ε⁻²                      | poly(n)        | Loc(k, g)

TABLE II. Performance of the algorithm under various assumptions on the underlying state, for obtaining a state that is ε√n-close in Wasserstein distance. All estimates are up to polylog(n) factors. We refer to Sec. B 3 for proofs of the TC inequalities used and to Sec. C for how to combine them with maximum entropy methods. In Section D we explain how to obtain the sample complexity by combining Thm. A.1 with strong convexity bounds. For the postprocessing we refer to Section E. The case of shallow circuits is discussed in more detail in Sec. C 1. By lightcone l_0 we mean the size of the largest lightcone of any qubit in the circuit.

e. Step 2: The maximum entropy problem in Eq. (7) can be solved with gradient descent, as it corresponds to a strictly convex optimization problem [40]. At this step we simply initialize the algorithm at the maximally mixed state.

f. Step 3: It turns out that the gradient of the maximum entropy target function at µ_t is proportional to e(µ_t) − e(λ). Thus, to implement an iteration of gradient descent, it is necessary to compute e(µ_t), as we already obtained an approximation of e(λ) in Step 1. Moreover, it is imperative to ensure that the algorithm also converges with access only to approximations of e(µ_t) and e(λ). This is because most algorithms to compute e(µ_t) only provide approximate values [21–23, 41]. In addition, they usually have a poor scaling in the accuracy [22], making it necessary to show that the process converges with rough approximations of e(µ_t) and e(λ). Here we show that it is indeed possible to perform this step with only approximate computations of expectation values. This allows us to identify classes of states for which the postprocessing can be done efficiently. These results are discussed in more detail in Sec. A 2.

g. Convergence loop: Now that we have seen how to compute one iteration of gradient descent, the next natural question is how many iterations are required to reach the stopping criterion. As this is a strongly convex problem, the convergence speed depends on the eigenvalues of the Hessian of the function being optimized [40, Section 9.1.2]. For maximum entropy, this corresponds to bounding the eigenvalues of a generalized covariance matrix. In [5] the authors already showed such bounds for local Hamiltonians, implying the convergence of the algorithm in a number of steps depending polynomially on m and logarithmically on the tolerance ε for fixed β. Here we improve their bound in several directions. First, we show that the algorithm converges after a number of iterations polynomial in m for arbitrary E_i, albeit with a polynomial dependence on the error, as discussed in Sec. A 2. We then specialize to certain classes of states to obtain various improvements. For high-temperature, commuting Hamiltonians we provide a complete picture and show that the condition number of the Hessian is constant in Prop. I.1. This implies that gradient descent converges in a number of iterations that scales logarithmically in system size and error.

h. Stopping condition and recovery guarantees: The stopping condition,

  ‖e(λ) − e(µ_t)‖_{ℓ2} ≤ √m ε,

can be immediately converted into a relative entropy bound between the target state and the current iterate by the identity (8). This justifies its choice as a stopping criterion.

Since we already discussed the sample complexity of the maximum entropy method, let us now discuss some of its computational aspects. There are two quantities that govern the complexity of the algorithm: how many iterations we need to perform until we converge and how expensive each iteration is.

As the maximum entropy problem is strongly convex, one can show that O(UL⁻¹ log(nε⁻¹)) iterations suffice to converge. Here again U, L are bounds on the Hessian as in Eq. (13). Nevertheless, we also show how to bypass requiring such bounds in Sec. A 1 and obtain that the maximum entropy algorithm converges after O(mε⁻²) iterations without any locality assumptions on the E_i or strong convexity guarantees. That is, the number of iterations is at most polynomial in m.

[FIG. 2. Flowchart for general maximum entropy algorithms. Input: a set of operators E_1, ..., E_m, β > 0, an error tolerance ε > 0 and a step size η; the target state is σ(λ) ∝ exp(−β Σ_i λ_i E_i). Require: the ability to prepare σ(λ). Step 1: obtain an estimate of e(λ) = (tr[σ(λ)E_i])_i up to ε. Step 2: initialize µ_0 = 0. Step 3: estimate e(µ_t); if ‖e(µ_t) − e(λ)‖ ≤ √m ε, output µ_t; otherwise perform the gradient step µ_{t+1} = µ_t − η(e(λ) − e(µ_t)), increment t and repeat Step 3. Output: the estimate µ of λ.]

Let us now discuss the cost of implementing each iteration of the algorithm on a classical computer. This boils down to estimating e(µ_t) for the current iterate, which can be achieved in various ways. In the worst case, it is possible to just compute the matrix exponential and the expectation values directly, which yields a complexity of O(d^{3n}m). However, for many of the classes considered here it is possible to do this computation in polynomial time. For instance, in [22] the authors show that for high-temperature Gibbs states it is possible to approximate the partition function efficiently. Thus, for the case of high-temperature Gibbs states, not necessarily commuting ones, we can do the postprocessing efficiently. It is also worth mentioning tensor network techniques to estimate e(µ_t). As we only require computing the expectation values of local observables, recent works show that it is possible to find tensor network states of constant bond dimension that approximate all expectation values of a given locality well [37, 42, 43]. From such a representation it is then possible to compute e(µ_t) efficiently in the 1D case by contracting the underlying tensor network. Unfortunately, however, in higher dimensions the contraction still takes exponential time. Table II provides a summary of the complexity of the postprocessing for various classes.

It is also worth considering the complexity of the postprocessing with access to a quantum computer, especially for commuting Hamiltonians. As all high-temperature Gibbs states satisfy exponential decay of correlations, the results of [44] imply that high-temperature Gibbs states can be prepared with a circuit of depth logarithmic in system size. Thus, by using the same method we used to estimate e(λ), we can also estimate e(µ_t) from the copies provided by the quantum computer. The complexity of the postprocessing for shadows is linear in system size. Thus, with access to a quantum computer we can perform the postprocessing for each iteration in time Õ(mε⁻²). As in this case we showed that the number of iterations is Õ(1), we conclude that we can perform the whole postprocessing in time Õ(mε⁻²). That is, for this class of systems our results give an arguably complete picture regarding the postprocessing, as it can be performed in a time comparable to writing down the vector of parameters, up to polylogarithmic factors. Furthermore, given that the underlying Gibbs states are known to satisfy TC and Prop. I.1 gives essentially optimal bounds on the covariance matrices, we believe that the present work essentially settles the question of how efficiently we can learn such Hamiltonians and the corresponding Gibbs states. We discuss this in more detail in Sec. F.
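To make the loop of Fig. 2 concrete, here is a self-contained toy run of the projected gradient iteration and stopping rule, with e(µ_t) computed by exact matrix exponentiation (the worst-case exp(Θ(n)) postprocessing route mentioned above). The chain length, ZZ observables, step size and tolerance are our own illustrative choices; the sign of the update follows from ∇_µ log Z(µ) = −βe(µ).

```python
# Toy end-to-end run of the Fig. 2 loop on a small chain: gradient descent on
# Eq. (7) with stopping rule ||e(mu_t) - e(lambda)||_2 <= sqrt(m) * eps.
# Expectation values are computed exactly; all parameters are toy choices.
import numpy as np
from scipy.linalg import expm
from functools import reduce

I2, Z = np.eye(2), np.diag([1.0, -1.0])

def zz_term(n, i):
    ops = [I2] * n
    ops[i], ops[i + 1] = Z, Z
    return reduce(np.kron, ops)

def expectations(E, v, beta):
    rho = expm(-beta * sum(c * Ei for c, Ei in zip(v, E)))
    rho /= np.trace(rho)
    return np.array([np.trace(Ei @ rho).real for Ei in E])

n, beta, eps, eta = 4, 1.0, 1e-3, 0.5
E = [zz_term(n, i) for i in range(n - 1)]
m = len(E)

rng = np.random.default_rng(1)
lam = rng.uniform(-1, 1, size=m)       # hidden parameters of sigma(lambda)
e_target = expectations(E, lam, beta)  # Step 1 (exact here, sampled in practice)

mu = np.zeros(m)                       # Step 2: start at the maximally mixed state
for t in range(10000):                 # Step 3 and convergence loop
    g = e_target - expectations(E, mu, beta)   # proportional to the gradient of Eq. (7)
    if np.linalg.norm(g) <= np.sqrt(m) * eps:  # stopping condition of Fig. 2
        break
    mu = np.clip(mu - eta * g, -1, 1)  # projected gradient step onto ||mu||_inf <= 1

print(t, np.round(mu - lam, 3))        # mu approximately recovers lambda
```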

Finally, an example of our bounds is illustrated in Fig. 3, where we show that the number of samples required to estimate a local observable to relative precision is essentially system-size independent.

[Figure 3: "Scaling of error in estimating Lipschitz observable"; log₁₀ error versus system size (10¹ to 10⁴), comparing the bound from Pinsker's inequality with the actual value.]

FIG. 3. Error in estimating a Lipschitz observable after performing the maximum entropy reconstruction method. The underlying state is a classical 1D Gibbs state with randomly chosen nearest-neighbor interactions at inverse temperature β = 1. We estimated all the Z_iZ_{i+1} expectation values of the original state from 10³ samples. We then computed the upper bound on the trace distance predicted by Eq. (8) and Pinsker's inequality and compared it to the actual discrepancy for a Lipschitz observable on the reconstructed and actual states. The Lipschitz observable was chosen as n⁻¹ Σ_i U Z_i Z_{i+2} U†, where we picked U as a depth-3 quantum circuit. We observe that the error incurred is essentially independent of system size, and we get good predictions even when the number of samples is smaller than the system size.

III. CONCLUSION

In this article we have demonstrated that ideas from quantum optimal transport can yield exponential improvements in the sample complexity required to learn a quantum state when compared to state-of-the-art methods. More precisely, we showcased how the interplay between maximum entropy methods and the Wasserstein distance, which is mediated by a transportation cost inequality, allows for fine-tuning the complexity of the observables whose expectation values we wish to estimate against the number of samples required for that. Through our techniques we essentially settled most questions related to how efficiently it is possible to learn a commuting Gibbs state.

We believe that the framework we began to develop here will find applications in other areas of quantum information theory, such as quantum process tomography, and in quantum many-body physics. Indeed, in many practical settings demanding a trace distance bound might be too restrictive, and replacing it by a Wasserstein distance bound can lead to substantial gains, as in this article.

Some of the outstanding open questions raised by this article are establishing that a suitable notion of exponential decay of correlations in general implies a transportation cost inequality, and showing that TC holds for a larger class of systems. Moreover, it would also be interesting to investigate other applications of Gaussian concentration in many-body physics [24, 32–35] from the angle of transportation cost inequalities.

IV. ACKNOWLEDGEMENTS

DSF was supported by VILLUM FONDEN via the QMATH Centre of Excellence under Grant No. 10059. The research of CR has been supported by project QTraj (ANR-20-CE40-0024-01) of the French National Research Agency (ANR) and by a Junior Researcher START Fellowship from the MCQST. DSF and CR are grateful to Richard Kueng, Fernando Brandão and Giacomo De Palma for interesting discussions.

[1] Ryan O'Donnell and John Wright. Efficient quantum tomography. In Daniel Wichs and Yishay Mansour, editors, Proceedings of the 48th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2016, Cambridge, MA, USA, June 18-21, 2016, pages 899–912. ACM, 2016.
[2] Jeongwan Haah, Aram Wettroth Harrow, Zhengfeng Ji, Xiaodi Wu, and Nengkun Yu. Sample-optimal tomography of quantum states. IEEE Trans. Inf. Theory, 63(9):5628–5641, 2017.
[3] Marcus Cramer, Martin B. Plenio, Steven T. Flammia, Rolando Somma, David Gross, Stephen D. Bartlett, Olivier Landon-Cardinal, David Poulin, and Yi-Kai Liu. Efficient quantum state tomography. Nature Communications, 1(1):149, dec 2010.
[4] T. Baumgratz, A. Nüßeler, M. Cramer, and M. B. Plenio. A scalable maximum likelihood method for quantum state tomography. New Journal of Physics, 15(12):125004, dec 2013.
[5] Anurag Anshu, Srinivasan Arunachalam, Tomotaka Kuwahara, and Mehdi Soleimanifar. Sample-efficient learning of interacting quantum systems. Nature Physics, may 2021.
[6] Giacomo Torlai, Guglielmo Mazzola, Juan Carrasquilla, Matthias Troyer, Roger Melko, and Giuseppe Carleo. Neural-network quantum state tomography. Nature Physics, 14(5):447–450, may 2018.
[7] Jun Wang, Zhao-Yu Han, Song-Bo Wang, Zeyang Li, Liang-Zhu Mu, Heng Fan, and Lei Wang. Scalable quantum tomography with fidelity estimation. Phys. Rev. A, 101:032321, Mar 2020.
[8] Jens Eisert, Dominik Hangleiter, Nathan Walk, Ingo Roth, Damian Markham, Rhea Parekh, Ulysse Chabaud, and Elham Kashefi. Quantum certification and benchmarking. Nature Reviews Physics, 2(7):382–390, jul 2020.
[9] Hsin-Yuan Huang, Richard Kueng, and John Preskill. Predicting many properties of a quantum system from very few measurements. Nature Physics, June 2020.
[10] Jordan Cotler and Frank Wilczek. Quantum overlapping tomography. Physical Review Letters, 124(10):100401, 2020.
[11] Andrew Jena, Scott Genin, and Michele Mosca. Pauli partitioning with respect to gate sets. arXiv:1907.07859 [quant-ph], July 2019.
[12] Ophelia Crawford, Barnaby van Straaten, Daochen Wang, Thomas Parks, Earl Campbell, and Stephen Brierley. Efficient quantum measurement of Pauli operators in the presence of finite sampling error. arXiv:1908.06942 [quant-ph], April 2020.
[13] Eric A. Carlen and Jan Maas. Non-commutative calculus, optimal transport and functional inequalities in dissipative quantum systems. Journal of Statistical Physics, 178(2):319–378, nov 2019.
[14] Cambyse Rouzé and Nilanjana Datta. Concentration of quantum states from quantum functional and transportation cost inequalities. Journal of Mathematical Physics, 60(1):012202, January 2019.
[15] Li Gao, Marius Junge, and Nicholas LaRacuente. Fisher information and logarithmic Sobolev inequality for matrix-valued functions. Annales Henri Poincaré, 21(11):3409–3478, November 2020.
[16] Giacomo De Palma, Milad Marvian, Dario Trevisan, and Seth Lloyd. The quantum Wasserstein distance of order 1. IEEE Transactions on Information Theory, pages 1–1, 2021.
[17] Bobak Toussi Kiani, Giacomo De Palma, Milad Marvian, Zi-Wen Liu, and Seth Lloyd. Quantum earth mover's distance: A new approach to learning quantum data. arXiv preprint arXiv:2101.03037, 2021.
[18] Giacomo De Palma and Cambyse Rouzé. Quantum concentration inequalities. arXiv:2106.15819 [math-ph, physics:quant-ph], June 2021.
[19] E. T. Jaynes. Information theory and statistical mechanics. Physical Review, 106(4):620–630, May 1957.
[20] Jeongwan Haah, Robin Kothari, and Ewin Tang. Optimal learning of quantum Hamiltonians from high-temperature Gibbs states, 2021. arXiv:2108.04842v1.
[21] M. Kliesch, C. Gogolin, M. J. Kastoryano, A. Riera, and J. Eisert. Locality of temperature. Physical Review X, 4(3):031019, July 2014.
[22] Tomotaka Kuwahara, Kohtaro Kato, and Fernando G. S. L. Brandão. Clustering of conditional mutual information for quantum Gibbs states above a threshold temperature. Phys. Rev. Lett., 124:220601, Jun 2020.
[23] Aram W. Harrow, Saeed Mehraban, and Mehdi Soleimanifar. Classical algorithms, correlation decay, and complex zeros of partition functions of quantum many-body systems. In Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing, pages 378–386, 2020.
[24] Tomotaka Kuwahara and Keiji Saito. Gaussian concentration bound and ensemble equivalence in generic quantum many-body systems including long-range interactions. Annals of Physics, 421:168278, 2020.
[25] Tomotaka Kuwahara, Álvaro M. Alhambra, and Anurag Anshu. Improved thermal area law and quasilinear time algorithm for quantum Gibbs states. Physical Review X, 11(1):011047, March 2021.
[26] Ángela Capel, Cambyse Rouzé, and Daniel Stilck França. The modified logarithmic Sobolev inequality for quantum spin systems: classical and commuting nearest neighbour interactions. arXiv preprint arXiv:2009.11817, 2020.
[27] M. Talagrand. Transportation cost for Gaussian and other product measures. Geometric and Functional Analysis, 6(3):587–600, may 1996.
[28] Cédric Villani. Optimal Transport, volume 338 of Grundlehren der mathematischen Wissenschaften. Springer Berlin Heidelberg, Berlin, Heidelberg, 2009.
[29] Maxim Raginsky and Igal Sason. Concentration of Measure Inequalities in Information Theory, Communications, and Coding. Now Publishers, Norwell, MA, 2014.
[30] The definition of [16] has a different normalization and does not have the √n term. This normalization will be convenient to treat the constants of [14] and [16] on an equal footing.
[31] S. G. Bobkov and F. Götze. Exponential integrability and transportation cost related to logarithmic Sobolev inequalities. Journal of Functional Analysis, 163(1):1–28, April 1999.
[32] Fernando G. S. L. Brandao and Marcus Cramer. Equivalence of statistical mechanical ensembles for non-critical quantum systems. arXiv:1502.03263 [cond-mat, physics:quant-ph], February 2015.
[33] Anurag Anshu. Concentration bounds for quantum states with finite correlation length on quantum spin lattice systems. New Journal of Physics, 18(8):083011, aug 2016.
[34] Hal Tasaki. On the local equivalence between the canonical and the microcanonical ensembles for quantum spin systems. Journal of Statistical Physics, 172(4):905–926, August 2018.
[35] Tomotaka Kuwahara and Keiji Saito. Eigenstate thermalization from the clustering property of correlation. Physical Review Letters, 124(20):200604, May 2020.
[36] Matthew B. Hastings and Tohru Koma. Spectral gap and exponential decay of correlations. Communications in Mathematical Physics, 265(3):781–804, August 2006.
[37] Alexander M. Dalzell and Fernando G. S. L. Brandão. Locally accurate MPS approximations for ground states of one-dimensional gapped local Hamiltonians. Quantum, 3:187, September 2019.
[38] B. P. Lanyon, C. Maier, M. Holzäpfel, T. Baumgratz, C. Hempel, P. Jurcevic, I. Dhand, A. S. Buyskikh, A. J. Daley, M. Cramer, M. B. Plenio, R. Blatt, and C. F. Roos. Efficient tomography of a quantum many-body system. Nature Physics, 13(12):1158–1162, September 2017.
[39] Xavier Bonet-Monroig, Ryan Babbush, and Thomas E. O'Brien. Nearly optimal measurement scheduling for partial tomography of quantum states. Physical Review X, 10(3):031064, 2020.
[40] Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
[41] M. B. Hastings. Quantum belief propagation: An algorithm for thermal quantum systems. Physical Review B, 76(20):201102, November 2007.
[42] Álvaro M. Alhambra and J. Ignacio Cirac. Locally accurate tensor networks for thermal states and time evolution. arXiv:2106.00710 [cond-mat, physics:quant-ph], June 2021.
[43] Yichen Huang. Locally accurate matrix product approximation to thermal states. arXiv:2106.03854 [cond-mat, physics:math-ph, physics:quant-ph], June 2021.
[44] Fernando G. S. L. Brandão and Michael J. Kastoryano. Finite correlation length implies efficient preparation of quantum thermal states. Communications in Mathematical Physics, 365(1):1–16, January 2019.
[45] Brian Swingle and Isaac H. Kim. Reconstructing quantum states from local data. Physical Review Letters, 113(26):260501, December 2014.
[46] Christian Kokail, Rick van Bijnen, Andreas Elben, Benoît Vermersch, and Peter Zoller. Entanglement Hamiltonian tomography in quantum simulation, 2020. arXiv:2009.09000v1.
[47] M. B. Hastings. Quantum belief propagation: An algorithm for thermal quantum systems. Physical Review B, 76(20), nov 2007.
[48] Scott Aaronson. Shadow tomography of quantum states. In STOC'18, Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, pages 325–338. ACM, New York, 2018.
[49] Akram Youssry, Christopher Ferrie, and Marco Tomamichel. Efficient online quantum state estimation using a matrix-exponentiated gradient method. New J. Phys., 21(3):033006, 2019.
[50] Fernando G. S. L. Brandão, Amir Kalev, Tongyang Li, Cedric Yen-Yu Lin, Krysta M. Svore, and Xiaodi Wu. Quantum SDP solvers: large speed-ups, optimality, and applications to quantum learning. In 46th International Colloquium on Automata, Languages, and Programming, volume 132 of LIPIcs. Leibniz Int. Proc. Inform., pages Art. No. 27, 14. Schloss Dagstuhl. Leibniz-Zent. Inform., Wadern, 2019.
[51] Fernando G. S. L. Brandão, Richard Kueng, and Daniel Stilck França. Fast and robust quantum state tomography from few basis measurements. arXiv:2009.08216 [quant-ph], September 2020.
[52] Eric A. Carlen and Jan Maas. An analog of the 2-Wasserstein metric in non-commutative probability under which the fermionic Fokker–Planck equation is gradient flow for the entropy. Communications in Mathematical Physics, 331(3):887–926, 2014.
[53] Eric A. Carlen and Jan Maas. Gradient flow and entropy inequalities for quantum Markov semigroups with detailed balance. Journal of Functional Analysis, 273(5):1810–1869, 2017.
[54] Maxim Raginsky and Igal Sason. Concentration of measure inequalities in information theory, communications and coding. Foundations and Trends in Communications and Information Theory. NOW Publishers, Boston, MA, USA, 2018.
[55] M. B. Hastings. Locality in quantum systems. arXiv:1008.5137 [math-ph, physics:quant-ph], August 2010.
[56] David Poulin. Lieb-Robinson bound and locality for general Markovian quantum dynamics. Physical Review Letters, 104(19):190401, May 2010.
[57] Thomas Barthel and Martin Kliesch. Quasilocality and efficient simulation of Markovian quantum dynamics. Physical Review Letters, 108(23):230504, 2012.
[58] Martin Kliesch, Christian Gogolin, and Jens Eisert. Lieb-Robinson bounds and the simulation of time-evolution of local observables in lattice systems. In Volker Bach and Luigi Delle Site, editors, Many-Electron Approaches in Physics, Chemistry and Mathematics, pages 301–318. Springer International Publishing, Cham, 2014.
[59] Nathael Gozlan and Christian Léonard. Transport inequalities. A survey. arXiv preprint arXiv:1003.3852, 2010.
[60] Ivan Bardet, Ángela Capel, Li Gao, Angelo Lucia, David Pérez-García, and Cambyse Rouzé. Entropy decay for Davies semigroups of a one dimensional quantum lattice. In preparation, 2021.
[61] Salman Beigi, Nilanjana Datta, and Cambyse Rouzé. Quantum reverse hypercontractivity: its tensorization and application to strong converses. Communications in Mathematical Physics, 376(2):753–794, 2020.
[62] Kristan Temme, Fernando Pastawski, and Michael J. Kastoryano. Hypercontractivity of quasi-free quantum semigroups. Journal of Physics A: Mathematical and Theoretical, 47(40):405303, October 2014.
[63] Anurag Anshu, Srinivasan Arunachalam, Tomotaka Kuwahara, and Mehdi Soleimanifar. Efficient learning of commuting Hamiltonians on lattices. Unpublished notes available at Anurag Anshu's website (link to note).
[64] Roland L. Dobrushin and Senya B. Shlosman. Completely analytical interactions: constructive description. Journal of Statistical Physics, 46(5):983–1014, 1987.
[65] Huzihiro Araki. Gibbs states of a one dimensional quantum lattice. Communications in Mathematical Physics, 14(2):120–157, 1969.
[66] Marc Vuffray, Sidhant Misra, Andrey Lokhov, and Michael Chertkov. Interaction screening: Efficient and sample-optimal learning of Ising models. In Advances in Neural Information Processing Systems, pages 2595–2603, 2016.
[67] Andrea Montanari et al. Computational implications of reducing data to sufficient statistics. Electronic Journal of Statistics, 9(2):2370–2390, 2015.
[68] Michael J. Kastoryano and Fernando G. S. L. Brandao. Quantum Gibbs samplers: the commuting case. Communications in Mathematical Physics, 344(3):915–957, 2016.
[69] Ivan Bardet, Ángela Capel, and Cambyse Rouzé. Approximate tensorization of the relative entropy for noncommuting conditional expectations. arXiv preprint arXiv:2001.07981, 2020.

SUPPLEMENTAL MATERIAL

This is the supplemental material to "Learning quantum many-body systems from a few copies". We will start in Sec. A with a review of the basic properties of the maximum entropy principle to learn quantum states. This is followed by a discussion of Lipschitz constants, Wasserstein distances and transportation cost inequalities in Sec. B. After that, in Sec. C we discuss more explicitly the interplay between the maximum entropy method and transportation cost inequalities. We then briefly discuss scenarios in which the postprocessing required for the maximum entropy method can be performed efficiently in Sec. E. Finally, in Sec. F we discuss a class of examples where we show that all the technical results required to obtain the strongest guarantees of our work hold, namely Gibbs states of commuting Hamiltonians at high enough temperature and of 1D commuting Hamiltonians.

We start by setting some basic notation. Throughout this article, we denote by M_k the algebra of k × k matrices on Cᵏ, whereas M_k^{sa} denotes the subspace of self-adjoint matrices. The set of quantum states over Cᵏ is denoted by D_k. Typically, k will be taken to be dⁿ for n-qudit systems. The trace on M_k is denoted by tr. Given two quantum states ρ, σ, we denote by S(ρ) = −tr[ρ log(ρ)] the von Neumann entropy of ρ, and by D(ρ‖σ) the relative entropy between ρ and σ, i.e. D(ρ‖σ) = tr[ρ(log(ρ) − log(σ))] whenever the support of ρ is contained in that of σ, and +∞ otherwise. The trace distance is denoted by ‖ρ − σ‖_tr := tr[|ρ − σ|], and the operator norm of an observable by ‖O‖_∞. Scalar products are denoted by ⟨·|·⟩. Moreover, we denote the ℓ_p norm of vectors by ‖·‖_{ℓp}, and for x ∈ Rᵐ and r ∈ R, B_{ℓp}(x, r) denotes the ball of radius r in ℓ_p norm around x. The identity matrix is denoted by I. The adjoint of an operator A is denoted by A†, and that of a channel Φ with respect to the trace inner product by Φ*. For a hypergraph G = (V, E) we will denote the distance between subsets of vertices induced by the hypergraph by dist.

Appendix A: Maximum entropy principle for quantum Gibbs states

One of the main aspects of this work concerns the effectiveness of the maximum entropy method for the tomography of quantum Gibbs states in various settings and regimes. Thus, we start by recalling some basic properties of the maximum entropy method. Our starting assumption is that the target state is well-described by a quantum Gibbs state with respect to a known set of operators and that we are given an upper bound on the inverse temperature:

Definition A.1 (Gibbs state with respect to observables). Given a set of observables E = {E_i}_{i=1}^m with E_1, ..., E_m ∈ M_{dⁿ}^{sa} linearly independent and ‖E_i‖_∞ ≤ 1, we call a state σ ∈ D_{dⁿ} a Gibbs state at inverse temperature β > 0 if there exists a vector λ ∈ Rᵐ with ‖λ‖_{ℓ∞} ≤ 1 such that

  σ = exp(−β Σ_{i=1}^m λ_i E_i)/Z(λ), where Z(λ) = tr[exp(−β Σ_{i=1}^m λ_i E_i)]  (A1)

denotes the partition function. In what follows, we will denote σ by σ(λ) and Σ_i λ_i E_i by H(λ), where the dependence of σ(λ) on β is implicitly assumed.

We are mostly interested in the regime where m ≪ dⁿ. Then the above condition can be interpreted as imposing that the matrix log(σ) is sparse with respect to a known basis E. A canonical example of such states are Gibbs states of local Hamiltonians on a lattice, for which m = O(n) and the observables E_i are taken to be tensor products of Pauli matrices acting on neighboring sites. But we could also consider a basis consisting of quasi-local operators or some subspace of Pauli strings.

Next, we review some basic facts about quantum Gibbs states. One of their main properties is that they satisfy a well-known maximum entropy principle [19]. This allows us to simultaneously show that the expectation values of the observables E completely characterize the state σ(λ) and further provides us with a variational principle to learn a description from which we can infer an approximation of other expectation values. Let us start with the standard formulation of the maximum entropy principle:

Proposition A.1 (Maximum entropy principle). Let $\sigma(\lambda) \in \mathcal{D}_{d^n}$ be a quantum Gibbs state (A1) with respect to the basis $\mathcal{E}$ at inverse temperature $\beta$ and introduce $e_i(\lambda) := \mathrm{tr}[\sigma(\lambda)E_i]$ for $i = 1, \dots, m$. Then $\sigma(\lambda)$ is the unique optimizer of the maximum entropy problem:

$$\max_{\rho \in \mathcal{D}_{d^n}} S(\rho) \tag{A2}$$

$$\text{subject to } \mathrm{tr}[E_i\rho] = e_i(\lambda) \text{ for all } i = 1, \dots, m.$$

Moreover, σ(λ) optimizes:

$$\min_{\mu \in \mathbb{R}^m,\ \|\mu\|_{\ell_\infty} \le 1}\ \log(\mathcal{Z}(\mu)) + \beta\sum_{i=1}^m \mu_i e_i(\lambda). \tag{A3}$$

Proof. The proof is quite standard, but we include it for completeness. Note that for any state $\rho \neq \sigma(\lambda)$ that is a feasible point of Eq. (A2) we have that:

$$S(\sigma(\lambda)) - S(\rho) = D(\rho\|\sigma(\lambda)) + \mathrm{tr}[(\rho - \sigma(\lambda))\log(\sigma(\lambda))] = D(\rho\|\sigma(\lambda)) - \beta\sum_{i=1}^m \lambda_i\,\mathrm{tr}[E_i(\rho - \sigma(\lambda))] = D(\rho\|\sigma(\lambda)) > 0,$$

where we have used the fact that $\mathrm{tr}[E_i(\rho - \sigma(\lambda))] = 0$ for all feasible points and that the relative entropy between two different states is strictly positive. This shows that $\sigma(\lambda)$ is the unique solution of (A2). Eq. (A3) is nothing but the dual program of Eq. (A2).
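For concreteness, here is a sketch of the dual objective (A3) and of the gradient identity $(\nabla f(\mu))_i = \beta(e_i(\lambda) - e_i(\mu))$ derived in Lemma A.2 below, continuing the snippet above (`E`, `gibbs_state` and `e_lam` are the illustrative helpers defined there):

```python
def log_Z(mu, beta):
    """log of the partition function Z(mu) = tr exp(-beta * H(mu))."""
    H = sum(c * Ei for c, Ei in zip(mu, E))
    w = np.linalg.eigvalsh(-beta * H)
    return np.log(np.sum(np.exp(w)))    # a log-sum-exp would be more stable

def f(mu, beta, e_target):
    """Dual objective of Eq. (A3) with target expectation values e_target."""
    return log_Z(mu, beta) + beta * np.dot(mu, e_target)

def grad_f(mu, beta, e_target):
    """Gradient of f, cf. Eq. (A9): beta * (e(lambda) - e(mu))."""
    sig_mu = gibbs_state(mu, beta)
    e_mu = np.array([np.trace(sig_mu @ Ei).real for Ei in E])
    return beta * (e_target - e_mu)
```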

Eq. (A2) above gives a variational principle to find a quantum Gibbs state corresponding to certain expectation values. As is well known, one can use gradient descent to solve the problem in Eq. (A3), as it is a strongly convex problem. Various recent works have discussed learning Gibbs states [5, 45, 46], and it is certainly not a new idea to do so through maximum entropy methods. Nevertheless, we will discuss how to perform the postprocessing in more detail, as some recent results allow us to give this algorithm stronger performance guarantees. Finally, it should be said that although we draw inspiration from [5], our main goal will be to learn a set of parameters $\mu \in \mathbb{R}^m$ such that the Gibbs states $\sigma(\mu)$ and $\sigma(\lambda)$ are approximately the same on sufficiently regular observables while optimizing the sample complexity. This is in contrast to the goal of [5], which was to learn the vector of parameters $\lambda$. Learning $\lambda$ corresponds to a stronger requirement, in the sense that if the vectors of parameters are close, then the underlying states are also close, as made precise in the following Prop. A.2. One of the facts that we are going to exploit often is that it is possible to efficiently estimate the relative entropy between two Gibbs states $\sigma(\lambda)$ and $\sigma(\mu)$ given the parameters $\lambda, \mu$ and the expectation values of observables in $\mathcal{E}$. This also yields an efficiently computable bound on the trace distance. Indeed, as observed in [5], we have that:

Proposition A.2. Let $\sigma(\mu), \sigma(\lambda) \in \mathcal{D}_{d^n}$ be Gibbs states with respect to a set of observables $\mathcal{E}$ at inverse temperature $\beta$. Denote $e(\lambda) = (\mathrm{tr}[\sigma(\lambda)E_i])_i \in \mathbb{R}^m$. Then

$$\|\sigma(\mu) - \sigma(\lambda)\|_{\mathrm{tr}}^2 \le D(\sigma(\mu)\|\sigma(\lambda)) + D(\sigma(\lambda)\|\sigma(\mu)) = \beta\langle\lambda - \mu|e(\mu) - e(\lambda)\rangle. \tag{A4}$$

Proof. The equality in Eq. (A4) follows from a simple manipulation. Indeed:

$$D(\sigma(\mu)\|\sigma(\lambda)) + D(\sigma(\lambda)\|\sigma(\mu)) = \beta\sum_{i=1}^m(\lambda_i - \mu_i)\,\mathrm{tr}[(\sigma(\mu) - \sigma(\lambda))E_i].$$

The bound on the trace distance then follows by applying Pinsker's inequality.
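As a quick numerical sanity check of the identity in Eq. (A4) (continuing the sketches above; exact diagonalization, so only feasible for very small systems):

```python
def rel_entropy(rho, sig):
    """Quantum relative entropy D(rho || sig) for full-rank states."""
    wr, vr = np.linalg.eigh(rho)
    ws, vs = np.linalg.eigh(sig)
    log_rho = vr @ np.diag(np.log(wr)) @ vr.conj().T
    log_sig = vs @ np.diag(np.log(ws)) @ vs.conj().T
    return np.trace(rho @ (log_rho - log_sig)).real

mu = rng.uniform(-1, 1, size=m)
sig_l, sig_m = gibbs_state(lam, beta), gibbs_state(mu, beta)
e_mu = np.array([np.trace(sig_m @ Ei).real for Ei in E])

lhs = rel_entropy(sig_m, sig_l) + rel_entropy(sig_l, sig_m)
rhs = beta * np.dot(lam - mu, e_mu - e_lam)
assert np.isclose(lhs, rhs)             # the equality in Eq. (A4)
```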

The statement of Proposition A.2 allows us to obtain quantitative estimates on how well a given Gibbs state approximates another one in terms of the expectation values of known observables. In particular, a simple application of Hölder's inequality shows that if two Gibbs states are such that $|\mathrm{tr}[E_i(\sigma(\mu) - \sigma(\lambda))]| \le \epsilon$, then the sum of their relative entropies is at most

$$\beta|\langle\lambda - \mu|e(\lambda) - e(\mu)\rangle| \le \beta\|\lambda - \mu\|_{\ell_\infty}\epsilon m \le 2m\epsilon\beta, \tag{A5}$$

where the outer bound arises from our assumption that $\|\lambda\|_{\ell_\infty}, \|\mu\|_{\ell_\infty} \le 1$. Moreover, it is straightforward to relate the difference of the target function in Eq. (A3) evaluated at two vectors to the difference of relative entropies between the target state and their corresponding Gibbs states:

Lemma A.1. Let $\sigma(\lambda) \in \mathcal{D}_{d^n}$ be a Gibbs state with respect to a set of observables $\mathcal{E}$ at inverse temperature $\beta > 0$ and define, for any $\mu \in \mathbb{R}^m$,

$$f(\mu) := \log(\mathcal{Z}(\mu)) + \beta\sum_{i=1}^m e_i(\lambda)\mu_i. \tag{A6}$$

Then for any two vectors $\mu, \xi \in \mathbb{R}^m$ with $\|\mu\|_{\ell_\infty}, \|\xi\|_{\ell_\infty} \le 1$:

$$f(\mu) - f(\xi) = D(\sigma(\lambda)\|\sigma(\mu)) - D(\sigma(\lambda)\|\sigma(\xi)).$$

Proof. The proof follows from straightforward manipulations.

Thus, we see that a decrease of the target function $f$ when solving the max entropy problem is directly connected to a decrease of the relative entropy between the target state and the current iterate. We will later use this to show the convergence of gradient descent for solving the max entropy problem with arbitrary $\mathcal{E}$. However, before that we discuss how the convergence of the state $\sigma(\mu)$ to $\sigma(\lambda)$ is related to the convergence of the parameters $\mu$ to $\lambda$.

1. Strong convexity and convergence guarantees

The maximum entropy problem (A2) being a convex problem, it should come as no surprise that properties of the Hessian of the function being optimized are vital to understanding its complexity and stability. For the maximum entropy problem, the Hessian at a point is given by a generalized covariance matrix corresponding to the underlying Gibbs state. As the results of [5] showcase, the eigenvalues of such covariance matrices govern both the stability of Eq. (A3) with respect to µ and the convergence of gradient descent to solve it. To see why, we recall some basic notions of optimization of convex functions and refer to [40] for an overview.

Definition A.2 (Strong convexity). Let $C \subset \mathbb{R}^m$ be a convex set. A twice differentiable function $f : C \to \mathbb{R}$ is called strongly convex with parameters $U, L > 0$ if we have for all $x \in C$ that:

$$UI \ge \nabla^2 f(x) \ge LI.$$

The optimization of strongly convex functions is well understood. Indeed, we have:

Proposition A.3. Let $C \subset \mathbb{R}^m$ be a convex set and $f : C \to \mathbb{R}$ be strongly convex with parameters $L, U$ as in the definition above. Then, for all $\epsilon > 0$, the optimal value $\alpha := \min_{x\in C} f(x)$ is achieved up to error $\epsilon$ by the gradient descent algorithm initiated at $x_0 \in C$ with step size $U^{-1}$ after $S$ steps for

$$S \le \frac{U}{L}\log\Big(\frac{f(x_0) - \alpha}{\epsilon}\Big). \tag{A7}$$

Moreover, the gradient norm satisfies $\|\nabla f(x_k)\|_{\ell_2}^2 \le \delta$ after $S_\nabla$ steps with

$$S_\nabla \le \frac{U}{L}\log\Big(\frac{2L(f(x_0) - \alpha)}{\delta}\Big).$$

Finally, we have for all $\mu, \lambda \in C$ that:

$$\|\mu - \lambda\|_{\ell_2} \le L^{-1}\|\nabla f(\mu) - \nabla f(\lambda)\|_{\ell_2}. \tag{A8}$$

Proof. These are all standard results which can be found e.g. in [40, Section 9].
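A minimal sketch of the gradient descent scheme of Proposition A.3, with step size $U^{-1}$, on a toy strongly convex quadratic (the instance is an assumption for illustration only):

```python
import numpy as np

def gradient_descent(grad, x0, U, steps):
    """Plain gradient descent with step size 1/U, as in Proposition A.3."""
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        x = x - grad(x) / U
    return x

# Toy instance f(x) = x^T A x / 2 with L*I <= A <= U*I, so L = 1, U = 10.
A = np.diag([1.0, 3.0, 10.0])
x_min = gradient_descent(lambda x: A @ x, x0=[1.0, 1.0, 1.0], U=10.0, steps=300)
assert np.allclose(x_min, 0.0, atol=1e-8)   # converges to the minimizer x = 0
```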

To see the relevance of these results for the maximum entropy problem, we recall the following Lemma:

Lemma A.2. Let $C = B_{\ell_\infty}(0, 1) \subset \mathbb{R}^m$, let $\sigma(\lambda) \in \mathcal{D}_{d^n}$ be a Gibbs state with respect to a set of operators $\mathcal{E}$ at inverse temperature $\beta$ and define $f : C \to \mathbb{R}$ as in Eq. (A6). Then:

$$(\nabla f(\mu))_i = \beta\,\mathrm{tr}[(\sigma(\lambda) - \sigma(\mu))E_i] \tag{A9}$$

and

$$(\nabla^2 f(\mu))_{ij} = \frac{\beta^2}{2}\mathrm{tr}\big[\{E_j, \Phi_{H(\mu)}(E_i)\}\sigma(\mu)\big] - \beta^2 e_i(\mu)e_j(\mu), \tag{A10}$$

with

$$\Phi_{H(\mu)}(E_i) = \int_{-\infty}^{+\infty}\nu_\beta(t)\,e^{-iH(\mu)t}E_i e^{iH(\mu)t}\,dt,$$

where $\nu_\beta(t)$ is a probability density function whose Fourier transform is given by:

$$\hat{\nu}_\beta(\omega) = \frac{2\tanh(\beta\omega/2)}{\beta\omega}.$$

Proof. The quantum belief propagation theorem [47] states that:

$$\frac{\partial}{\partial\lambda_i}e^{-\beta H(\lambda)} = -\frac{\beta}{2}\big\{e^{-\beta H(\lambda)}, \Phi_{H(\lambda)}(E_i)\big\}.$$

The claim then follows from a simple computation.

Thus, we see that in order to compute the gradient of the target function $f$ for the maximum entropy problem, we simply need to compute the expectation values of the observables $\mathcal{E}$ on the current state and on the target state. Moreover, the Hessian is given by a generalized covariance matrix of the quantum Gibbs state. That this should indeed be interpreted as a covariance matrix is most easily seen by considering commuting Hamiltonians. Then indeed we have:

$$(\nabla^2 f(\mu))_{ij} = \beta^2\big[\mathrm{tr}[\sigma(\mu)E_iE_j] - e_i(\mu)e_j(\mu)\big].$$

For any Gibbs state it holds that:

Proposition A.4. For all $\mu \in B_{\ell_\infty}(0, 1) \subset \mathbb{R}^m$, inverse temperature $\beta > 0$ and set of operators $\mathcal{E}$ of cardinality $m$, we have:

$$\nabla^2 f(\mu) \le 2\beta^2 m\,I.$$

Proof. Note that

$$\big|\mathrm{tr}\big[E_i\sigma(\mu)e^{iH(\mu)t}E_je^{-iH(\mu)t}\big]\big| \le 1,$$

by Hölder's inequality, the submultiplicativity and unitary invariance of the operator norm and the fact that $\|E_i\|_\infty \le 1$. Similarly, we have that $|e_i(\mu)|, |e_j(\mu)| \le 1$. Thus, by Lemma A.2, $|(\nabla^2 f(\mu))_{ij}| \le 2\beta^2$. As $\nabla^2 f(\mu)$ is an $m\times m$ matrix, it follows from Gershgorin's circle theorem that $\nabla^2 f(\mu) \le 2\beta^2 mI$.

The proof above also showcases how exponential decay of correlations can be used to sharpen estimates on the maximal eigenvalue of $\nabla^2 f$, since in that case $(\nabla^2 f(\mu))_{ij}$ will have exponentially decaying entries. We discuss this in more detail when we focus on many-body states, for which we also consider the more challenging question of lower bounds.

2. Convergence with approximate gradient computation and expectation values

Proposition A.3 already establishes the convergence of gradient descent whenever we can compute the gradient exactly and have a bound on $L$. Moreover, we see from Lemma A.2 that, in order to compute the gradient of the function $f$ above, it suffices to estimate local expectation values. Furthermore, it is a standard result that gradient descent is strictly decreasing for strictly convex problems [40]. However, in many settings it is only possible, or desirable, to compute the expectation values of quantum Gibbs states approximately. Moreover, the expectation values of the target state are only known up to statistical fluctuations. It is then not too difficult to see that gradient descent still offers convergence guarantees if we only compute the gradient approximately. We state the exact convergence guarantees and precision requirements for completeness.

Theorem A.1 (Computational complexity and convergence guarantees). Let $\sigma(\lambda) \in \mathcal{D}_{d^n}$ be a quantum Gibbs state at inverse temperature $\beta$ with respect to a set of operators $\mathcal{E}$ and let $C_{\mathcal{E}}$ be the computational cost of computing an estimate $e'(\mu)$ satisfying

$$\|e'(\mu) - e(\mu)\|_{\ell_2} \le \delta_\mu$$

for $\mu \in B_{\ell_\infty}(0, 1)$ and $\delta_\mu > 0$. Moreover, assume that we are given an estimate $e'(\lambda)$ of $e(\lambda)$ satisfying

$$\|e(\lambda) - e'(\lambda)\|_{\ell_\infty} \le \epsilon \tag{A11}$$

and that the log-partition function is strongly convex with parameters $U, L$. Then gradient descent starting at $\mu_0 = 0$ with step size $\frac{1}{cU}$ and input data $e'(\lambda)$ converges to a state $\sigma(\mu_*)$ satisfying:

$$\|\sigma(\lambda) - \sigma(\mu_*)\|_{\mathrm{tr}}^2 \le D(\sigma(\lambda)\|\sigma(\mu_*)) + D(\sigma(\mu_*)\|\sigma(\lambda)) = \mathcal{O}\big(\beta\delta_\mu\min\{\sqrt{m}, \beta L^{-1}\delta_\mu\} + \beta\epsilon\min\{1, L^{-1}\beta\epsilon\}m\big)$$

in

$$\mathcal{O}\Big(\min\Big\{UC_{\mathcal{E}}\beta^{-2}n\log(d)\delta_\mu^{-2},\ \frac{UC_{\mathcal{E}}}{L}\log(n\epsilon^{-1})\Big\}\Big)$$

time.

We will prove this theorem at the end of this section, since we will first need some auxiliary statements. The reader familiar with basic concepts from convex optimization should feel comfortable skipping them.

Proposition A.5 (Convergence of gradient descent with constant relative precision). Let $\sigma(\lambda) \in \mathcal{D}_{d^n}$ be a quantum Gibbs state at inverse temperature $\beta$ with respect to a set of operators $\mathcal{E}$, and for a Gibbs state $\sigma(\mu)$ let $z(\mu) \in \mathbb{R}^m$ be a vector such that

$$\|z(\mu) - \beta(e(\lambda) - e(\mu))\|_{\ell_2} \le \frac{\beta}{4c}\|e(\mu) - e(\lambda)\|_{\ell_2} \tag{A12}$$

for some $c > 10$. Then we have that:

$$D\Big(\sigma(\lambda)\Big\|\sigma\Big(\mu - \frac{z(\mu)}{cU}\Big)\Big) \le D(\sigma(\lambda)\|\sigma(\mu)) - \frac{9\beta^2\|e(\mu) - e(\lambda)\|_{\ell_2}^2}{10\,cU}, \tag{A13}$$

where $U$ is a uniform bound on the operator norm of the Hessian of the function $f$ defined in Eq. (A6).

Proof. From a Taylor expansion and strong convexity we have for any two points $\mu, \xi$ that:

$$f(\xi) \le f(\mu) + \langle\nabla f(\mu)|\xi - \mu\rangle + \frac{U}{2}\|\xi - \mu\|_{\ell_2}^2.$$

Note that $\nabla f(\mu) = \beta(e(\lambda) - e(\mu))$ by Eq. (A9). Setting $\xi = \mu - \frac{z}{cU} = \mu + \frac{1}{cU}(-\nabla f(\mu) + \nabla f(\mu) - z)$ we obtain:

$$f\Big(\mu - \frac{z}{cU}\Big) \le f(\mu) - \frac{1}{cU}\|\nabla f(\mu)\|_{\ell_2}^2 + \frac{1}{cU}\langle\nabla f(\mu)|\nabla f(\mu) - z\rangle + \frac{1}{2c^2U}\|\nabla f(\mu) - (\nabla f(\mu) - z)\|_{\ell_2}^2$$
$$\le f(\mu) - \frac{1}{cU}\|\nabla f(\mu)\|_{\ell_2}^2 + \frac{1}{cU}\|\nabla f(\mu)\|_{\ell_2}\|\nabla f(\mu) - z\|_{\ell_2} + \frac{1}{2c^2U}\big(\|\nabla f(\mu)\|_{\ell_2} + \|\nabla f(\mu) - z\|_{\ell_2}\big)^2,$$

where in the last step we used the Cauchy-Schwarz inequality. By our assumption in Eq. (A12) for $z \equiv z(\mu)$ we have $\|\nabla f(\mu) - z\|_{\ell_2} \le \frac{1}{4c}\|\nabla f(\mu)\|_{\ell_2}$, so:

$$f\Big(\mu - \frac{z}{cU}\Big) \le f(\mu) - \frac{1}{cU}\|\nabla f(\mu)\|_{\ell_2}^2 + \frac{1}{4c^2U}\|\nabla f(\mu)\|_{\ell_2}^2 + \frac{(1 + (4c)^{-1})^2}{2c^2U}\|\nabla f(\mu)\|_{\ell_2}^2 = f(\mu) - \frac{1}{cU}\Big(1 - \frac{1}{4c} - \frac{(1 + (4c)^{-1})^2}{2c}\Big)\|\nabla f(\mu)\|_{\ell_2}^2,$$

and it can be readily checked that $\frac{1}{4c} + \frac{(1 + (4c)^{-1})^2}{2c} \le \frac{1}{10}$ for $c \ge 10$. To conclude the proof, note that by Lemma A.1:

$$D\Big(\sigma(\lambda)\Big\|\sigma\Big(\mu - \frac{z}{cU}\Big)\Big) - D(\sigma(\lambda)\|\sigma(\mu)) = f\Big(\mu - \frac{z}{cU}\Big) - f(\mu),$$

and insert $\|\nabla f(\mu)\|_{\ell_2}^2 = \beta^2\|e(\mu) - e(\lambda)\|_{\ell_2}^2$.

Thus, we see that gradient descent makes constant progress even if we only compute the derivative up to constant relative precision. We now show how to pick a stopping criterion based on approximate computations of the gradient which ensures convergence in polynomial time.

Proposition A.6. Let $\sigma(\lambda) \in \mathcal{D}_{d^n}$ be a quantum Gibbs state at inverse temperature $\beta$ with respect to a set of operators $\mathcal{E}$. Suppose that at each time step $t$ of gradient descent we compute an estimate $e'(\mu_t)$ of $e(\mu_t)$ that satisfies

$$\|e'(\mu_t) - e(\mu_t)\|_{\ell_2} \le \delta_\mu,$$

and set the stopping criterion to be:

$$\|e(\lambda) - e'(\mu_*)\|_{\ell_2} < (4c + 1)\delta_\mu$$

for some constant $c > 10$. Then gradient descent starting at $\mu_0 = 0$ with update rule $\mu_{t+1} := \mu_t - \frac{\beta(e(\lambda) - e'(\mu_t))}{cU}$ will converge to a state $\sigma(\mu_*)$ satisfying:

$$\|\sigma(\lambda) - \sigma(\mu_*)\|_{\mathrm{tr}}^2 \le D(\sigma(\lambda)\|\sigma(\mu_*)) + D(\sigma(\mu_*)\|\sigma(\lambda)) \le 2(4c + 1)\beta\delta_\mu\sqrt{m}$$

after at most $\mathcal{O}(U\beta^{-2}n\log(d)\delta_\mu^{-2})$ iterations.

Proof. First, we show that the relative precision bound required for Proposition A.5 holds under these assumptions. By our choice of the stopping criterion, at each time step we have the property that, while we did not stop,

$$\|e(\lambda) - e(\mu_t)\|_{\ell_2} = \|(e(\lambda) - e'(\mu_t)) + (e'(\mu_t) - e(\mu_t))\|_{\ell_2} \ge \|e'(\mu_t) - e(\lambda)\|_{\ell_2} - \|e(\mu_t) - e'(\mu_t)\|_{\ell_2} \ge (4c + 1)\delta_\mu - \delta_\mu = 4c\delta_\mu$$

by the reverse triangle inequality. As we assumed that $\|e'(\mu_t) - e(\mu_t)\|_{\ell_2} \le \delta_\mu$, it follows that $(4c)^{-1}\|e(\lambda) - e(\mu_t)\|_{\ell_2} \ge \|(e'(\mu_t) - e(\lambda)) - (e(\mu_t) - e(\lambda))\|_{\ell_2}$. Multiplying the inequality by $\beta$, we see that the conditions of Proposition A.5 are satisfied for $z(\mu_t) := \beta(e(\lambda) - e'(\mu_t))$. Let us now show the convergence. By our choice of initial point, we have that:

$$D(\sigma(\lambda)\|\sigma(0)) \le n\log(d).$$

Now, suppose that we did not stop before $T$ iterations. It follows from a telescopic sum argument and Proposition A.5 that:

$$D(\sigma(\lambda)\|\sigma(\mu_T)) \le n\log(d) - T\,\frac{9\beta^2(4c + 1)^2\delta_\mu^2}{10\,cU},$$

since $\|e'(\mu_t) - e(\lambda)\|_{\ell_2} \ge (4c + 1)\delta_\mu$ at all iterations because we did not halt. As the relative entropy is positive, it follows that $T = \mathcal{O}(\beta^{-2}Uc\,n\log(d)\delta_\mu^{-2})$ before the stopping criterion is met. The recovery guarantee whenever the stopping criterion is met follows from Proposition A.2, the Cauchy-Schwarz inequality and the equivalence of norms $\|\lambda - \mu\|_{\ell_2} \le \sqrt{m}\|\lambda - \mu\|_{\ell_\infty} \le 2\sqrt{m}$.

Since we proved in Proposition A.4 that $U = \mathcal{O}(\beta^2 m)$, it follows that the number of iterations is $\mathcal{O}(nm)$. Thus, we see that having a lower bound on the Hessian is not necessary to ensure convergence, but it can speed it up exponentially:

Proposition A.7 (Exponential convergence of gradient descent with approximate gradients). In the same setting as Proposition A.5 we have:

$$f\Big(\mu - \frac{z(\mu)}{cU}\Big) - f(\lambda) \le \Big(1 - \frac{18L}{10cU}\Big)(f(\mu) - f(\lambda)). \tag{A14}$$

In particular, gradient descent with approximate gradient computations starting at $\mu_0 = 0$ converges after $\mathcal{O}\big(\frac{U}{L}\log(n\epsilon^{-1})\big)$ iterations to $\mu$ such that $f(\mu) - f(\lambda) \le \epsilon$.

Proof. For any strongly convex function $f$ and points $\mu, \xi \in C$ we have that:

$$f(\xi) \ge f(\mu) + \langle\nabla f(\mu)|\xi - \mu\rangle + \frac{L}{2}\|\mu - \xi\|_{\ell_2}^2.$$

As explained in [40, Chapter 9], the R.H.S. of the equation above is a convex quadratic function of $\xi$ for $\mu$ fixed. One can then easily show that its minimum is achieved at $\tilde{\xi} = \mu - \frac{1}{L}\nabla f(\mu)$. From this we obtain:

$$f(\lambda) \ge f(\mu) - \frac{1}{2L}\|\nabla f(\mu)\|_{\ell_2}^2 = f(\mu) - \frac{\beta^2\|e(\mu) - e(\lambda)\|_{\ell_2}^2}{2L},$$

where the last identity follows from Eq. (A9). By subtracting $f(\lambda)$ from both sides of the inequality (A13) in Proposition A.5 and rearranging the terms we have that:

$$f\Big(\mu - \frac{z(\mu)}{cU}\Big) - f(\lambda) \le f(\mu) - f(\lambda) - \frac{9\beta^2\|e(\mu) - e(\lambda)\|_{\ell_2}^2}{10cU} \le f(\mu) - f(\lambda) - \frac{18L}{10cU}(f(\mu) - f(\lambda)) = \Big(1 - \frac{18L}{10cU}\Big)(f(\mu) - f(\lambda)).$$

This yields the claim in Eq. (A14). To obtain the second claim, note that applying Eq. (A14) iteratively yields that after $k$ iterations we have, for $\mu_k = \mu_{k-1} - \frac{z(\mu_{k-1})}{cU}$,

$$f(\mu_k) - f(\lambda) \le \Big(1 - \frac{18L}{10cU}\Big)^k(f(0) - f(\lambda)).$$

By our choice of initial point and Lemma A.1 we have that $f(0) - f(\lambda) = \mathcal{O}(n)$, which yields the claim after solving for $k$ and noting that $-\log\big(1 - \frac{18L}{10cU}\big) = \Omega\big(\frac{L}{U}\big)$.

Remark A.1 (Comparison to mirror descent). It is also worth noting that the convergence guarantees of Proposition A.6 and the update rules of gradient descent are very similar to those of mirror descent with the von Neumann entropy as potential, another algorithm used for learning quantum states [48-51]. In this context, mirror descent would use a similar update rule. However, instead of computing the whole gradient, i.e. all expectation values of the basis, for one iteration, mirror descent just requires us to find one $i$ such that $|e_i(\lambda) - e_i(\mu)| \ge \delta$ and updates the Hamiltonian in the direction $i$. This implies that the algorithm can be run online while we still estimate some other $e_i$, but we will not analyse this variation in more detail here.

Finally, we have assumed so far that we know the expectation values of the target state, $e(\lambda)$, exactly. However, it follows straightforwardly from Proposition A.6 that knowing each expectation value up to an error $\epsilon$ is sufficient to ensure that the additional error due to statistical fluctuations is at most $\epsilon m$. More precisely, if we have that $\|e(\lambda) - e'(\lambda)\|_{\ell_\infty} \le \epsilon$ for some $\epsilon > 0$, then any Gibbs state $\sigma(\mu_*)$ satisfying $\|e(\mu_*) - e'(\lambda)\|_{\ell_2} \le \delta$ satisfies:

$$D(\sigma(\lambda)\|\sigma(\mu_*)) + D(\sigma(\mu_*)\|\sigma(\lambda)) \le 2\beta\delta\sqrt{m} + 2\beta\epsilon m$$

by Proposition A.2 and a Cauchy-Schwarz inequality. With these statements at hand we are finally ready to prove Thm. A.1.

Proof of Thm. A.1. We showed in Propositions A.6 and A.7 that under the conditions outlined above, the maximum entropy problem converges to a $\mu_*$ that satisfies:

$$\|e'(\lambda) - e(\mu_*)\|_{\ell_2} \le (4c + 1)\delta_\mu.$$

Without making any assumptions on $L$ we can then bound

$$D(\sigma(\lambda)\|\sigma(\mu_*)) + D(\sigma(\mu_*)\|\sigma(\lambda)) = \beta|\langle\lambda - \mu_*|e(\lambda) - e(\mu_*)\rangle| \le \beta\big(|\langle\lambda - \mu_*|e(\lambda) - e'(\lambda)\rangle| + |\langle\lambda - \mu_*|e'(\lambda) - e(\mu_*)\rangle|\big) \le 2(4c + 1)\beta\delta_\mu\sqrt{m} + 2\beta\epsilon m$$

by Hölder's inequality and our assumptions on $e'(\lambda)$. Let us now discuss how strong convexity can improve these estimates. First note that by strong convexity and Cauchy-Schwarz we have:

$$D(\sigma(\lambda)\|\sigma(\mu_*)) + D(\sigma(\mu_*)\|\sigma(\lambda)) = \beta|\langle\lambda - \mu_*|e(\lambda) - e(\mu_*)\rangle| \le \beta\|e(\lambda) - e(\mu_*)\|_{\ell_2}\|\lambda - \mu_*\|_{\ell_2} \le L^{-1}\beta^2\|e(\lambda) - e(\mu_*)\|_{\ell_2}^2 \le L^{-1}\beta^2\big(\|e'(\lambda) - e(\mu_*)\|_{\ell_2} + \|e(\lambda) - e'(\lambda)\|_{\ell_2}\big)^2,$$

which yields the claim.

In short, we see that we can perform the recovery by simply computing the gradient approximately. In particular, as already hinted at in [5], this implies that recent methods developed to approximately compute the partition function of high-temperature quantum Gibbs states can be used to perform the postprocessing in polynomial time [21-24]. This and other methods to compute the gradient are discussed in more detail in Sec. E. Furthermore, it should be noted that usually $L = \Omega(\beta^2)$ in the high-temperature regime, making the bound independent of $\beta$ for such states. We refer to Sec. D for a summary of the cases for which bounds on $L$ are known.
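The following sketch assembles the earlier snippets into the approximate-gradient scheme of Proposition A.6. Here the exact expectation values stand in for the output of an approximate Gibbs sampler or partition-function algorithm; `E`, `m`, `beta`, `e_lam` and `gibbs_state` are the illustrative objects defined in the previous snippets:

```python
def max_entropy_solve(e_target, beta, U, c=11.0, delta=1e-3, max_iter=100_000):
    """Gradient descent of Prop. A.6: mu <- mu - beta*(e_target - e(mu))/(c*U),
    stopping once ||e_target - e(mu)||_2 < (4c + 1)*delta."""
    mu = np.zeros(m)
    for _ in range(max_iter):
        sig = gibbs_state(mu, beta)
        e_mu = np.array([np.trace(sig @ Ei).real for Ei in E])
        if np.linalg.norm(e_target - e_mu) < (4 * c + 1) * delta:
            break
        mu = mu - beta * (e_target - e_mu) / (c * U)
    return mu

U = 2 * beta**2 * m            # upper bound on the Hessian from Prop. A.4
mu_star = max_entropy_solve(e_lam, beta, U)
```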

Appendix B: Lipschitz constants and transportation cost inequalities

In this section, we identify conditions under which it is possible to estimate all expectation values of $k$-local observables up to an error $\epsilon$ by measuring $\mathcal{O}(\mathrm{poly}(k, \log(n), \epsilon^{-1}))$ copies of the state, where $n$ is the system size, which constitutes an exponential improvement in some regimes. To obtain this result, we combine the maximum entropy method introduced in Section A with techniques from quantum optimal transport. In order to formalize and prove the claimed result, we resort to transportation cost inequalities and the notion of Lipschitz constants of observables, which we now introduce.

1. Lipschitz constants and Wasserstein metrics

Transportation cost inequalities, introduced by Talagrand in the seminal paper [27], constitute one of the strongest tools available to show concentration of measure inequalities. In the quantum setting, their study was initiated in [14, 16, 52, 53]. Here we are going to show how they can also be used in the context of quantum tomography. On a high level, a transportation cost inequality for a state $\sigma$ quantifies how well the relative entropy with respect to another state $\rho$ controls the extent to which the expectation values of sufficiently regular observables can differ on the two states. As maximum entropy methods allow for a straightforward control of the convergence of the learning procedure in relative entropy (cf. Section A), the two can be combined to derive strong recovery guarantees. But first we need to define what we mean by a regular observable. We start with a short discussion of Lipschitz constants and the Wasserstein-1 distance. One way to obtain an intuitive grasp of these concepts is to first recall the variational formulation of the trace distance of two quantum states $\sigma, \rho$:

$$\|\rho - \sigma\|_{\mathrm{tr}} = \sup_{P = P^\dagger,\ \|P\|_\infty \le 1}\mathrm{tr}[P(\rho - \sigma)].$$

Seeing probability distributions as diagonal quantum states, we recover the variational formulation of the total variation distance by noting that we may restrict to diagonal operators $P$. Thus, the total variation distance quantifies by how much the expectation values of arbitrary bounded functions can differ under the two distributions. However, in many situations we are not interested in expectation values of arbitrary bounded observables, but rather in observables that are sufficiently regular. E.g., most observables of physical interest are (quasi-)local. Thus, it is natural to look for distance measures between quantum states that capture the notion that two states do not differ by much when restricting to expectation values of sufficiently regular observables. These concerns are particularly relevant in the context of tomography protocols, as they should be designed to efficiently obtain a state that reflects the expectation values of extensive observables of the system. As we will see, one of the ways of ensuring that the sample complexity of the tomography algorithm reflects the regularity of the observables we wish to recover is by demanding a good recovery in the Wasserstein distance of order 1 [14, 16]. In the classical setting [29, Chapter 3], one way to define a Wasserstein-1 distance between two probability distributions is by replacing the optimization over all bounded diagonal observables by one over those that are sufficiently regular: given a metric $d$ on a sample space $\Omega$, we define the Lipschitz constant of a function $f : \Omega \to \mathbb{R}$ to be:

$$\|f\|_{\mathrm{Lip}} := \sup_{x,y\in\Omega}\frac{|f(x) - f(y)|}{d(x, y)}.$$

Denoting the Wasserstein-1 distance by W1, it is given for two probability measures p, q on Ω by

$$W_1(p, q) := \sup_{f : \|f\|_{\mathrm{Lip}} \le 1}|E_p(f) - E_q(f)|. \tag{B1}$$

That is, this metric quantifies by how much the expectation values of sufficiently regular functions can vary under p and q, in clear analogy to the variational formulation of the trace distance. We refer to [28, 29] for other interpretations and formulations of this metric.
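The classical $W_1$ of Eq. (B1) on a finite metric space is a linear program. A small sketch using SciPy (the Hamming metric on two bits is an illustrative choice):

```python
import numpy as np
from scipy.optimize import linprog
from itertools import combinations

def wasserstein1(p, q, d):
    """W1(p, q) on a finite metric space via the dual LP (B1):
    maximize <f, p - q> subject to |f_i - f_j| <= d_ij."""
    n = len(p)
    A, b = [], []
    for i, j in combinations(range(n), 2):
        row = np.zeros(n); row[i], row[j] = 1.0, -1.0
        A.append(row);  b.append(d[i, j])       # f_i - f_j <= d_ij
        A.append(-row); b.append(d[i, j])       # f_j - f_i <= d_ij
    # linprog minimizes, so we minimize <f, q - p> and flip the sign
    res = linprog(q - p, A_ub=np.array(A), b_ub=np.array(b),
                  bounds=[(None, None)] * n)
    return -res.fun

# Hamming metric on the two-bit strings 00, 01, 10, 11
d = np.array([[0, 1, 1, 2], [1, 0, 2, 1], [1, 2, 0, 1], [2, 1, 1, 0]], float)
p = np.array([1.0, 0, 0, 0]); q = np.array([0, 0, 0, 1.0])
print(wasserstein1(p, q, d))   # 2.0: all mass moves across both bit flips
```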

a. Quantum Hamming Wasserstein distance

It is not immediately clear how to generalize these concepts to noncommutative spaces. There are by now several definitions of transport metrics for quantum states [14, 16-18]. As already noted in the main text, De Palma et al. defined the Lipschitz constant of an observable $O \in \mathcal{M}_{d^n}$ as [16]:

$$\|O\|_{\mathrm{Lip},\square} = \sqrt{n}\max_{1\le i\le n}\ \max_{\substack{\rho,\sigma\in\mathcal{D}_{d^n}\\ \mathrm{tr}_i[\rho] = \mathrm{tr}_i[\sigma]}}\mathrm{tr}[O(\rho - \sigma)]. \tag{B2}$$

That is, the Lipschitz constant quantifies the amount by which the expectation value of an observable can change when evaluated on two states that only differ on one site. This is in analogy with the Lipschitz constants induced by the Hamming distance on the hypercube, so we denote it with $\square$. Note that in our definition we added the $\sqrt{n}$ factor, which will turn out to be convenient later. Armed with this definition, we can immediately obtain an analogous definition of the Wasserstein distance in Eq. (B1) for two states $\rho, \sigma$:

$$W_{1,\square}(\rho, \sigma) := \sup_{O = O^\dagger,\ \|O\|_{\mathrm{Lip},\square}\le 1}\mathrm{tr}[O(\rho - \sigma)]. \tag{B3}$$

The authors of [16] also put forth the following equivalent expression for the norm:

$$W_{1,\square}(\rho, \sigma) = \frac{1}{2\sqrt{n}}\min\Big\{\sum_{i=1}^n\|X^{(i)}\|_1 : \rho - \sigma = \sum_{i=1}^n X^{(i)},\ X^{(i)}\in\mathcal{M}_{d^n}^{sa},\ \mathrm{tr}_i[X^{(i)}] = 0\Big\}. \tag{B4}$$

It follows from an application of Hölder's inequality combined with the variational formulation in Eq. (B2) that $\|O\|_{\mathrm{Lip},\square} \le 2\sqrt{n}\|O\|_\infty$. However, it can be the case that $\|O\|_{\mathrm{Lip},\square} \ll \sqrt{n}\|O\|_\infty$; such observables are exactly those that should be thought of as regular. This is because it signals that changing the state locally leads to significantly smaller changes to the expectation value of the observable than global changes. Two examples of this behaviour are given by the observables:

$$O_1 = \sum_{i=1}^n I_i\otimes\bigotimes_{j\neq i}Z_j, \quad\text{and}\quad O_2 = \sum_{i=1}^n Z_i.$$

Clearly, $\|O_1\|_\infty = \|O_2\|_\infty = n$. On the other hand, by considering the states $\rho = |0\rangle\langle 0|^{\otimes n}$ and $\sigma = |1\rangle\langle 1|\otimes|0\rangle\langle 0|^{\otimes n-1}$, we see that $\|O_1\|_{\mathrm{Lip},\square} \ge \sqrt{n}(2n - 2)$ whereas $\|O_2\|_{\mathrm{Lip},\square} = 2\sqrt{n}$. More generally, it is not difficult to see that if $O = \sum_{i=1}^n O_i$ with $\|O_i\|_\infty \le 1$ and we denote by $\mathrm{supp}(O_i)$ the qudits on which $O_i$ acts nontrivially, then:

$$\|O\|_{\mathrm{Lip},\square} \le 2\sqrt{n}\max_{1\le j\le n}\sum_i|\mathrm{supp}(O_i)\cap\{j\}|,$$

that is, the bound is governed by the maximal number of the supports of the $O_i$ intersecting any single qudit. From these examples we see that for local observables, the ratio of the operator norm and the Lipschitz constant reflects the locality of the observable.
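The support-intersection bound above is easy to evaluate. A small sketch, with `supports` playing the role of the sets $\mathrm{supp}(O_i)$:

```python
import numpy as np

def lip_box_upper_bound(supports, n):
    """Upper bound 2*sqrt(n)*max_j #{i : j in supp(O_i)} on ||O||_{Lip,box}
    for O = sum_i O_i with ||O_i||_inf <= 1."""
    overlap = [sum(1 for s in supports if j in s) for j in range(n)]
    return 2 * np.sqrt(n) * max(overlap)

n = 10
print(lip_box_upper_bound([{i} for i in range(n)], n))             # O_2-like: 2*sqrt(n)
print(lip_box_upper_bound([{i, i + 1} for i in range(n - 1)], n))  # pairs: 4*sqrt(n)
```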

b. Quantum differential Wasserstein distance

The Wasserstein distance $W_{1,\square}$ generalizes the classical Ornstein distance, that is, the Wasserstein distance corresponding to the Hamming distance on bit strings. Another definition of a Lipschitz constant and its attached Wasserstein distance was put forth in [14], where the construction is based on a differential structure that bears more resemblance to the Lipschitz constant of a differentiable function on a continuous sample space, e.g. a smooth manifold [54]. Let us now define the notion of a noncommutative differential structure (see [13]):

Definition B.1 (Differential structure). A set of operators $L_k \in \mathcal{M}_{d^n}$ and constants $\omega_k \in \mathbb{R}$ defines a differential structure $\{L_k, \omega_k\}_{k\in\mathcal{K}}$ for a full rank state $\sigma \in \mathcal{D}_{d^n}$ if:

1. $\{L_k\}_{k\in\mathcal{K}} = \{L_k^\dagger\}_{k\in\mathcal{K}}$;

2. $\{L_k\}_{k\in\mathcal{K}}$ consists of eigenvectors of the modular operator $\Delta_\sigma(X) := \sigma X\sigma^{-1}$ with

$$\Delta_\sigma(L_k) = e^{-\omega_k}L_k; \tag{B5}$$

3. $\|L_k\|_\infty \le 1$.

Such a differential structure can be used to provide the set of matrices with a Lipschitz constant that is tailored to $\sigma$; see e.g. [13, 14] for more on this. In order to distinguish that constant from the one defined in (B2), we will refer to it as the differential Lipschitz constant and denote it by $\|X\|_{\mathrm{Lip},\nabla}$. It is given by:

$$\|X\|_{\mathrm{Lip},\nabla} := \Big(\sum_{k\in\mathcal{K}}(e^{-\omega_k/2} + e^{\omega_k/2})\|[L_k, X]\|_\infty^2\Big)^{1/2}. \tag{B6}$$

The quantity $[L_k, X]$ should be interpreted as a partial derivative and is sometimes denoted by $\partial_kX$ for that reason. Then, the gradient of a matrix $A$, denoted by $\nabla A$ with a slight abuse of notation, refers to the vector of operator-valued coordinates $(\nabla A)_i = \partial_iA$. For ease of notation, we will denote the differential structure by the couple $(\nabla, \sigma)$. The notion of a differential structure is also intimately connected to that of the generator of a quantum dynamical semigroup converging to $\sigma$ [13], and properties of that semigroup immediately translate to properties of the metric. This is because the differential structure can be used to define an operator that behaves in an analogous way to the Laplacian on a smooth manifold, which in turn induces the heat semigroup. We refer to [13, 14] for more details. To illustrate the differential structure version of the Lipschitz constant, it is instructive to think of the maximally mixed state. In this case, one possible choice consists of picking the $L_k$ to be all $1$-local Pauli strings and $\omega_k = 0$. Then the Lipschitz constant turns out to be given by:

$$\|X\|_{\mathrm{Lip},\nabla} = \Big(\sum_{k\in\mathcal{K}}\|P_kXP_k - X\|_\infty^2\Big)^{1/2}, \tag{B7}$$

where the $P_k$ are all $1$-local Pauli matrices. Thus, we see that this measures how much the operator changes if we act locally with a Pauli unitary on it. If we think of an operator as a function and of conjugation with a Pauli as moving in a direction, the formula above indeed looks like a derivative. In fact, it is possible to make this connection rigorous, see [13]. As before, the definition in Eq. (B6) yields a metric on states by duality:

$$W_{1,\nabla}(\rho, \sigma) := \sup_{X = X^\dagger,\ \|X\|_{\mathrm{Lip},\nabla}\le 1}|\mathrm{tr}(X(\rho - \sigma))|.$$

It immediately follows from the definitions that for any observable $X$:

$$|\mathrm{tr}[X(\rho - \sigma)]| \le \|X\|_{\mathrm{Lip},\nabla}W_{1,\nabla}(\rho, \sigma). \tag{B8}$$

Although this geometric interpretation opens up other interesting mathematical connections, the differential Wasserstein distance has the downside of being state-dependent. It however induces a stronger topology than the quantum Hamming Wasserstein distance in some situations (see [18, Proposition 5]). In particular, the results of [18, Proposition 5] imply that for commuting Gibbs states a TC inequality for $W_{1,\nabla}$ implies the corresponding statement for $W_{1,\square}$.
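To illustrate Eq. (B7), here is a sketch evaluating the differential Lipschitz constant for the maximally mixed differential structure, reusing the hypothetical `pauli_on` helper from the Sec. A snippet; the observable is the regular $O_2$ from Sec. B 1 a:

```python
import numpy as np

Xp = np.array([[0, 1], [1, 0]], dtype=complex)
Yp = np.array([[0, -1j], [1j, 0]])
Zp = np.diag([1.0 + 0j, -1.0])

def lip_nabla_mixed(O, n):
    """||O||_{Lip,nabla} of Eq. (B7): L_k ranging over all 1-local Pauli
    strings with omega_k = 0 (maximally mixed differential structure)."""
    total = 0.0
    for site in range(n):
        for P in (Xp, Yp, Zp):
            Pk = pauli_on(n, site, P)
            total += np.linalg.norm(Pk @ O @ Pk - O, ord=2) ** 2
    return np.sqrt(total)

n = 3
O2 = sum(pauli_on(n, i, Zp) for i in range(n))
print(lip_nabla_mixed(O2, n))    # = sqrt(8 n): grows like sqrt(n), not n
```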

c. Wasserstein-like constant generated by quasi-local observables

As we will see in more detail in Section B 2, quasi-local observables are Lipschitz with respect to $\|\cdot\|_{\mathrm{Lip},\nabla}$ as well as $\|\cdot\|_{\mathrm{Lip},\square}$. This raises the question of defining a Wasserstein-like quantity for this specific class of observables. The reason for doing so will become clear in Section B 3, where we will be able to easily leverage some recent Gaussian concentration bounds for Gibbs states at high enough temperature [24] in order to prove transportation cost inequalities for noncommutative Gibbs states, even in the case of long-range interactions. Here, we consider the set $V := \{1, \dots, n\}$ of spins of local dimension $d$. Then, we define a class of quasi-local observables:

Definition B.2 ((k, g) quasi-local observables). Given $k \in \mathbb{N}$ and $g \ge 0$, an observable $X \in \mathcal{M}_{d^n}$ is said to be $(k, g)$ quasi-local if it can be decomposed as

$$X = \sum_{|A|\le k}X_A \quad\text{with}\quad \sum_{A\ni v}\|X_A\|_\infty \le g\ \text{ for all } v \in V,$$

where each $X_A$ is strictly supported on a set $A$ of cardinality $|A| \le k$. Notice that this decomposition is unique. We denote by $\mathcal{M}_{\mathrm{loc}}^{(k,g)}$ the space of $(k, g)$ quasi-local observables. Moreover, a Gibbs state $\sigma$ on $V$ is called $(k, g)$ quasi-local if it is the Gibbs state of a $(k, g)$ quasi-local Hamiltonian:

$$\sigma := \frac{e^{-\beta H}}{\mathcal{Z}}, \quad\text{where } \beta \ge 0,\ \mathcal{Z} := \mathrm{tr}\big[e^{-\beta H}\big],\ \text{for } H\ (k, g)\text{ quasi-local.}$$

Next, we introduce the concept of a quasi-local Wasserstein-like pseudometric:

Definition B.3 (Quasi-local Wasserstein pseudometric). The quasi-local Wasserstein pseudometric on quantum states in $\mathcal{D}_{d^n}$ is defined as:

$$W_{1,\mathrm{loc}}(\rho, \sigma) := \frac{1}{\sqrt{n}}\sup_{X\in\mathcal{M}_{\mathrm{loc}}^{(k,g)}}|\mathrm{tr}[X(\rho - \sigma)]|.$$

Note that this is indeed a pseudo-metric, as it cannot distinguish states such that all k-body marginals coincide. Thus, this last version of the Wasserstein distance is the weakest of the three discussed, as it only optimizes over a smaller set of observables.

2. Local evolution of Lipschitz observables

As already discussed in Subsections B 1 a and B 1 b when we defined $\|\cdot\|_{\mathrm{Lip},\nabla}$ and $\|\cdot\|_{\mathrm{Lip},\square}$, Lipschitz constants can be easily controlled by making assumptions on the locality of the operators. Indeed, if we apply a local circuit to a local observable, it is straightforward to control the growth of the Lipschitz constant in terms of the growth of the support of the observable under the circuit. More precisely, in [16, Proposition 13] the authors show such a statement for discrete time evolutions with exact lightcones: if we denote by $|L|$ the size of the largest lightcone of one qubit under a channel $\Phi$, then for any observable $O \in \mathcal{M}_{d^n}$, $\|\Phi^*(O)\|_{\mathrm{Lip},\square} \le 2|L|\|O\|_{\mathrm{Lip},\square}$. Here we will extend such arguments to evolutions under local Hamiltonians or Lindbladians. By resorting to Lieb-Robinson bounds, we show that the Lipschitz constants $\|\cdot\|_{\mathrm{Lip},\nabla}$ and $\|\cdot\|_{\mathrm{Lip},\square}$ of initially local observables evolving according to a quasi-local dynamics increase at most with the effective lightcone of the evolution. Thus, short-time dynamics and shallow-depth quantum channels can only mildly increase the Lipschitz constant. This further justifies the intuition that observables with small Lipschitz constant reflect physical observables.

Lieb-Robinson (LR) bounds in essence assert that the time evolution of local observables under (quasi-)local short-time dynamics has an effective lightcone. There are various formulations of Lieb-Robinson bounds. Reviewing those in detail is beyond the scope of this work and we refer to [55-58] and references therein for more details. For studying the behaviour of $\|\cdot\|_{\mathrm{Lip},\nabla}$ under local evolutions, the most natural formulation is the commutator version: the generator $\mathcal{L}$ of a quasi-local dynamics $t \mapsto \Phi_t = e^{t\mathcal{L}}$ on $n$ qudits arranged on a graph $G = (V, E)$, with graph distance $\mathrm{dist}$, is said to satisfy a LR bound with LR velocity $v$ if for any observable $O_A$ supported on a region $A$ and any other observable $O_B$ supported on a region $B$, we have:

$$\|[\Phi_t^*(O_A), O_B]\|_\infty \le c\,(e^{vt} - 1)\,g(\mathrm{dist}(A, B))\,\|O_A\|_\infty\|O_B\|_\infty, \tag{LR1}$$

for $g : \mathbb{N} \to \mathbb{R}_+$ a function such that $\lim_{x\to\infty}g(x) = 0$. We then have:

Proposition B.1 (Growth of differential Lipschitz constant for local evolutions). Let $(\nabla, \sigma)$ be a differential structure on $\mathcal{M}_{d^n}$ and let $O = \sum_i O_i$ be an observable with $\|O_i\|_\infty \le 1$. Let $A_i$ denote the support of each $O_i$ and $B_j$ that of each $L_j$. Moreover, let $t \mapsto \Phi_t$ be an evolution satisfying Eq. (LR1) and set $o_{i,j}(t) = 2$ if $A_i\cap B_j \neq \emptyset$ and $o_{i,j}(t) = c\,(e^{vt} - 1)\,g(\mathrm{dist}(A_i, B_j))$ otherwise. Then:

$$\|\Phi_t^*(O)\|_{\mathrm{Lip},\nabla}^2 \le \sum_{j\in\mathcal{K}}(e^{\omega_j} + e^{-\omega_j})\Big(\sum_i o_{i,j}(t)\Big)^2.$$

Proof. The proof follows almost by definition. We have:

$$\|\Phi_t^*(O)\|_{\mathrm{Lip},\nabla}^2 = \sum_{j\in\mathcal{K}}(e^{\omega_j} + e^{-\omega_j})\|[\Phi_t^*(O), L_j]\|_\infty^2.$$

By a triangle inequality we have that:

$$\|[\Phi_t^*(O), L_j]\|_\infty^2 \le \Big(\sum_i\|[\Phi_t^*(O_i), L_j]\|_\infty\Big)^2.$$

For any term in the sum we have $\|[\Phi_t^*(O_i), L_j]\|_\infty \le 2$ by the submultiplicativity of the operator norm, a triangle inequality and $\|L_j\|_\infty \le 1$. In case $A_i$ and $B_j$ do not overlap, the stronger bound in Eq. (LR1) holds.

To illustrate this bound more concretely, let us take $O = \sum_{i=1}^n Z_i$, $L_j$ acting on $[j, j + k]$ for $j = 1, \dots, n - k$, and $g(\mathrm{dist}(i, j)) = e^{-\mu|i-j|}$ for some constant $\mu$, i.e. we have a 1D differential structure and a local time evolution on a 1D lattice. Then for any $j$:

$$\sum_i o_{i,j}(t) \le k + (e^{vt} - 1)\Big(\sum_{i=1}^{j-1}e^{-\mu|i-j|} + \sum_{i=j+k+1}^n e^{-\mu|i-j|}\Big) \le k + \frac{2(e^{vt} - 1)e^{-\mu}}{1 - e^{-\mu}}.$$

Thus,

$$\|\Phi_t^*(O)\|_{\mathrm{Lip},\nabla} = \mathcal{O}\Big(\sqrt{n}\Big(k + \frac{(e^{vt} - 1)e^{-\mu}}{1 - e^{-\mu}}\Big)\Big).$$

We see that for constant times the Lipschitz constant is still of order $\sqrt{n}\,k$.
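A quick numerical illustration of the lightcone sum above, purely a sketch of the scaling (we set $c = 1$ for simplicity):

```python
import numpy as np

def lightcone_sum(n, k, j, v, t, mu):
    """Sum_i o_{i,j}(t) for the 1D example: o_{i,j} = 2 on the support
    [j, j+k] of L_j and (e^{vt}-1) e^{-mu*dist} outside of it."""
    total = 0.0
    for i in range(1, n + 1):
        if j <= i <= j + k:
            total += 2.0
        else:
            dist = min(abs(i - j), abs(i - (j + k)))
            total += (np.exp(v * t) - 1) * np.exp(-mu * dist)
    return total

# For constant t the sum is O(k), essentially independent of n:
for n in (100, 1000, 10000):
    print(n, round(lightcone_sum(n, k=5, j=n // 2, v=1.0, t=0.5, mu=1.0), 3))
```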

Let us now derive a similar, yet somewhat stronger, version of Prop. B.1 for $\|\cdot\|_{\mathrm{Lip},\square}$. In some situations, bounds like (LR1) can be further exploited in order to prove the quasi-locality of Markovian quantum dynamics [57]: for any region $C \subset D \subset V$, there exists a local Markovian evolution $t \mapsto \Phi_t^{(D)}$ that acts non-trivially only on region $D$, and such that for any observable $O_C$ supported on region $C$,

$$\big\|\big(\Phi_t^* - (\Phi_t^{(D)})^*\big)(O_C)\big\|_\infty \le c'(e^{vt} - 1)\,h(\mathrm{dist}(C, V\setminus D))\,\|O_C\|_\infty, \tag{LR2}$$

for some other function $h : \mathbb{N} \to \mathbb{R}_+$ such that $\lim_{x\to\infty}h(x) = 0$ and some constant $c' > 0$. In other words, at small times, the channels $\Phi_t^*$ can be well-approximated by local Markovian dynamics when acting on local observables. Let us now estimate the growth of the Lipschitz constant of [16] given a Lieb-Robinson bound:

Proposition B.2. Assume that $\Phi_t$ satisfies the bound (LR2). Then, for any two quantum states $\rho, \sigma \in \mathcal{D}_{d^n}$ and any ordering $\{1, \dots, n\}$ of the graph:

$$W_{1,\square}(\Phi_t(\rho), \Phi_t(\sigma)) \le \Big(8 + 2c'(e^{vt} - 1)\sum_{i=3}^n h(\mathrm{dist}(\{i\cdots n\}, \{1\}))\Big)W_{1,\square}(\rho, \sigma). \tag{B9}$$

Moreover, for any observable $H \in \mathcal{M}_{d^n}$,

$$\|\Phi_t^*(H)\|_{\mathrm{Lip},\square} \le \Big(8 + 2c'(e^{vt} - 1)\sum_{i=3}^n h(\mathrm{dist}(\{i\cdots n\}, \{1\}))\Big)\|H\|_{\mathrm{Lip},\square}. \tag{B10}$$

Proof. From [16], the Wasserstein distance $W_{1,\square}$ arises from a norm $\|\cdot\|_{W_1}$, i.e. $W_{1,\square}(\rho, \sigma) = \|\rho - \sigma\|_{W_1}$. Moreover, the norm $\|\cdot\|_{W_1}$ is uniquely determined by its unit ball $\mathcal{B}_n$, which in turn is the convex hull of the set $\mathcal{N}_n$ of differences between couples of neighboring quantum states:

$$\mathcal{N}_n = \bigcup_{i\in V}\mathcal{N}_n^{(i)}, \qquad \mathcal{N}_n^{(i)} = \{\rho - \sigma : \rho, \sigma\in\mathcal{D}_{d^n},\ \mathrm{tr}_i(\rho) = \mathrm{tr}_i(\sigma)\}.$$

Now by convexity, the contraction coefficient for this norm is equal to

$$\|\Phi_t\|_{W_1\to W_1} = \max\big\{\|\Phi_t(X)\|_{W_1} : X\in\mathcal{M}_{d^n}^{sa,0},\ \|X\|_{W_1}\le 1\big\} = \max_{X\in\mathcal{N}_n}\|\Phi_t(X)\|_{W_1},$$

where $\mathcal{M}_{d^n}^{sa,0}$ denotes the set of self-adjoint, traceless observables. Let then $X \in \mathcal{N}_n$. By the expression (B4), and choosing without loss of generality an ordering of the vertices such that $\mathrm{tr}_1(X) = 0$, we have

$$\|\Phi_t(X)\|_{W_1} \le \frac{1}{2\sqrt{n}}\sum_{i=1}^n\Big\|\frac{I_{d^{i-1}}}{d^{i-1}}\otimes\mathrm{tr}_{1\cdots i-1}\circ\Phi_t(X) - \frac{I_{d^i}}{d^i}\otimes\mathrm{tr}_{1\cdots i}\circ\Phi_t(X)\Big\|_1$$
$$= \frac{1}{2\sqrt{n}}\sum_{i=1}^n\Big\|\int d\mu(U_i)\,\mathrm{tr}_{1\cdots i-1}\circ\big(\Phi_t(X) - U_i\Phi_t(X)U_i^\dagger\big)\Big\|_1$$
$$\le \frac{1}{2\sqrt{n}}\sum_{i=1}^n\int d\mu(U_i)\,\big\|[U_i, \mathrm{tr}_{1\cdots i-1}\circ\Phi_t(X)]\big\|_1$$
$$\le \frac{1}{\sqrt{n}}\sum_{i=1}^n\big\|\mathrm{tr}_{1\cdots i-1}\circ\Phi_t(X)\big\|_1$$
$$\stackrel{(1)}{=} \frac{1}{\sqrt{n}}\sum_{i=1}^n\big\|\mathrm{tr}_{1\cdots i-1}\circ(\Phi_t - \Phi_t^{(i-k\cdots n)})(X)\big\|_1 \tag{B11}$$

where $\mu$ denotes the Haar measure on one qudit, and where (1) follows from the fact that $\mathrm{tr}_1(X) = 0$, with $\Phi_t^{(i-k\cdots n)} \equiv \Phi_t^{(\{i-k,\cdots,n\})}$ defined as in Eq. (LR2) with $k < i - 1$. Next, by the variational formulation of the trace distance and Eq. (LR2), we have for $i \ge 3$ that

$$\big\|\mathrm{tr}_{1\cdots i-1}\circ(\Phi_t - \Phi_t^{(i-k\cdots n)})(X)\big\|_1 = \max_{\|O_{i\cdots n}\|_\infty\le 1}\mathrm{tr}\big[X\big(\Phi_t^* - \Phi_t^{(i-k\cdots n)*}\big)(O_{i\cdots n})\big]$$
$$\le \max_{\|O_{i\cdots n}\|_\infty\le 1}\big\|\big(\Phi_t^* - \Phi_t^{(i-k\cdots n)*}\big)(O_{i\cdots n})\big\|_\infty\|X\|_1$$
$$\le c'(e^{vt} - 1)\,h(\mathrm{dist}(\{i\cdots n\}, \{1\cdots i-k-1\}))\,\|X\|_1$$
$$\stackrel{(2)}{\le} 2c'(e^{vt} - 1)\,h(\mathrm{dist}(\{i\cdots n\}, \{1\cdots i-k-1\}))\,\sqrt{n}\,\|X\|_{W_1},$$

where (2) follows from [16, Proposition 6]. By picking $k = i - 2$ and inserting this estimate into Eq. (B11) for $i \ge 3$, together with the trivial estimate $\|\mathrm{tr}_{1\cdots i-1}\circ(\Phi_t - \Phi_t^{(i-k\cdots n)})(X)\|_1 \le 2\|X\|_1$ for $i = 1, 2$, we obtain Eq. (B9). Eq. (B10) follows by duality.

3. Transportation cost inequalities

Although interesting on their own, the relevance of the Lipschitz constants introduced above becomes clearer in our context when we also have a transportation cost inequality [54, 59]. A quantum state $\sigma$ satisfies a transportation cost inequality with constant $\alpha > 0$ if for any other state $\rho$ it holds that:

$$W_1(\rho, \sigma) \le \sqrt{\frac{D(\rho\|\sigma)}{2\alpha}}, \tag{B12}$$

where $W_1 \in \{W_{1,\square}, W_{1,\nabla}, W_{1,\mathrm{loc}}\}$. In what follows, we simply write $\|\cdot\|_{\mathrm{Lip}}$ and $W_1$ to denote any of the Lipschitz constants, and their corresponding Wasserstein metrics, defined above. This inequality should be thought of as a stronger version of Pinsker's inequality that is tailored to a state $\sigma$ and the underlying Wasserstein distance.

One of the well-established techniques to establish a transportation cost inequality for $W_{1,\nabla}$ is to exploit the fact that it is implied by a functional inequality called the modified logarithmic Sobolev inequality. It is beyond the scope of this paper to explain this connection and we refer to e.g. [14] and references therein for a discussion of these topics. But for our purposes it is important to note that in [26] the authors and Capel show modified logarithmic Sobolev inequalities for several classes of Gibbs states. More recently, one of the authors and De Palma derived such transportation cost inequalities for $W_{1,\square}$ in [18]. In Theorem B.1 below we summarize the regimes for which transportation cost inequalities are known to hold:

Theorem B.1 (Transportation cost for commuting Hamiltonians [18, 26, 60]). Let $E_1, \dots, E_m \subset \mathcal{M}_{d^n}$ be a set of $k$-local, linearly independent, commuting operators with $\|E_i\|_\infty \le 1$. Then $\sigma(\lambda)$ satisfies a transportation cost inequality with constant $\alpha > 0$ for all $\lambda \in B_{\ell_\infty}(0, 1)$ in the following cases:

(i) The $E_i$ are classical or nearest-neighbour (i.e. $k = 2$) on a regular lattice and the inverse temperature satisfies $\beta < \beta_c$, where $\beta_c$ only depends on $k$ and the dimension of the lattice, for both $W_{1,\nabla}$ and $W_{1,\square}$, with $\alpha = \Omega(1)$ [26].

(ii) The operators $E_i$ are local with respect to a hypergraph $G = (V, E)$ and the inverse temperature satisfies $\beta < \beta_c$, where $\beta_c$ only depends on $k$ and properties of the interaction hypergraph, for $W_{1,\square}$ and $\alpha = \Omega(1)$ [18, Theorem 3, Proposition 4].

(iii) The $E_i$ are one-dimensional and $\beta > 0$, for both $W_{1,\nabla}$ and $W_{1,\square}$, with $\alpha = \Omega(\log(n)^{-1})$ [60].

Moreover, the underlying differential structure $(\nabla, \sigma)$ consists of $L_k$ acting on at most $\mathcal{O}(k)$ qudits.

Theorem B.1 establishes that transportation cost inequalities are satisfied for most classes of commuting Hamiltonians known to have exponential decay of correlations.

Remark B.1. In [18, Proposition 5], the authors show that $W_{1,\square} \le c(k)W_{1,\nabla}$ holds for some constant $c(k)$ depending on the locality of the differential structure. This implies that a transportation cost inequality for $W_{1,\nabla}$ implies one for $W_{1,\square}$, with the constant rescaled by $c(k)$. Thus, although the authors of [26, 60] only obtain the result for $W_{1,\nabla}$, we can use it to translate it to $W_{1,\square}$. We conclude that for commuting Hamiltonians TC inequalities are available for essentially all classes of local Hamiltonians for which they are expected to hold. In order to consider a larger class of states satisfying a transportation cost inequality, it is fruitful to restrict the analysis to quasi-local observables and consider $W_{1,\mathrm{loc}}$. In this case, the following result holds:

Theorem B.2 (Transportation cost for quasi-local Wasserstein distances). For $(k, g) \in \mathbb{N}\times\mathbb{R}_+$, let $\sigma$ be a $(k, g)$ quasi-local Gibbs state at inverse temperature $\beta < \beta_c := \frac{1}{8e^3gk}$. Then, the following holds for all $\rho \in \mathcal{D}_{d^n}$:

$$W_{1,\mathrm{loc}}(\rho, \sigma) \le \sqrt{\frac{2g}{\beta_c - \beta}D(\rho\|\sigma)}.$$

Proof. We resort to [24, Theorem 1], where it is shown that for $\beta < \beta_c$, any $X \in \mathcal{M}_{\mathrm{loc}}^{(k,g)}$ such that $\mathrm{tr}[X\sigma] = 0$, and any $\tau$ with $|\tau| \le \beta_c - \beta$,

$$\ln\mathrm{tr}\big[e^{-\tau X}\sigma\big] \le \frac{\tau^2\sum_A\|X_A\|_\infty^2}{\beta_c - \beta - |\tau|} \le \frac{\tau^2gn}{\beta_c - \beta - |\tau|}.$$

Next, by Golden-Thompson's inequality as well as the variational formulation of the relative entropy, we have

$$-\tau\,\mathrm{tr}[\rho X] - D(\rho\|\sigma) \le \ln\mathrm{tr}\big[e^{-\tau X + \ln\sigma}\big] \le \ln\mathrm{tr}\big[e^{-\tau X}\sigma\big] \le \frac{\tau^2gn}{\beta_c - \beta - |\tau|}.$$

Next, we can choose $\tau = -\alpha\,\mathrm{tr}[\rho X]$ for some parameter $\alpha$ to be chosen later, so that

$$\frac{\alpha(\beta_c - \beta)\,\mathrm{tr}[\rho X]^2}{\beta_c - \beta + \alpha gn} \le D(\rho\|\sigma).$$

The result follows after taking $\alpha = \beta_c - \beta$, so that $|\tau| \le \frac{\beta_c - \beta}{2}$, and then optimizing over observables $X \in \mathcal{M}_{\mathrm{loc}}^{(k,g)}$.

Remark B.2. Theorem B.2 has the advantage that it holds for essentially all high-temperature Gibbs states, even with long-range interactions. However, its weakness is also apparent: it only allows us to obtain recovery guarantees for observables with support on at most $k$ qubits, while the other transportation cost inequalities allow for arbitrarily large supports. By exploiting the trade-off between the locality and the temperature, it is possible to extend the result to observables with higher locality by suitably rescaling the inverse temperature range.

Appendix C: Combining the maximum entropy method with transportation cost inequalities

With these tools at hand, we are now ready to show that, by resorting to transportation cost inequalities, it is possible to obtain exponential improvements in the sample complexity required to learn a Gibbs state. First, let us briefly review shadow tomography and Pauli-regrouping techniques [9-12]. Although these methods all work under slightly different assumptions and performance guarantees, they have in common that they allow us to learn the expectation values of $M$ $k$-local observables $O_1, \dots, O_M \in \mathcal{M}_{2^n}$ with $\|O_i\|_\infty \le 1$ up to an error $\epsilon$ with failure probability at most $\delta$ by measuring $\mathcal{O}(e^{\mathcal{O}(k)}\log(M\delta^{-1})\epsilon^{-2})$ copies of the state.

For instance, for the shadow methods of [9], we obtain a $\mathcal{O}(4^k\log(M\delta^{-1})\epsilon^{-2})$ scaling by measuring in random single-qubit bases. The estimate is then obtained by an appropriate median-of-means procedure for the expectation value of each output string. The computation for obtaining the expectation value of $E_i$ through this method then entails evaluating the expectation value of the observables on $\mathcal{O}(4^k\log(M\delta^{-1})\epsilon^{-2})$ product states. For $k$-local observables $E_i$, evaluating the expectation value of $E_i$ on a product state takes time $\mathcal{O}(e^{ck}\log(M\delta^{-1})\epsilon^{-2})$ for some $c > 0$. Thus, we see that for $k = \mathcal{O}(\log(n))$ also the postprocessing can be done efficiently.

The application of such results to maximum entropy methods is then clear: given $E_1, \dots, E_m$ assumed to be at most $k$-local, with probability at least $1 - \delta$ we can obtain an estimate $e'(\lambda)$ of $e(\lambda)$ satisfying:

$$\|e'(\lambda) - e(\lambda)\|_{\ell_p} \le m^{\frac{1}{p}}\epsilon \tag{C1}$$

using $\mathcal{O}(4^k\log(m\delta^{-1})\epsilon^{-2})$ copies of $\sigma(\lambda)$. We then finally obtain our main theorem, Theorem I.1, restated here for the sake of clarity:

Theorem C.1 (Fast learning of Gibbs states). Let $\sigma(\lambda) \in \mathcal{D}_{2^n}$ be an $n$-qubit Gibbs state at inverse temperature $\beta$ with respect to a set of $k$-local operators $\mathcal{E} = \{E_i\}_{i=1}^m$ that satisfies a transportation cost inequality with constant $\alpha$. Moreover, assume that $\mu \mapsto \log(\mathcal{Z}(\mu))$ is $L, U$ strongly convex on $B_{\ell_\infty}(0, 1)$. Then

$$\mathcal{O}\Big(4^k\alpha^{-1}\epsilon^{-2}\beta\log(m\delta^{-1})\min\Big\{\frac{m\beta}{Ln}, \frac{m^2}{\alpha n^2\epsilon^2}\Big\}\Big)$$

samples of $\sigma(\lambda)$ are sufficient to obtain a state $\sigma(\mu)$ that satisfies:

$$|\mathrm{tr}[O(\sigma(\lambda) - \sigma(\mu))]| \le \epsilon\,n^{\frac{1}{2}}\|O\|_{\mathrm{Lip}} \tag{C2}$$

for all Lipschitz observables $O$, with probability at least $1 - \delta$.

Proof. Using the aforementioned methods of [9] we can obtain an estimate $e'(\lambda)$ satisfying the guarantee of Eq. (A11) with probability at least $1 - \delta$. We will now resort to the results of Thm. A.1 to obtain guarantees on the output of the maximum entropy algorithm. We solve the maximum entropy problem with $e'(\lambda)$ and set the stopping criterion for the output $\mu_*$ as

$$\|e'(\lambda) - e(\mu_*)\|_{\ell_2} < (4c + 1)\sqrt{m}\,\epsilon$$

for $c > 10$. Then it follows from Thm. A.1 that:

$$D(\sigma(\mu_*)\|\sigma(\lambda)) \le D(\sigma(\mu_*)\|\sigma(\lambda)) + D(\sigma(\lambda)\|\sigma(\mu_*)) = \mathcal{O}\big(\beta\epsilon\min\{\beta L^{-1}\epsilon, 1\}\,m\big). \tag{C3}$$

This can then be combined with transportation cost inequalities. Indeed, we have:

$$|\mathrm{tr}[O(\sigma(\lambda) - \sigma(\mu_*))]| \le \|O\|_{\mathrm{Lip}}W_1(\sigma(\mu_*), \sigma(\lambda)) \le \|O\|_{\mathrm{Lip}}\sqrt{\frac{D(\sigma(\mu_*)\|\sigma(\lambda))}{2\alpha}}.$$

Inserting our bound on the relative entropy in Eq. (C3) we obtain:

$$|\mathrm{tr}[O(\sigma(\lambda) - \sigma(\mu_*))]| = \mathcal{O}\Big(\|O\|_{\mathrm{Lip}}\min\Big\{(\alpha L)^{-\frac{1}{2}}\epsilon\beta\sqrt{m},\ \sqrt{\beta\epsilon m\alpha^{-1}}\Big\}\Big).$$

We conclude by suitably rescaling $\epsilon$.
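Schematically, the proof corresponds to the following pipeline, continuing the Sec. A snippets. The noisy estimate stands in for the classical-shadow estimator of [9]; all helpers are the illustrative ones defined earlier:

```python
# Step 1: estimate e(lambda); here exact values plus noise of size eps,
# mimicking the guarantee ||e'(lambda) - e(lambda)||_inf <= eps of Eq. (A11).
eps = 1e-2
e_prime = e_lam + rng.uniform(-eps, eps, size=m)

# Step 2: solve the maximum entropy problem on the noisy data (Prop. A.6).
mu_hat = max_entropy_solve(e_prime, beta, U=2 * beta**2 * m,
                           delta=eps * np.sqrt(m))

# Step 3: the relative-entropy bound (C3) plus the TC inequality (B12) turn
# this into a bound on tr[O(sigma(lambda) - sigma(mu_hat))] for Lipschitz O.
print(rel_entropy(gibbs_state(lam, beta), gibbs_state(mu_hat, beta)))
```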

The theorem above yields exponential improvements in the sample complexity of learning the expectation values of certain observables for classes of states that satisfy a transportation cost inequality with $\alpha = \Omega(\log(n)^{-1})$. As discussed in Sec. B 2, extensive observables that can be written as a sum of $l$-local observables have a Lipschitz constant that satisfies $\|O\|_{\mathrm{Lip}} = \mathcal{O}(l\sqrt{n})$. Shadow-like methods would require $\mathcal{O}(e^{\mathcal{O}(l)}\log(m\delta^{-1})\epsilon^{-2})$ samples to learn such observables up to a relative error of $\epsilon n$. Our methods, however, require $\mathcal{O}(e^{\mathcal{O}(k)}\mathrm{poly}(l, \epsilon^{-1})\log(m\delta^{-1}))$, which yields exponential speedups in the regime $l = \mathrm{poly}(\log(n), \epsilon^{-1})$. Of course, it should also be mentioned that classical shadows do not require any assumptions on the underlying state. Furthermore, considering the exponential dependency of the sample complexity on the locality for shadow-like methods, we believe that our methods yield practically significant savings already in the regime in which we wish to obtain expectation values of observables with relatively small support. For instance, for high-temperature Gibbs states of nearest-neighbour Hamiltonians, shadows require a factor of $\sim 10^7$ more samples than solving the maximum entropy problem to estimate observables supported on 15 qubits to the same precision. On the other hand, previous guarantees on learning quantum many-body states [3, 4, 7, 38] required a precision polynomial in system size to obtain a nontrivial approximation, which implies a polynomial sample complexity. Thus, our recovery guarantees are also exponentially better compared to standard many-body tomography results.

1. Results for shallow circuits

Let us be more explicit about how to leverage our results to also cover the outputs of short-depth circuits. To this end, let $G = (V, E)$ be a graph that models the interactions of a unitary circuit and suppose we implement $L$ layers of an unknown unitary circuit consisting of 1- and 2-qubit unitaries laid out according to $G$. That is, we have an unknown shallow circuit $\mathcal{U}$ of depth $L$ with respect to $G$. More precisely,

$$\mathcal{U} = \prod_{\ell\in[L]}\bigotimes_{e\in\mathcal{E}_\ell}U_{\ell,e}, \tag{C4}$$

where each $\mathcal{E}_\ell \subset E$ is a subset of the edges such that any $e, e' \in \mathcal{E}_\ell$ do not share a vertex. Our goal is to show how to approximately learn the state $|\psi\rangle = \mathcal{U}|0\rangle^{\otimes n}$. The overall idea consists in finding a Gibbs state approximating $|\psi\rangle$ in Wasserstein distance. We will then find a differential structure for (approximations of) shallow circuits and show that the (approximation of the) output satisfies a TC inequality with respect to it. Thus, it suffices to control the relative entropy with this approximation to ensure a good approximation in Wasserstein distance. Let us find the appropriate approximation. First, note that for $\beta_\epsilon = \log(\epsilon^{-1})$ and $H_0 = -\sum_i Z_i$, where $Z_i$ denotes the Pauli matrix $Z = |0\rangle\langle 0| - |1\rangle\langle 1|$ acting on site $i$, we have that:

$$D\Big(\mathcal{U}|0\rangle\langle 0|^{\otimes n}\mathcal{U}^\dagger\Big\|\mathcal{U}\frac{e^{-\beta_\epsilon H_0}}{\mathrm{tr}[e^{-\beta_\epsilon H_0}]}\mathcal{U}^\dagger\Big) = D\Big(|0\rangle\langle 0|^{\otimes n}\Big\|\frac{e^{-\beta_\epsilon H_0}}{\mathrm{tr}[e^{-\beta_\epsilon H_0}]}\Big) = nD\Big(|0\rangle\langle 0|\Big\|\frac{e^{\beta_\epsilon Z}}{\mathrm{tr}[e^{\beta_\epsilon Z}]}\Big) = -n\beta_\epsilon\langle 0|Z|0\rangle + n\ln\mathrm{tr}\big[e^{\beta_\epsilon Z}\big] = n\ln\big(1 + e^{-2\beta_\epsilon}\big) \le ne^{-2\beta_\epsilon} = \epsilon^2n. \tag{C5}$$

Thus, if the states $\mathcal{U}\frac{e^{-\beta_\epsilon H_0}}{\mathrm{tr}[e^{-\beta_\epsilon H_0}]}\mathcal{U}^\dagger$ satisfy a transportation cost inequality with some constant $\alpha$, this would allow us to conclude that

$$W_1\Big(\mathcal{U}|0\rangle\langle 0|^{\otimes n}\mathcal{U}^\dagger, \mathcal{U}\frac{e^{-\beta_\epsilon H_0}}{\mathrm{tr}[e^{-\beta_\epsilon H_0}]}\mathcal{U}^\dagger\Big) \le \epsilon\sqrt{\frac{n}{2\alpha}}.$$

Moreover, defining $H_{\mathcal{U}} = -\sum_i\mathcal{U}Z_i\mathcal{U}^\dagger$, we see that $\mathcal{U}\frac{e^{-\beta_\epsilon H_0}}{\mathrm{tr}[e^{-\beta_\epsilon H_0}]}\mathcal{U}^\dagger = \frac{e^{-\beta_\epsilon H_{\mathcal{U}}}}{\mathrm{tr}[e^{-\beta_\epsilon H_{\mathcal{U}}}]}$. As we know the geometry of $\mathcal{U}$, we can bound the support of $\mathcal{U}Z_i\mathcal{U}^\dagger$. Thus, we only need to find a suitable transportation cost inequality to see that this approximation fits into our framework.

Let us now find a suitable differential structure for the state $\frac{e^{-\beta_\epsilon H_0}}{\mathrm{tr}[e^{-\beta_\epsilon H_0}]}$. For simplicity, denote $\tau_p = p|0\rangle\langle 0| + (1 - p)|1\rangle\langle 1|$ and note that $\frac{e^{-\beta_\epsilon H_0}}{\mathrm{tr}[e^{-\beta_\epsilon H_0}]} = \tau_p^{\otimes n}$ with $p = \frac{e^{\beta_\epsilon}}{e^{\beta_\epsilon} + e^{-\beta_\epsilon}}$. Moreover, let

$$a_i = I^{\otimes i-1}\otimes|0\rangle\langle 1|\otimes I^{\otimes n-i}$$

be the annihilation operator acting on qubit $i$. Defining $L_{i,0} = (p(1 - p))^{\frac{1}{4}}a_i$ and $L_{i,1} = L_{i,0}^\dagger$, we get a differential structure for $\tau_p^{\otimes n}$ with $\{L_{i,k}, \omega_{i,k}\}$ for $i = 1, \dots, n$, $k = 0, 1$, where $\omega_{i,0} = \ln\big(\frac{1-p}{p}\big)$ and $\omega_{i,1} = \ln\big(\frac{p}{1-p}\big)$. That this is indeed a differential structure follows from a simple computation. One can readily check that the induced Lipschitz constant is given by:

$$\|O\|_{\mathrm{Lip},\nabla}^2 = \sum_{i=1}^n\Big(\sqrt{\frac{p}{1-p}} + \sqrt{\frac{1-p}{p}}\Big)\big(\|[O, L_{i,0}]\|_\infty^2 + \|[O, L_{i,1}]\|_\infty^2\big) = \sum_{i=1}^n\|[O, a_i]\|_\infty^2 + \|[O, a_i^\dagger]\|_\infty^2.$$

Thus, we see that the Lipschitz constant takes a particularly simple form for this differential structure. Moreover, it is not difficult to see that $\{\mathcal{U}L_{i,k}\mathcal{U}^\dagger, \omega_{i,k}\}$ provides a differential structure for the state $\mathcal{U}\tau_p^{\otimes n}\mathcal{U}^\dagger$. Indeed, it is easily checked that this new differential structure still gives eigenvectors of the modular operator. Importantly, the results of [61, Theorem 19] establish that the state $\tau_p^{\otimes n}$ satisfies a transportation cost inequality with constant $\frac{1}{2}$. Putting all these elements together we have:

Theorem C.2 (Transportation cost for shallow circuits). Let $\mathcal{U}$ be an unknown depth-$L$ quantum circuit on $n$ qubits defined on a graph $G = (V, E)$ and $|\psi\rangle = \mathcal{U}|0\rangle^{\otimes n}$. Define for $\epsilon > 0$, $H_{\mathcal{U}} = -\sum_i\mathcal{U}Z_i\mathcal{U}^\dagger$ and $\sigma(\mathcal{U}, \epsilon) = \frac{e^{-\beta_\epsilon H_{\mathcal{U}}}}{\mathrm{tr}[e^{-\beta_\epsilon H_{\mathcal{U}}}]}$ with $\beta_\epsilon = \log(\epsilon^{-1})$. Then for any state $\rho$ and all observables $O$ we have:

$$|\mathrm{tr}[O(|\psi\rangle\langle\psi| - \rho)]| \le \|O\|_{\mathrm{Lip},\nabla}\sqrt{n}\Big(\epsilon + \sqrt{D(\rho\|\sigma(\mathcal{U}, \epsilon))}\Big)$$

with

$$\|O\|_{\mathrm{Lip},\nabla}^2 = \sum_{i=1}^n\|[O, \mathcal{U}a_i\mathcal{U}^\dagger]\|_\infty^2 + \|[O, \mathcal{U}a_i^\dagger\mathcal{U}^\dagger]\|_\infty^2.$$

Proof. We have:

$$|\mathrm{tr}[O(|\psi\rangle\langle\psi| - \rho)]| \le |\mathrm{tr}[O(|\psi\rangle\langle\psi| - \sigma(\mathcal{U}, \epsilon))]| + |\mathrm{tr}[O(\sigma(\mathcal{U}, \epsilon) - \rho)]|.$$

The claim then follows from the discussion above, as $\sigma(\mathcal{U}, \epsilon)$ satisfies a transportation cost inequality with constant $\frac{1}{2}$.

Of course, the result above has the downside that the Lipschitz constant depends on the unknown circuit $\mathcal{U}$. However, as we can estimate the locality of each term $\mathcal{U}a_i\mathcal{U}^\dagger$, it is also possible to estimate the Lipschitz constant by controlling the overlap of the observable $O$ with each $\mathcal{U}a_i\mathcal{U}^\dagger$.

The result of Thm. C.2 and the discussion preceding it also give us a method for efficiently learning the outputs of shallow circuits, as illustrated by the following proposition:

Proposition C.1 (Learning the outputs of shallow circuits). Let $\mathcal{U}$ be an unknown $n$-qubit quantum circuit with known locality structure as in Eq. (C4) and $|\psi\rangle = \mathcal{U}|0\rangle^{\otimes n}$. Moreover, define $l_0$ as

$$l_0 = \max_{1\le i\le n}|\mathrm{supp}(\mathcal{U}Z_i\mathcal{U}^\dagger)|.$$

For some $\epsilon > 0$ we have that

$$\mathcal{O}\big(\epsilon^{-2}4^{3l_0}\log^4(n4^{l_0}\epsilon^{-1})\log(4^{l_0}n\delta^{-1})\big) \tag{C6}$$

samples of $|\psi\rangle$ suffice to learn a Gibbs state $\sigma(\mu)$ that satisfies

$$W_{1,\nabla}(|\psi\rangle\langle\psi|, \sigma(\mu)) \le \sqrt{n}\,\epsilon \tag{C7}$$

with probability of success at least $1 - \delta$. Along similar lines, we have that

$$\mathcal{O}\big(\epsilon^{-2}4^{3l_0}n^2\log^4(n4^{l_0}\epsilon^{-1})\log(4^{l_0}n\delta^{-1})\big) \tag{C8}$$

samples suffice to learn a state $\sigma(\lambda)$ such that

$$\||\psi\rangle\langle\psi| - \sigma(\lambda)\|_{\mathrm{tr}} \le \sqrt{\epsilon}.$$

Proof. Let $\mathcal{E}_i$ be a basis of Pauli operators for matrices on the support of $\mathcal{U}Z_i\mathcal{U}^\dagger$. By our assumption on $l_0$, we know that $|\mathcal{E}_i| \le 4^{l_0}$ for each $i$. For simplicity we will assume that no Pauli word is contained in two distinct $\mathcal{E}_i$ and we will enumerate all the different Pauli words as $\{E_{i,j}\}$ for $1 \le i \le n$ and $1 \le j \le 4^{l_0}$, indicating the elements of the different $\mathcal{E}_i$. Thus, there is a $\lambda \in \mathbb{R}^m$ with $m \le n4^{l_0}$ and $\|\lambda\|_{\ell_\infty} \le 1$ such that

$$\sigma(\mathcal{U}, \epsilon) \propto \exp\Big(-\beta_\epsilon\sum_{i=1}^n\sum_{E_{i,j}\in\mathcal{E}_i}\lambda_{i,j}E_{i,j}\Big).$$

Let $\tilde{\epsilon} > 0$ be given. It follows from Eq. (C5) and Pinsker's inequality that picking $\epsilon = \frac{\tilde{\epsilon}}{4n}$ is sufficient to ensure that

$$\Big\||\psi\rangle\langle\psi| - \sigma\Big(\mathcal{U}, \frac{\tilde{\epsilon}}{4n}\Big)\Big\|_{\mathrm{tr}} \le \frac{\tilde{\epsilon}}{2}. \tag{C9}$$

Measuring

$$\mathcal{O}\big(4^{l_0}\tilde{\epsilon}^{-2}\log(4^{l_0}n\delta^{-1})\big) \tag{C10}$$

copies of $|\psi\rangle$ is sufficient to obtain estimates of $\mathrm{tr}[|\psi\rangle\langle\psi|E_{i,j}]$ up to an $\tilde{\epsilon}/4$ error. By Eq. (C9) and a triangle inequality, they are also $\tilde{\epsilon}/2$ away from $\mathrm{tr}\big[\sigma\big(\mathcal{U}, \frac{\tilde{\epsilon}}{4n}\big)E_{i,j}\big]$. Thus, running the maximum entropy principle with these estimates and the basis of operators given by $\cup_{i=1}^n\mathcal{E}_i$ will yield an estimate $\sigma(\mu)$ that satisfies:

$$D\Big(\sigma(\mu)\Big\|\sigma\Big(\mathcal{U}, \frac{\tilde{\epsilon}}{4n}\Big)\Big) \le \log\Big(\frac{4n}{\tilde{\epsilon}}\Big)m\tilde{\epsilon} \le \log\Big(\frac{4n}{\tilde{\epsilon}}\Big)4^{l_0}n\tilde{\epsilon}.$$

To obtain the estimate in Wasserstein distance in Eq. (C7), we can pick $\tilde{\epsilon} = \mathcal{O}\big(\epsilon/(4^{l_0}\log^2(n4^{l_0}\epsilon^{-1}))\big)$, as in this case

$$\log\Big(\frac{4n}{\tilde{\epsilon}}\Big)m\tilde{\epsilon} \le \log\Big(\frac{4n}{\tilde{\epsilon}}\Big)4^{l_0}n\tilde{\epsilon} = \mathcal{O}(\epsilon^2n).$$

The claim then follows from the results of Thm. C.2 after substituting $\tilde{\epsilon}$ into Eq. (C10). For the statement on the sample complexity for the trace distance, we may pick $\tilde{\epsilon} = \mathcal{O}\big(\epsilon/(n4^{l_0}\log^2(n4^{l_0}\epsilon^{-1}))\big)$, which yields the statement after applying Pinsker's inequality.
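A small numerical check of the single-qubit differential structure used in the construction above (the modular eigenvector relation (B5) for $L_{i,0}$; a sketch with an arbitrary value of $p$):

```python
import numpy as np

p = 0.9
tau = np.diag([p, 1 - p])                   # tau_p on one qubit
a = np.array([[0.0, 1.0], [0.0, 0.0]])      # annihilation operator |0><1|
L0 = (p * (1 - p)) ** 0.25 * a              # L_{i,0}; L_{i,1} = L0^dagger

# Modular operator Delta_sigma(X) = sigma X sigma^{-1}, cf. Eq. (B5):
delta_L0 = tau @ L0 @ np.linalg.inv(tau)
omega0 = np.log((1 - p) / p)
assert np.allclose(delta_L0, np.exp(-omega0) * L0)   # Delta(L0) = e^{-w0} L0
```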

This shows that shallow circuits can be learned efficiently as long as $l_0 = \mathcal{O}(\log(n))$. However, it is not immediately obvious how to also estimate the expectation values of the Gibbs states required to run the maximum entropy algorithm. Thus, at least with the methods presented here, the postprocessing takes time exponential in the number of qubits.

a. Ground states of gapped systems: in light of our results for shallow circuits, it is natural to ask to what extent our framework can be extended to ground states of gapped Hamiltonians, especially in 1D. Thus, let us briefly comment on the technical barriers in the way of such statements. First, notice that the statement of Thm. C.2 required inverse temperatures scaling logarithmically in system size for the approximation of the ground state. Most known TC inequalities have an exponential scaling with the inverse temperature and, thus, at this inverse temperature the savings compared to Pinsker are not quadratic, hindering a straightforward extension to gapped systems. There are some nontrivial examples of ground states satisfying a TC inequality with the constant depending inverse-linearly on the temperature, like graph states [62]. But, as they can almost by definition be prepared by a constant-depth quantum circuit, they fall into the assumptions of the previous statement. However, the results of [37] assert that $k$-local density matrices of ground states of gapped Hamiltonians in 1D can be approximated by constant-depth circuits, giving evidence that our framework should also extend to such states. To get there, a technical obstacle has to be overcome in the proof of Thm. C.2. Essentially, we need to show that local reduced density matrices of the ground state are already well-approximated at inverse temperature $\log(\epsilon^{-1})$. With such a statement we could show that we can still approximate the expectation values of the $E_i$ at inverse temperature $\log(\epsilon^{-1})$ from samples of the ground state $|\psi\rangle$.

Appendix D: Summary of known strong convexity constants

As we see in the statements of Thm. A.1 and Thm. C.1, having a bound on the strong convexity constant $L^{-1}$ can give a quadratic improvement in the sample complexity in terms of the error $\epsilon$. Here we briefly summarize the cases for which estimates on $L$ are known in the literature for the classes of states considered in this work.

a. General many-body quantum states: first, we should mention the results of [5]. There the authors show bounds on $L^{-1}$ for arbitrary many-body Hamiltonians and temperatures that scale linearly in $m$. Thus, although certainly nontrivial, these bounds do not improve the sample complexity in the regime we are interested in in this work, namely that of logarithmic sample complexity in system size. To obtain improvements in this regime, $L^{-1}$ needs to be at most polylogarithmic in system size. In the case of high-temperature Gibbs states, the recent work of [20] shows that this is indeed the case. I.e., in their Corollary 4.4 they show that for $\beta = \mathcal{O}(k^{-8})$, where $k$ is the locality of the Hamiltonian, we indeed have $L^{-1} = \mathcal{O}(\beta^{-2})$. It should also be noted that their results do not hold only for geometrically local Hamiltonians, but rather for any Hamiltonian such that each term acts on at most $k$ qudits. This implies that for the high-temperature regime, for which we also have the TC inequality of Thm. B.2, the improved sample complexity yielded by our methods holds. Note, however, that there is a slight mismatch between the inverse temperature ranges for which the two results hold: for strong convexity we need $\beta = \mathcal{O}(k^{-8})$, whereas for TC $\beta = \mathcal{O}(k^{-1})$ suffices.

b. Commuting Hamiltonians: as we will prove later in Prop. F.1, in the case of commuting Hamiltonians we have that $L^{-1} = \mathcal{O}(e^\beta\beta^{-2})$. Thus, for any constant inverse temperature $\beta > 0$ we have an improved sample complexity. However, in order to analyse ground states in 1D, our current proof techniques still require an inverse temperature scaling logarithmically in system size, so for such states we do not obtain improvements through strong convexity. We plan to address this gap in future work. Besides the cases mentioned above, we also considered the case of local circuits in this work. For those, no nontrivial estimates on $L$ are available to the best of our knowledge.

Appendix E: Regimes of efficient postprocessing

The only question we still have to answer is how to perform the postprocessing efficiently, namely how the parameter C_E appearing in Theorem C.1 scales and how we obtain the bounds in Table II. There have been many recent breakthroughs on approximating quantum Gibbs states efficiently on classical computers [21–25]. The gradient descent method only requires us to estimate the gradient of the log-partition function of a Gibbs state at each iteration. Thus, any algorithm that can approximate the log-partition function log Z(µ) efficiently, or approximate e(λ), suffices for our purposes.

For Gibbs states on a lattice, the methods of [22] yield that we can perform such approximations on a classical computer in time polynomial in n for temperatures β < β_c = 1/(8e³k), where k is the locality of the Hamiltonian, and ε inverse polynomial in system size. Thus, we conclude that for this temperature range, which coincides with the range for which the results of Thm. B.2 hold, C_E is polynomial in system size and we can also obtain efficient classical postprocessing.

For the case of Gibbs states of 1D systems, to the best of our knowledge the best results available are those of [25]. They show how to obtain efficient tensor network approximations for 1D Gibbs states for β = o(log(n)). As such tensor networks can also be contracted efficiently, this gives an efficient classical algorithm to compute local expectation values of such states, which suffices for our purposes. Thus, the results of [22, 25] ensure that for the systems considered in Thm. B.1 we can also perform the postprocessing efficiently on a classical computer.

It is also interesting to consider what happens if we have access to a quantum computer to estimate the gradient. We will discuss the implications of this in the next section for the case of commuting Hamiltonians. Finally, for local quantum circuits, we are not aware at this stage of any method that could yield a better postprocessing complexity than computing the partition function explicitly. This yields a postprocessing that is exponential in the system size, as it requires diagonalizing an exponentially large matrix.
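To make concrete that each gradient descent iteration only consumes the expectation vector e(µ), which any of the approximation algorithms above could supply, here is a toy sketch (our own illustrative construction; exact diagonalization of a tiny diagonal model stands in for the approximate oracle, and the model, step size, and names are our assumptions).

import numpy as np

beta, n_terms, dim = 0.7, 3, 16
rng = np.random.default_rng(1)
E = rng.choice([-1.0, 1.0], size=(n_terms, dim))      # diagonal commuting "E_i"

def gibbs_expectations(mu):
    h = E.T @ mu                                      # diagonal of H(mu)
    p = np.exp(-beta * h); p /= p.sum()
    return E @ p                                      # e(mu)_i = tr[sigma(mu) E_i]

lam = rng.uniform(-0.5, 0.5, n_terms)                 # "true" parameters lambda
target = gibbs_expectations(lam)                      # e(lambda): the input data

mu, eta = np.zeros(n_terms), 0.5
for _ in range(500):
    # gradient of the convex objective log Z(mu) + beta * <mu, e(lambda)>
    mu -= eta * beta * (target - gibbs_expectations(mu))
print("recovered mu:", np.round(mu, 3), " true lambda:", np.round(lam, 3))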

Appendix F: A complete picture: commuting Gibbs states

In this section, we discuss two classes of states for which the strongest version of our theorems holds, namely that of commuting 1D Gibbs states, and that of high-temperature commuting Gibbs states on arbitrary hypergraphs. We already discussed that they satisfy transportation cost inequalities in Thm. B.1. Thus, the last missing ingredient to obtain an optimal performance is to show that the partition function is indeed strongly convex. More precisely, we will now establish that, for these classes of states, both the upper and lower bounds on the Hessian of the log partition function are of order 1. In addition to that, with access to a quantum computer it is possible to perform the postprocessing in time Õ(m). As writing down the vector λ takes Ω(m) time, we conclude that the postprocessing can be performed in a time comparable to writing down a solution to the problem. Thus, our procedure is essentially optimal.

Also in the setting of commuting Gibbs states, it is worth noting that after the completion of this work we became aware of [63], which gives another method to learn the Gibbs state and its Hamiltonian that neither involves the maximum entropy method nor requires strong convexity guarantees. Their algorithm works by learning local reduced density matrices and showing that the parameters λ of the Hamiltonian of a commuting Gibbs state can be efficiently estimated from them. In principle, obtaining a bound on λ also suffices for our purposes, and we could alternatively use their methods to bypass having to solve the maximum entropy problem for such states. In particular, this means that the postprocessing with their methods could be performed even at temperatures at which the partition function cannot be estimated efficiently but we still have access to samples from the state. However, as we are ultimately interested in the regime in which TC inequalities hold, which corresponds to the high-temperature regime, we do not comment further on their results.

In this section, we consider a hypergraph G = (V,E) and assume that there is a maximum radius r₀ ∈ ℕ such that, for any hyperedge A ∈ E, there exists a vertex v ∈ V such that the ball B(v, r₀) centered at v and of radius r₀ includes A. In what follows, we also denote by S(v, r) the sphere of radius r centered at vertex v ∈ V, and define for all r ∈ ℕ:

B(r) := max_{v∈V} |B(v, r)| ,   S(r) := max_{v∈V} |S(v, r)| .

Next, we consider a Gibbs state σ(λ) on H_V := ⊗_{v∈V} H_v, where dim(H_v) = d for all v ∈ V. More precisely, σ(λ) := e^{−βH(λ)}/Z(λ), where, by a slight abuse of notation,

H(λ) = Σ_{i=1}^m λ_i E_i = Σ_{A∈E} Σ_{α_A ∈ [d²]^{|A|}} λ_{α_A} E_{α_A} ,   (F1)

with ‖E_i‖_∞ = 1 for all i ∈ {1, …, m}, [E_i, E_j] = 0 for all i, j ∈ {1, …, m}, and where the sets {1, …, m} and {α_A}_{A∈E, α_A∈[d²]^{|A|}} are in bijection. Note that B(r₀) bounds the maximal locality of the Hamiltonian.
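For concreteness, the following small sketch (our own; it uses only a commuting Z-type subfamily rather than the full operator basis [d²]^{|A|}, so that the commutation assumption holds by construction) instantiates the indexing of (F1): hyperedges of a toy hypergraph carry norm-one basis operators, with the flat index i in bijection with the pairs (A, α_A).

import numpy as np
from functools import reduce

Z, I2 = np.diag([1.0, -1.0]), np.eye(2)
V = range(4)
E_hyper = [(0, 1), (1, 2, 3)]                       # hyperedges of G = (V, E)

def pauli_z_string(support, pattern):
    # pattern in {0,1}^|A| selects I or Z on each vertex of the hyperedge
    ops = [Z if (v in support and pattern[support.index(v)]) else I2 for v in V]
    return reduce(np.kron, ops)

basis, index = [], []                               # bijection i <-> (A, alpha_A)
for A in E_hyper:
    for k in range(1, 2 ** len(A)):                 # skip the identity pattern
        pattern = [(k >> j) & 1 for j in range(len(A))]
        basis.append(pauli_z_string(A, pattern))
        index.append((A, tuple(pattern)))

print("m =", len(basis), "terms; all commute:",
      all(np.allclose(a @ b, b @ a) for a in basis for b in basis))
print("all norm one:", all(np.abs(e).max() == 1.0 for e in basis))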

We also denote by σ_A(λ) the Gibbs state corresponding to the restriction of H to the region A, i.e.

σ_A(λ) := e^{−βH_A(λ)} / tr[e^{−βH_A(λ)}] ,  where  H_A(λ) := Σ_{i: supp(E_i)∩A ≠ ∅} λ_i E_i .

Note that, in general, σ_A(λ) ≠ tr_{A^c}(σ(λ)).
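The following toy computation (our own illustration, for a classical Ising chain with local fields) makes the final remark concrete: the reduced state of σ(λ) on a region A and the Gibbs state σ_A(λ) of the restricted Hamiltonian generally disagree.

import numpy as np

n, beta = 4, 1.0
idx = np.arange(2**n)[:, None]
s = (1 - 2 * ((idx >> (n - 1 - np.arange(n))) & 1)).astype(float)  # site i <-> axis i
lam, h = np.array([0.9, -0.4, 0.7]), np.array([0.3, -0.2, 0.5, 0.1])
H = sum(l * s[:, i] * s[:, i + 1] for i, l in enumerate(lam)) + s @ h
p = np.exp(-beta * H); p /= p.sum()              # sigma(lambda), diagonal

# H_A: terms whose support intersects A = {0, 1}: edges (0,1), (1,2), fields on A
HA = (lam[0] * s[:, 0] * s[:, 1] + lam[1] * s[:, 1] * s[:, 2]
      + h[0] * s[:, 0] + h[1] * s[:, 1])
pA = np.exp(-beta * HA); pA /= pA.sum()          # sigma_A(lambda), diagonal

reduce_to_A = lambda q: q.reshape((2,) * n).sum(axis=(2, 3)).reshape(-1)
print("tr_{A^c} sigma(lambda):", np.round(reduce_to_A(p), 4))
print("sigma_A reduced to A  :", np.round(reduce_to_A(pA), 4))  # generally differ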

1. Upper bound on Hessian for commuting Gibbs states with decay of correlations

In this section, we prove tightened strong convexity constants for the log partition function in the case when the Gibbs state arises from a local commuting Hamiltonian at high temperature. In fact, the upper constant found in Proposition A.4 can be tightened into a system-size-independent one under the condition of exponential decay of correlations in the Gibbs state. Several notions of exponential decay of correlations exist in the literature [26]. Here, we will say that a Gibbs state σ has correlation length ξ if for all observables O_A, O_B supported on non-overlapping regions A and B respectively, we have that:

|tr[σ O_A ⊗ O_B] − tr[σ O_A] tr[σ O_B]| ≤ c ‖O_A‖_∞ ‖O_B‖_∞ e^{−ξ dist(A,B)}

for some constant c > 0. In the classical setting, this condition is known to hold at any inverse temperature for 1D systems, and below a critical inverse temperature β_c that depends on the locality of the Hamiltonian when D ≥ 2 [64]. The same bound also holds in the quantum case for 1D systems [65], or above a critical temperature on regular lattices when D ≥ 2 [21, 23]. Using these bounds, we obtain the following improvement of Proposition A.4, which shows that for this class of states U = O(1).

Lemma F.1 (Strengthened upper strong convexity at high temperature and for 1D systems). For each µ ∈ B_{ℓ∞}(0, 1), let σ(µ) be a Gibbs state at inverse temperature β < β_c corresponding to the Hamiltonian defined on the hypergraph G = (V,E) in (F1). Then

∇² log(Z(µ)) ≤ c β² B(r₀) B(2r₀) d^{2B(r₀)} ( Σ_{r=1}^∞ e^{−ξr} S(r) ) I ,

where ξ is the correlation length of the state. Moreover, this result holds for all β > 0 in 1D.

Proof. Let us first use the assumption of commutativity in order to simplify the expression for the Hessian. We find, for all α_A ∈ [d²]^{|A|}, α'_B ∈ [d²]^{|B|}, A, B ∈ E,

(∇² log(Z(µ)))_{α_A α'_B} = β² tr[ σ(µ) (E_{α_A} − tr[σ(µ) E_{α_A}]) (E_{α'_B} − tr[σ(µ) E_{α'_B}]) ] .   (F2)

The rest follows similarly to Proposition A.4 from Gershgorin's circle theorem together with the decay of correlations arising at β < β_c: for all α_A,

Σ_{α'_B ≠ α_A} (∇² log(Z(µ)))_{α_A α'_B} ≤ c β² Σ_{α'_B ≠ α_A} e^{−ξ dist(A,B)} ,

where we also used that the basis operators {E_i} have operator norm 1. The claim then follows by bounding the number of basis operators whose support is at distance r of A for each r ∈ ℕ: the latter is bounded by the product of (i) the number of vertices in A; (ii) the number of vertices at distance r of a given vertex; (iii) the number of hyperedges containing a given vertex; and (iv) the number of basis operators corresponding to a given hyperedge. A simple estimate of each of these quantities gives the bound B(r₀) S(r) B(2r₀) d^{2B(r₀)}. Therefore:

Σ_{α'_B ≠ α_A} (∇² log(Z(µ)))_{α_A α'_B} ≤ c β² B(r₀) B(2r₀) d^{2B(r₀)} Σ_{r=1}^∞ e^{−ξr} S(r) .

Note that for D-regular graphs and r₀ = O(1) we have B(r₀) B(2r₀) d^{2B(r₀)} = O(1) and S(r) = O(r^{D−1}), giving a scaling of ∇² log(Z(µ)) = O(β² ξ^{−D}).
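The following sketch (an illustrative check of the Gershgorin argument above, with our own toy diagonal model; names and parameters are our assumptions) computes the Hessian (F2) exactly for a 1D chain and compares its largest eigenvalue to the Gershgorin row-sum bound.

import numpy as np

n, beta = 8, 0.4
rng = np.random.default_rng(3)
idx = np.arange(2**n)[:, None]
s = (1 - 2 * ((idx >> (n - 1 - np.arange(n))) & 1)).astype(float)
E = [s[:, i] * s[:, i + 1] for i in range(n - 1)]   # Z_i Z_{i+1} diagonals

mu = rng.uniform(-1, 1, len(E))
H = sum(m * e for m, e in zip(mu, E))
p = np.exp(-beta * H); p /= p.sum()

means = np.array([p @ e for e in E])
hess = beta**2 * (np.array([[p @ (a * b) for b in E] for a in E])
                  - np.outer(means, means))         # Hessian (F2), commuting case
gersh = np.abs(hess).sum(axis=1).max()              # Gershgorin bound on U
print("lambda_max =", np.linalg.eigvalsh(hess).max(), "<= Gershgorin =", gersh)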

2. Lower bound on Hessian for commuting Gibbs states

Whenever the Gibbs state is assumed to be commuting, the lower strong convexity constant L can also be made independent of m, by a direct generalization of the classical argument as found in [66, Lemma 7] or [67] (see also [5]).

Before we state and prove our result, let us introduce a few useful concepts: given a full-rank quantum state σ, we denote the weighted 2-norm of an observable Y as

‖Y‖_{2,σ} := tr[ (σ^{1/4} Y σ^{1/4})² ]^{1/2} ,

and refer to the corresponding non-commutative weighted L₂ space with inner product ⟨X, Y⟩_σ := tr[σ^{1/2} X† σ^{1/2} Y] as L₂(σ). The Petz recovery map corresponding to a subregion A ⊂ V with respect to σ(µ) is defined as

R_{A,σ(µ)}(X) = tr_A(σ(µ))^{−1/2} tr_A(σ(µ)^{1/2} X σ(µ)^{1/2}) tr_A(σ(µ))^{−1/2} .

We will also need the notion of a conditional expectation E_A with respect to the state σ(µ) into the region A ⊂ Λ (see [68, 69] for more details). For instance, one can choose E_A := lim_{n→∞} R^n_{A,σ(µ)}, where R_{A,σ(µ)} is the Petz recovery map of σ(µ). In other words, the map E_A is a completely positive, unital map that projects onto the algebra N_A of fixed points of R_{A,σ(µ)}. This algebra is known to be expressed as the commutant [69]

N_A := { σ(µ)^{it} B(H_A) σ(µ)^{−it} ; t ∈ ℝ }′ .

Moreover, the maps E_A commute with the modular operator ∆_{σ(µ)}(·) := σ(µ)(·)σ(µ)^{−1}, and for any X_A, Z_A ∈ N_A and all Y ∈ B(H_V), E_A[X_A Y Z_A] = X_A E_A[Y] Z_A. The commutativity condition for H(µ) implies frustration freeness of the family of conditional expectations {E_A}_{A∈E}: for any X ∈ L₂(σ(µ)), ‖E_A(X)‖²_{2,σ(µ)} + ‖(id − E_A)(X)‖²_{2,σ(µ)} = ‖X‖²_{L₂(σ(µ))}.

The next technical lemma is essential in the derivation of our strong convexity lower bound. With a slight abuse of notation, we use the simplified notations σ_x(λ) := σ_{{x}}(λ), H_x(λ) := H_{{x}}(λ) and so on.
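For readers who want to experiment, here is a minimal, self-contained matrix implementation (our own illustration; all names are ours) of the weighted 2-norm and the Petz recovery map defined above, together with numerical checks of unitality and of contraction in the weighted norm.

import numpy as np
from scipy.linalg import fractional_matrix_power as fmp

rng = np.random.default_rng(4)
dA, dAc = 2, 4                         # dims of H_A (traced out) and of H_{A^c}
d = dA * dAc
M = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
sigma = M @ M.conj().T
sigma /= np.trace(sigma).real          # a generic full-rank state

def weighted_norm(Y, s):               # ||Y||_{2,s} = tr[(s^{1/4} Y s^{1/4})^2]^{1/2}
    T = fmp(s, 0.25) @ Y @ fmp(s, 0.25)
    return np.sqrt(np.trace(T @ T).real)

def ptrace_A(X):                       # partial trace over the first tensor factor
    return X.reshape(dA, dAc, dA, dAc).trace(axis1=0, axis2=2)

def petz_A(X, s):                      # R_{A,sigma}(X), embedded back into B(H_V)
    sA = fmp(ptrace_A(s), -0.5)
    s2 = fmp(s, 0.5)
    return np.kron(np.eye(dA), sA @ ptrace_A(s2 @ X @ s2) @ sA)

X = rng.normal(size=(d, d)); X = X + X.T     # a random observable
print("unital:", np.allclose(petz_A(np.eye(d), sigma), np.eye(d)))
print("contractive:", weighted_norm(petz_A(X, sigma), sigma) <= weighted_norm(X, sigma))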

Lemma F.2. Let H = Σ_j µ_j E_j be a local commuting Hamiltonian on the hypergraph G = (V,E) defined in (F1), where each local operator E_j is further assumed to be traceless. The following holds for any x ∈ V:

c(x, β) := max_{µ∈B_{ℓ∞}(0,1)}  ‖R_{x,σ_x(µ)}(H_x) − tr[σ_x(µ)H_x] I‖_{2,σ_x(µ)} / ‖H_x − tr[σ_x(µ)H_x] I‖_{2,σ_x(µ)}  < 1 .

Proof. We first prove that X = R_{x,σ_x(µ)}(X) is equivalent to ‖R_{x,σ_x(µ)}(X)‖_{2,σ_x(µ)} = ‖X‖_{2,σ_x(µ)}. One direction trivially holds. For the opposite direction, assume that X ≠ R_{x,σ_x(µ)}(X). This means that X = Y + Z, with Y, Z two operators that are orthogonal in L₂(σ_x(µ)), with R_{x,σ_x(µ)}(Y) = Y and Z ≠ 0. Now, since R_{x,σ_x(µ)} is self-adjoint and unital, it strictly contracts elements in the orthogonal complement of its fixed points and we have

‖R_{x,σ_x(µ)}(X)‖²_{2,σ_x(µ)} = ‖Y‖²_{2,σ_x(µ)} + ‖R_{x,σ_x(µ)}(Z)‖²_{2,σ_x(µ)}
  < ‖Y‖²_{2,σ_x(µ)} + ‖Z‖²_{2,σ_x(µ)}
  = ‖X‖²_{2,σ_x(µ)} ,

which contradicts the condition of equality of the norms. Now, since the map R_{x,σ_x(µ)} is unital, it suffices to prove that R_{x,σ_x(µ)}(H_x) ≠ H_x, or equivalently E_x(H_x) ≠ H_x, in order to conclude. Let us assume instead that equality holds. This means that, for all observables A_x supported on site x, and all t ∈ ℝ:

[H_x, σ(µ)^{it} A_x σ(µ)^{−it}] = σ(µ)^{it} [H_x, A_x] σ(µ)^{−it} = 0   ⟹   H_x = (I_x/d) ⊗ tr_x(H_x) .

However, this contradicts the fact that H_x is traceless on site x. Therefore R_{x,σ_x(µ)}(H_x) ≠ H_x and the proof follows.
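The following numerical sanity check (our own, restricted to the diagonal/classical case, where the Petz map reduces to a conditional expectation over site x) illustrates Lemma F.2: the ratio defining c(x, β) stays strictly below 1 over random choices of µ.

import numpy as np

n, beta, x = 3, 0.8, 0
rng = np.random.default_rng(2)
idx = np.arange(2**n)[:, None]
s = (1 - 2 * ((idx >> (n - 1 - np.arange(n))) & 1)).astype(float)  # site i <-> axis i
E = [s[:, x] * s[:, 1], s[:, x] * s[:, 2]]       # the terms H_x is built from

worst = 0.0
for _ in range(200):
    mu = rng.uniform(-1, 1, len(E))
    Hx = sum(m * e for m, e in zip(mu, E))       # diagonal of H_x(mu)
    p = np.exp(-beta * Hx); p /= p.sum()         # diagonal of sigma_x(mu)
    # Petz map over site x for diagonal operators = conditional expectation
    shape = (2,) * n
    num = (p * Hx).reshape(shape).sum(axis=x)
    den = p.reshape(shape).sum(axis=x)
    RHx = np.broadcast_to(np.expand_dims(num / den, x), shape).reshape(-1)
    mean = p @ Hx                                # = p @ RHx as well
    nrm = lambda Y: np.sqrt(p @ (Y - mean) ** 2) # centered weighted 2-norm
    if nrm(Hx) > 1e-12:
        worst = max(worst, nrm(RHx) / nrm(Hx))
print("max ratio over trials:", worst, "(Lemma F.2 predicts < 1)")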

We are ready to prove our strong convexity lower bound.

Proposition F.1 (Strengthened lower strong convexity constant for commuting Hamiltonians). For each µ ∈ B_{ℓ∞}(0, 1), let σ(µ) be a Gibbs state at inverse temperature β corresponding to the commuting Hamiltonian H(µ) = Σ_j µ_j E_j on the hypergraph G = (V,E) defined in (F1), where tr[E_i E_j] = 0 for all i ≠ j and each local operator E_j is traceless on its support. Then the Hessian of the log-partition function is lower bounded by

∇² log Z(µ) ≥ β² e^{−β(B(2r₀)+2B(4r₀))} d^{−B(2r₀)} (1 − c(β)²) I ,

where c(β) := max_{v∈V} c(v, β).

Proof. We first use the assumption of commutativity in order to simplify the expression for the Hessian. As in (F2), we find

(∇² log(Z(µ)))_{ij} = β² tr[ σ(µ) (E_i − tr[σ(µ) E_i]) (E_j − tr[σ(µ) E_j]) ] .

Therefore, for any linear combination H ≡ H(λ) = Σ_j λ_j E_j of the basis vectors, we have

Σ_{ij} λ_i λ_j (∇² log(Z(µ)))_{ij} = β² Var_{σ(µ)}(H) ,

with Var_{σ(µ)}(X) := ‖X − tr[σ(µ)X]‖²_{2,σ(µ)}. It is thus sufficient to lower bound the latter. For this, we choose a subregion A ⊆ V such that any basis element E_i has support intersecting a unique vertex in A. We lower bound the variance by

‖H − tr[σ(µ)H] I‖²_{2,σ(µ)} ≥ ‖(id − E_A)(H − tr[σ(µ)H] I)‖²_{2,σ(µ)} = ‖H − E_A[H]‖²_{2,σ(µ)} ,   (F3)

where the first inequality follows by the L₂(σ(µ)) contractivity of id − E_A. Now, the weighted norm can be further simplified into a sum of local weighted norms as follows: first, for any two E_i, E_j whose supports intersect with a different vertex of A, we have

⟨E_A[E_i], E_A[E_j]⟩_{σ(µ)} = tr[ σ(µ)^{1/2} E_A[E_i] σ(µ)^{1/2} E_A[E_j] ]
  = tr[ σ(µ) ∆_{σ(µ)}^{−1/2} ∘ E_A[E_i] E_A[E_j] ]
  = tr[ σ(µ) E_A[E_i] E_A[E_j] ] ,   (F4)

where in the third line we used the commutation of the modular operator ∆_{σ(µ)}(X) := σ(µ) X σ(µ)^{−1} with E_A, together with the commutativity of σ(µ) and E_i. Now, denoting supp(E_i) ∩ A = {x} and supp(E_j) ∩ A = {y}, we show that

E_A[E_i] = E_x[E_i]  and  E_A[E_j] = E_y[E_j] .   (F5)

In order to prove these two identities, we simply need to prove, for instance, that E_x[E_i] belongs to the image algebra N_A of E_A, since N_A ⊆ N_x by definition. Hence, it is enough to show that E_x[E_i] commutes with operators of the form σ(µ)^{it} X_A σ(µ)^{−it} for any t ∈ ℝ and any X_A ∈ B(H_A). This claim follows from

E_x[E_i] σ(µ)^{it} X_A σ(µ)^{−it} = σ(µ)^{it} σ(µ)^{−it} E_x[E_i] σ(µ)^{it} X_A σ(µ)^{−it}
  = σ(µ)^{it} E_x[σ(µ)^{−it} E_i σ(µ)^{it}] X_A σ(µ)^{−it}
  = σ(µ)^{it} E_x[E_i] X_A σ(µ)^{−it}
  = σ(µ)^{it} X_A E_x[E_i] σ(µ)^{−it}
  = σ(µ)^{it} X_A σ(µ)^{−it} E_x[E_i] ,

where the fourth line follows from the fact that the support of E_x[E_i] does not intersect A \ {x}, together with the fact that E_x[E_i] is locally proportional to I on site x, by definition of N_x. Therefore, using (F5) in (F4), we get

⟨E_A[E_i], E_A[E_j]⟩_{σ(µ)} = tr[σ(µ) E_x[E_i] E_y[E_j]]
  = tr[σ(µ) E_x E_y[E_i E_j]]
  = tr[σ(µ) E_i E_j]
  = ⟨E_i, E_j⟩_{σ(µ)} ,

where in the second line we used that E_x[E_i] is a fixed point of E_y, and then that E_j is a fixed point of E_x, by the support conditions of E_i and E_j. Therefore, the variance on the right hand side of (F3) can be simplified as

‖H − E_A[H]‖²_{2,σ(µ)} = Σ_{x∈A} ‖ Σ_{j: supp(E_j)∋x} λ_j (E_j − E_x[E_j]) ‖²_{2,σ(µ)}
  = Σ_{x∈A} ‖H_x − E_x[H_x]‖²_{2,σ(µ)} ,

where we recall that H_x := Σ_{j: supp(E_j)∋x} λ_j E_j. Now, for any x ∈ V, we denote x∂ := {A ∈ E | x ∈ A} and decompose the Hamiltonian H(µ) as

H(µ) = H_x(µ) + K_x^0(µ) + K_x^1(µ) ,  where  supp(K_x^1(µ)) ∩ x∂ = ∅  and  supp(K_x^0(µ)) ∩ x∂ ≠ ∅ .   (F6)

Clearly,

σ(µ) ≥ e^{−2β ‖K_x^0(µ)‖_∞} σ_x(µ) e^{−βK_x^1(µ)} / tr_{x∂^c}[e^{−βK_x^1(µ)}] .   (F7)

Now,

‖H_x − E_x[H_x]‖²_{2,σ(µ)} = tr[ σ(µ) (H_x − E_x[H_x])² ]
  ≥ e^{−2β ‖K_x^0(µ)‖_∞} tr[ σ_x(µ) (e^{−βK_x^1(µ)}/tr_{x∂^c}[e^{−βK_x^1(µ)}]) (H_x − E_x[H_x])² ]
  = e^{−2β ‖K_x^0(µ)‖_∞} ‖H_x − E_x[H_x]‖²_{2,σ_x(µ)}
  = e^{−2β ‖K_x^0(µ)‖_∞} ( ‖H_x − tr[σ_x(µ)H_x] I‖²_{2,σ_x(µ)} − ‖E_x[H_x] − tr[σ_x(µ)H_x] I‖²_{2,σ_x(µ)} )
  ≥ e^{−2β ‖K_x^0(µ)‖_∞} ( ‖H_x − tr[σ_x(µ)H_x] I‖²_{2,σ_x(µ)} − ‖R_{x,σ(µ)}[H_x] − tr[σ_x(µ)H_x] I‖²_{2,σ_x(µ)} )
  ≥ (1 − c(β)²) e^{−2β ‖K_x^0(µ)‖_∞} ‖H_x − tr[σ_x(µ)H_x] I‖²_{2,σ_x(µ)} .

The first and second identities above follow once again from the commutativity of the Hamiltonian, similarly to (F4), where for the second one we also use the disjointness of x∂ and supp(K_x^1(µ)). The first inequality follows from (F7). The third identity is a consequence of the fact that E_x is a projection with respect to L₂(σ_x(µ)). The second inequality follows from ‖R_{x,σ(µ)}[X]‖_{2,σ(µ)} ≥ ‖E_x[X]‖_{2,σ(µ)} for all X (see Proposition 10 in [68]). The last inequality is a consequence of Lemma F.2. Finally, we further bound the weighted L₂ norm on the last line of the above inequality in terms of the Schatten 2-norm in order to get

Var_{σ(µ)}(H) ≥ (1 − c(β)²) min_{x∈V} e^{−2β ‖K_x^0(µ)‖_∞} λ_min(σ_x(µ)) Σ_{x∈A} ‖H_x − tr[σ_x(µ) H_x]‖²_2   (F8)
  ≥ (1 − c(β)²) min_{x∈V} e^{−2β ‖K_x^0(µ)‖_∞} λ_min(σ_x(µ)) Σ_i λ_i² ,   (F9)

where λ_min(σ_x(µ)) denotes the minimum eigenvalue of σ_x(µ). The result follows from the simple estimates ‖K_x^0(µ)‖_∞ ≤ B(4r₀) and λ_min(σ_x(µ)) ≥ e^{−βB(2r₀)} d^{−B(2r₀)}.

3. Summary for 1D or high-temperature commuting Gibbs states

Now we have all the elements in place to essentially give the last word on estimating Lipschitz observables for nearest-neighbor 1D or high-temperature commuting Gibbs states.

Theorem F.1. For each µ ∈ B_{ℓ∞}(0, 1), let σ(µ) be a Gibbs state at inverse temperature β corresponding to the commuting Hamiltonian H(µ) = Σ_{j=1}^m µ_j E_j on the hypergraph G = (V,E) defined in (F1), where tr[E_i E_j] = 0 for all i ≠ j, each local operator E_j is traceless on its support and acts on a constant number of qubits, and m = O(n). Moreover, assume that σ(λ) satisfies the conditions of Thm. B.1. Then O(log(n) ε⁻²) samples of σ(λ) suffice to obtain a µ ∈ B_{ℓ∞}(0, 1) satisfying

tr[O(σ(λ) − σ(µ))] = O(√n ε ‖O‖_{Lip,∇}) ,

with probability at least 1 − p. Moreover, we can find µ in O(poly(n, ε⁻¹)) time on a classical computer. With access to a quantum computer, the postprocessing can be performed in Õ(n ε⁻²) time by only implementing quantum circuits of Õ(1) depth.

Proof. To obtain the claim on the sample complexity, note that for such systems L = Ω(1) by Prop. F.1 and they satisfy a transportation cost inequality by Thm. B.1. Moreover, we can learn the expectations of all E_i up to an error ε and failure probability δ with O(log(nδ⁻¹) ε⁻²) samples using shadows, as they all have constant support. The claimed sample complexity then follows from Thm. C.1. The classical postprocessing result follows from [22]. The postprocessing with a quantum computer follows from the results of [68], [26], combined with the fact that L⁻¹U = O(1), by also invoking Lemma F.1. Indeed, [26, 68] assert that we can approximately prepare any σ(µ) on a quantum computer with a circuit of depth Õ(1). Moreover, once again resorting to shadows, we can estimate e(µ) with Õ(ε⁻²) samples. We conclude that we can run each iteration of gradient descent in Õ(n ε⁻²) time. As L⁻¹U = O(1), Thm. A.1 asserts that we converge after Õ(1) iterations, which yields the claim.

The theorem above nicely showcases how transportation cost inequalities and strong convexity bounds come together to improve maximum entropy methods. Moreover, with access to a quantum computer, up to polylogarithmic factors in system size, the computational complexity of learning the Gibbs state is comparable to reading out the results of measurements on the system. Together with the polylogarithmic sample complexity bounds that we obtain, this justifies our claim that the result above almost gives the last word on learning such states.
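To close, here is an end-to-end toy sketch (our own illustration; empirical means over computational-basis samples stand in for classical shadows, and exact diagonalization of a tiny diagonal model stands in for the Gibbs oracle; all names and parameters are our assumptions) of the pipeline behind Theorem F.1: estimate the expectations of the E_i from samples of σ(λ), then run the strongly convex maximum entropy gradient descent.

import numpy as np

n, beta, n_samples = 6, 0.5, 20000
rng = np.random.default_rng(5)
idx = np.arange(2**n)[:, None]
s = (1 - 2 * ((idx >> (n - 1 - np.arange(n))) & 1)).astype(float)
E = [s[:, i] * s[:, i + 1] for i in range(n - 1)]   # constant-support commuting E_i

def expectations(mu):
    H = sum(m * e for m, e in zip(mu, E))
    p = np.exp(-beta * H); p /= p.sum()
    return p, np.array([p @ e for e in E])

lam = rng.uniform(-0.8, 0.8, len(E))                # unknown parameters lambda
p_true, _ = expectations(lam)
samples = rng.choice(2**n, size=n_samples, p=p_true)    # "measurement" outcomes
target = np.array([e[samples].mean() for e in E])       # shadow-style estimates

mu = np.zeros(len(E))
for _ in range(400):
    _, e_mu = expectations(mu)
    mu -= 0.5 * beta * (target - e_mu)              # gradient descent step
print("max |mu - lambda| =", np.abs(mu - lam).max())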