Information-theoretic bounds on quantum advantage in machine learning
Hsin-Yuan Huang (1, 2), Richard Kueng (3), and John Preskill (1, 2, 4, 5)
(1) Institute for Quantum Information and Matter, Caltech, Pasadena, CA, USA
(2) Department of Computing and Mathematical Sciences, Caltech, Pasadena, CA, USA
(3) Institute for Integrated Circuits, Johannes Kepler University Linz, Austria
(4) Walter Burke Institute for Theoretical Physics, Caltech, Pasadena, CA, USA
(5) AWS Center for Quantum Computing, Pasadena, CA, USA
(Dated: April 5, 2021)

We study the performance of classical and quantum machine learning (ML) models in predicting outcomes of physical experiments. The experiments depend on an input parameter $x$ and involve execution of a (possibly unknown) quantum process $\mathcal{E}$. Our figure of merit is the number of runs of $\mathcal{E}$ required to achieve a desired prediction performance. We consider classical ML models that perform a measurement and record the classical outcome after each run of $\mathcal{E}$, and quantum ML models that can access $\mathcal{E}$ coherently to acquire quantum data; the classical or quantum data is then used to predict outcomes of future experiments. We prove that for any input distribution $\mathcal{D}(x)$, a classical ML model can provide accurate predictions on average by accessing $\mathcal{E}$ a number of times comparable to the optimal quantum ML model. In contrast, for achieving accurate prediction on all inputs, we prove that exponential quantum advantage is possible. For example, to predict expectations of all Pauli observables in an $n$-qubit system $\rho$, classical ML models require $2^{\Omega(n)}$ copies of $\rho$, but we present a quantum ML model using only $O(n)$ copies. Our results clarify where quantum advantage is possible and highlight the potential for classical ML models to address challenging quantum problems in physics and chemistry.
arXiv:2101.02464v2 [quant-ph] 2 Apr 2021

I. INTRODUCTION

The widespread applications of machine learning (ML) to problems of practical interest have fueled interest in machine learning using quantum platforms [20, 47, 79]. Though many potential applications of quantum ML have been proposed, so far the prospect for quantum advantage in solving purely classical problems remains unclear [10, 41, 85, 86]. On the other hand, it seems plausible that quantum ML can be fruitfully applied to problems faced by quantum scientists, such as characterizing the properties of quantum systems and predicting the outcomes of quantum experiments [6, 28, 30, 40, 67, 81, 88].

Here we focus on an important class of learning problems motivated by quantum mechanics. Namely, we are interested in predicting functions of the form

$$ f(x) = \mathrm{tr}\big(O\,\mathcal{E}(|x\rangle\langle x|)\big), \tag{1} $$

where $x$ is a classical input, $\mathcal{E}$ is an arbitrary (possibly unknown) completely positive and trace-preserving (CPTP) map, and $O$ is a known observable. Equation (1) encompasses any physical process that takes a classical input and produces a real number as output. The goal is to construct a function $h(x)$ that accurately approximates $f(x)$ after accessing the physical process $\mathcal{E}$ as few times as possible.

A particularly important special case of setup (1) is training an ML model to predict what would happen in physical experiments [67]. Such experiments might explore, for instance, the outcome of a reaction in quantum chemistry [96], ground state properties of a novel molecule or material [17, 27, 40, 59, 73, 74, 92], or the behavior of neutral atoms in an analog quantum simulator [19, 25, 63]. In these cases, the input $x$ subsumes parameters that characterize the process, e.g., chemicals involved in the reaction, a description of the molecule, or the intensity of lasers that control the neutral atoms. The map $\mathcal{E}$ characterizes a quantum evolution happening in the lab. Depending on the parameter $x$, it produces the quantum state $\mathcal{E}(|x\rangle\langle x|)$. Finally, the experimentalist measures a certain observable $O$ at the end of the experiment. The goal is to predict the measurement outcome for new physical experiments, with new values of $x$ that have not been encountered during the training process.

Motivated by these concrete applications, we want to understand the power of classical and quantum ML models in predicting functions of the form given in Equation (1). On the one hand, we consider classical ML models that can gather classical measurement data $\{(x_i, o_i)\}_{i=1}^{N_C}$, where $o_i$ is the outcome when we perform a POVM measurement on the state $\mathcal{E}(|x_i\rangle\langle x_i|)$. We denote by $N_C$ the number of such experiments performed during training in the classical ML setting. On the other hand, we consider quantum ML models in which multiple runs of the CPTP map $\mathcal{E}$ can be composed coherently to collect quantum data, and predictions are produced by a quantum computer with access to the quantum data. We denote by $N_Q$ the number of times $\mathcal{E}$ is used during training in the quantum setting. The classical and quantum ML settings are illustrated in Figure 1.

We focus on the question of whether quantum ML can have a large advantage over classical ML: to achieve a small prediction error, can the optimal $N_Q$ in the quantum ML setting be much less than the optimal $N_C$ in the classical ML setting? For the purpose of this comparison, we disregard the runtime of the classical or quantum ML models that generate the predictions; we are only interested in how many times the process $\mathcal{E}$ must run during the learning phase in the quantum and classical settings.

Our first main result addresses small average prediction error, i.e. the prediction error $|h(x) - f(x)|^2$ averaged over some specified input distribution $\mathcal{D}(x)$. We rigorously show that, for any $\mathcal{E}$, $O$, and $\mathcal{D}$, and for any quantum ML model, one can always design a classical ML model achieving a similar average prediction error such that $N_C$ is larger than $N_Q$ by at worst a small polynomial factor. Hence, there is no
[Figure 1 graphic. Left panel, "Classical Machine Learning": classical input, physical experiments (CPTP map) each followed by a measurement, classical processing, classical output, prediction model stored in classical memory. Right panel, "Quantum Machine Learning": coherent quantum state input, entangled physical experiments (CPTP map), quantum processing, coherent quantum state output, prediction model stored in quantum memory.]

Figure 1: Illustration of classical and quantum machine learning settings: The goal is to learn about an unknown CPTP map $\mathcal{E}$ by performing physical experiments. (Left) In the learning phase of the classical ML setting, a measurement is performed after each query to $\mathcal{E}$; the classical measurement outcomes collected during the learning phase are consulted during the prediction phase. (Right) In the learning phase of the quantum ML setting, multiple queries to $\mathcal{E}$ may be included in a single coherent quantum circuit, yielding an output state stored in a quantum memory; this stored quantum state is consulted during the prediction phase.
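As an illustrative aside, the classical learning phase in the left panel can be simulated for a toy process. The sketch below is not the paper's numerical experiment. It assumes a hypothetical one-parameter family of two-qubit unitary channels $\mathcal{E}_\theta$, the observable $O = Z \otimes I$, a uniform input distribution, and a small finite grid of candidate models standing in for the packing net used in the least-squares selection of Eq. (8) in Section III.

```python
import numpy as np

# Illustrative assumptions (not from the paper): a toy family of two-qubit
# unitary channels E_theta, the observable O = Z (x) I, and a uniform D(x).
I2 = np.eye(2)
Z = np.diag([1.0, -1.0])
O = np.kron(Z, I2)

# CNOT with the second qubit as control and the first qubit as target,
# in the basis ordering |q0 q1> -> index 2*q0 + q1.
CNOT = np.array([[1, 0, 0, 0],
                 [0, 0, 0, 1],
                 [0, 0, 1, 0],
                 [0, 1, 0, 0]], dtype=float)


def ry(theta):
    """Single-qubit rotation about the Y axis."""
    c, s = np.cos(theta / 2), np.sin(theta / 2)
    return np.array([[c, -s], [s, c]])


def f_theta(theta):
    """Vector of f(x) = tr(O E_theta(|x><x|)) for the inputs x = 0..3."""
    U = np.kron(ry(theta), I2) @ CNOT
    # Column x of U is the output state U|x>; U is real, so this is <x|U^T O U|x>.
    return np.einsum("ix,ij,jx->x", U, O, U)


def run_experiments(theta, xs, rng):
    """Measure O once per input: outcome +1 or -1 with mean f(x)."""
    p_plus = (1.0 + f_theta(theta)[xs]) / 2.0
    return np.where(rng.random(len(xs)) < p_plus, 1.0, -1.0)


rng = np.random.default_rng(0)
true_theta = np.pi / 4
candidates = np.linspace(0.0, np.pi, 9)  # hypothetical finite candidate set

# Learning phase: sample inputs from D (uniform here) and record outcomes.
N_C = 4000
xs = rng.integers(0, 4, size=N_C)
outcomes = run_experiments(true_theta, xs, rng)

# Least-squares selection over the candidate set, in the spirit of Eq. (8).
losses = [np.mean((f_theta(t)[xs] - outcomes) ** 2) for t in candidates]
best_theta = candidates[int(np.argmin(losses))]

# Average-case prediction error of the selected model under uniform D.
avg_error = np.mean((f_theta(best_theta) - f_theta(true_theta)) ** 2)
print(best_theta, avg_error)
```

Each simulated run returns a single $\pm 1$ outcome whose mean is $f(x)$, so no individual data point reveals $f(x)$; the fit succeeds only because the least-squares loss separates candidates once enough runs are collected.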
exponential advantage of quantum ML over classical ML if the goal is to achieve a small average prediction error, and if the efficiency is quantified by the number of times $\mathcal{E}$ is used in the learning process. This statement holds for existing quantum ML models running on near-term devices [47, 53, 79] and future quantum ML models yet to be conceived. We note, though, that while there is no large advantage in query complexity, a substantial quantum advantage in computational complexity is possible [80].

However, the situation changes if the goal is to achieve a small worst-case prediction error rather than a small average prediction error: an exponential separation between $N_C$ and $N_Q$ becomes possible if we insist on predicting $f(x) = \mathrm{tr}(O\,\mathcal{E}(|x\rangle\langle x|))$ accurately for every input $x$. We illustrate this point with an example: accurately predicting expectation values of Pauli observables in an unknown $n$-qubit quantum state $\rho$. This is a crucial subroutine in many quantum computing applications; see e.g. [34, 52, 54, 56–58, 61, 74]. We present a quantum ML model that uses $N_Q = O(n)$ copies of $\rho$ to predict expectation values of all $n$-qubit Pauli observables. In contrast, we prove that any classical ML model requires $N_C = 2^{\Omega(n)}$ copies of $\rho$ to achieve the same task even if the ML model can perform arbitrary adaptive single-copy POVM measurements.

II. MACHINE LEARNING SETTINGS

We assume that the observable $O$ (with $\|O\| \leq 1$) is known and the physical experiment $\mathcal{E}$ is an unknown CPTP map that belongs to a set of CPTP maps $\mathcal{F}$. Apart from $\mathcal{E} \in \mathcal{F}$, the process can be arbitrary, a common assumption in statistical learning theory [12, 15, 21, 87, 89]. For the sake of concreteness, we assume that $\mathcal{E}$ is a CPTP map from a Hilbert space of $n$ qubits to a Hilbert space of $m$ qubits. Regarding inputs, we consider bit-strings of size $n$: $x \in \{0,1\}^n$. This is not a severe restriction, since floating-point representations of continuous parameters can always be truncated to a finite number of digits. We now give precise definitions for classical and quantum ML settings; see Fig. 1 for an illustration.

a. Classical (C) ML: The ML model consists of two phases: learning and prediction. During the learning phase, a randomized algorithm selects classical inputs $x_i$ and we perform a (quantum) experiment that results in an outcome $o_i$ from performing a POVM measurement on $\mathcal{E}(|x_i\rangle\langle x_i|)$. A total of $N_C$ experiments give rise to the classical training data $\{(x_i, o_i)\}_{i=1}^{N_C}$. After obtaining this training data, the ML model executes a randomized algorithm $\mathcal{A}$ to learn a prediction model

$$ s_C = \mathcal{A}\big(\{(x_1, o_1), \ldots, (x_{N_C}, o_{N_C})\}\big), \tag{2} $$

where $s_C$ is stored in the classical memory. In the prediction phase, a sequence of new inputs $\tilde{x}_1, \tilde{x}_2, \ldots \in \{0,1\}^n$ is provided. The ML model will use $s_C$ to evaluate predictions $h_C(\tilde{x}_1), h_C(\tilde{x}_2), \ldots$ that approximate $f(\tilde{x}_1), f(\tilde{x}_2), \ldots$ up to small errors.

b. Restricted classical ML: We will also consider a restricted version of the classical setting. Rather than performing arbitrary POVM measurements, we restrict the ML model to measure the target observable $O$ on the output state $\mathcal{E}(|x_i\rangle\langle x_i|)$ to obtain the
measurement outcome $o_i$. In this case, we always have $o_i \in \mathbb{R}$ and $\mathbb{E}[o_i] = \mathrm{tr}(O\,\mathcal{E}(|x_i\rangle\langle x_i|))$.

c. Quantum (Q) ML: During the learning phase, the model starts with an initial state $\rho_0$ in a Hilbert space of arbitrarily high dimension. Subsequently, the quantum ML model accesses the unknown CPTP map $\mathcal{E}$ a total of $N_Q$ times. These queries are interleaved with quantum data processing steps:

$$ \rho_{\mathcal{E}} = \mathcal{C}_{N_Q}(\mathcal{E} \otimes \mathcal{I})\,\mathcal{C}_{N_Q - 1} \cdots \mathcal{C}_1(\mathcal{E} \otimes \mathcal{I})(\rho_0), \tag{3} $$

where each $\mathcal{C}_i$ is an arbitrary but known CPTP map, and we write $\mathcal{E} \otimes \mathcal{I}$ to emphasize that $\mathcal{E}$ acts on an $n$-qubit subsystem of a larger quantum system. The final state $\rho_{\mathcal{E}}$, encoding the prediction model learned from the queries to the unknown CPTP map $\mathcal{E}$, is stored in a quantum memory. In the prediction phase, a sequence of new inputs $\tilde{x}_1, \tilde{x}_2, \ldots \in \{0,1\}^n$ is provided. A quantum computer with access to the stored quantum state $\rho_{\mathcal{E}}$ executes a computation to produce prediction values $h_Q(\tilde{x}_1), h_Q(\tilde{x}_2), \ldots$ that approximate $f(\tilde{x}_1), f(\tilde{x}_2), \ldots$ up to small errors.¹

¹ Due to non-commutativity of quantum measurements, the ordering of new inputs matters. For instance, the two lists $\tilde{x}_1, \tilde{x}_2$ and $\tilde{x}_2, \tilde{x}_1$ can lead to different outcome predictions $h_Q(\tilde{x}_i)$. Our main results do not depend on this subtlety; they are valid irrespective of prediction input ordering.

The quantum ML setting is strictly more powerful than the classical ML setting. During the prediction phase, classical ML models are restricted to processing classical data, albeit data obtained by measuring a quantum system during the learning phase. In contrast, quantum ML models can work directly with the quantum data and perform quantum data processing. A quantum ML model can have an exponential advantage relative to classical ML models for some tasks, as we demonstrate in Sec. IV.

III. AVERAGE-CASE PREDICTION ERROR

For a prediction model $h(x)$, we consider the average-case prediction error

$$ \sum_{x \in \{0,1\}^n} \mathcal{D}(x)\,\big|h(x) - \mathrm{tr}(O\,\mathcal{E}(|x\rangle\langle x|))\big|^2, \tag{4} $$

with respect to a fixed distribution $\mathcal{D}$ over inputs. This could, for instance, be the uniform distribution.

Although learning from quantum data is strictly more powerful than learning from classical data, there are fundamental limitations. The following rigorous statement limits the potential for quantum advantage.

Theorem 1. Fix an $n$-bit probability distribution $\mathcal{D}$, an $m$-qubit observable $O$ ($\|O\| \leq 1$) and a set $\mathcal{F}$ of CPTP maps with $n$ input qubits and $m$ output qubits. Suppose there is a quantum ML model which accesses the map $\mathcal{E} \in \mathcal{F}$ $N_Q$ times, producing with high probability a function $h_Q(x)$ that achieves

$$ \sum_{x \in \{0,1\}^n} \mathcal{D}(x)\,\big|h_Q(x) - \mathrm{tr}(O\,\mathcal{E}(|x\rangle\langle x|))\big|^2 \leq \epsilon. \tag{5} $$

Then there is an ML model in the restricted classical setting which accesses $\mathcal{E}$ $N_C = O(m N_Q/\epsilon)$ times and produces with high probability a function $h_C$ that achieves

$$ \sum_{x \in \{0,1\}^n} \mathcal{D}(x)\,\big|h_C(x) - \mathrm{tr}(O\,\mathcal{E}(|x\rangle\langle x|))\big|^2 = O(\epsilon). \tag{6} $$

Proof sketch. The proof consists of two parts. First, we cover the entire set of CPTP maps $\mathcal{F}$ with a maximal packing net, i.e. the largest subset $S = \{\mathcal{E}_s\}_{s=1}^{|S|} \subset \mathcal{F}$ such that the functions $f_{\mathcal{E}_s}(x) = \mathrm{tr}(O\,\mathcal{E}_s(|x\rangle\langle x|))$ obey $\sum_{x \in \{0,1\}^n} \mathcal{D}(x)\,|f_{\mathcal{E}_s}(x) - f_{\mathcal{E}_{s'}}(x)|^2 > 4\epsilon$ whenever $s \neq s'$. We then set up a communication protocol as follows. Alice chooses an element $s$ of the packing net uniformly at random, records her choice $s$, and then applies $\mathcal{E}_s$ $N_Q$ times to prepare a quantum state $\rho_{\mathcal{E}_s}$ as in Eq. (3). Alice's random ensemble of quantum states is thus given by

$$ \rho_{\mathcal{E}_s} \quad \text{with probability} \quad p_s = \frac{1}{|S|} \tag{7} $$

for $s = 1, \ldots, |S|$. Alice then sends the randomly sampled quantum state $\rho_{\mathcal{E}_s}$ to Bob, hoping that Bob can decode the state $\rho_{\mathcal{E}_s}$ to recover her chosen message $s$. Using the quantum ML model, Bob can produce the function $h_{Q,s}(x)$. Because by assumption the function $h_{Q,s}(x)$ achieves a small average-case prediction error with high probability, and because the packing net has been constructed so that the functions $\{f_{\mathcal{E}_s}\}$ are sufficiently distinguishable, Bob can determine $s$ successfully with high probability. Because Alice chose from among $|S|$ possible messages, the mutual information of the chosen message $s$ and Bob's measurement outcome must be at least of order $\log |S|$ bits. According to Holevo's theorem, the Holevo $\chi$ quantity of Alice's ensemble Eq. (7) upper bounds this mutual information, and therefore must also be $\chi = \Omega(\log |S|)$. Furthermore, we can analyze how $\chi$ depends on $N_Q$, finding that each additional application of $\mathcal{E}_s$ can increase $\chi$ by at most $O(m)$. We conclude that $\chi = O(m N_Q)$, yielding the lower bound $N_Q = \Omega(\log(|S|)/m)$. The lower bound applies to any quantum ML model, where the size $|S|$ of the packing net depends on the average-case prediction error $\epsilon$. This completes the first part of the proof.

In the second part, we explicitly construct an ML model in the restricted classical setting that achieves a small average-case prediction error using a modest number of experiments. In this ML model, an input $x_i$ is selected by sampling from the probability distribution $\mathcal{D}$, and an experiment is performed in which the observable $O$ is measured in the output quantum state $\mathcal{E}(|x_i\rangle\langle x_i|)$, obtaining measurement outcome $o_i$ which has expectation value $\mathrm{tr}(O\,\mathcal{E}(|x_i\rangle\langle x_i|))$. A total of $N_C$ such experiments are conducted. Then, the ML model minimizes the least-squares error to find the best fit within the aforementioned maximal packing net $S$:

$$ h_C = \arg\min_{f \in S} \frac{1}{N_C} \sum_{i=1}^{N_C} |f(x_i) - o_i|^2. \tag{8} $$

Because the measurement outcome $o_i$ fluctuates about the expectation value of $O$, it may be impossible to achieve zero training error. Yet it is still possible for $h_C$ to achieve a small average-case prediction error, potentially even smaller than the training error. We use properties of maximal packing nets and of quantum fluctuations of measurement outcomes to perform a tight statistical analysis of the average-case prediction error, finding that with high probability $\sum_{x \in \{0,1\}^n} \mathcal{D}(x)\,|h_C(x) - \mathrm{tr}(O\,\mathcal{E}(|x\rangle\langle x|))|^2 = O(\epsilon)$, provided that $N_C$ is of order $\log(|S|)/\epsilon$.

Finally, we combine the two parts to conclude $N_C = O(m N_Q/\epsilon)$. The full proof is in Appendix C.

Theorem 1 shows that all problems that are approximately learnable by a quantum ML model are also approximately learnable by some restricted classical ML model which executes the quantum process $\mathcal{E}$ a comparable number of times. This applies, in particular, to predicting outputs of quantum-mechanical processes. The relation $N_C = O(m N_Q/\epsilon)$ is tight. We give an example in Appendix D with $N_C = \Omega(m N_Q/\epsilon)$.

For the task of learning classical Boolean circuits, fundamental limits on quantum advantage have been established in previous work [11–13, 31, 80, 94]. Theorem 1 generalizes these existing results to the task of learning outcomes of quantum processes.

IV. WORST-CASE PREDICTION ERROR

Rather than achieving a small average prediction error, one may be interested in obtaining a prediction model that is accurate for all inputs $x \in \{0,1\}^n$. For a prediction model $h(x)$, we consider the worst-case prediction error to be

$$ \max_{x \in \{0,1\}^n} \big|h(x) - \mathrm{tr}(O\,\mathcal{E}(|x\rangle\langle x|))\big|^2. \tag{9} $$

Under such a stricter performance requirement, exponential quantum advantage becomes possible.

We highlight this potential by means of an illustrative and practically relevant example: predicting expectation values of Pauli operators in an unknown $n$-qubit quantum state $\rho$. This is a central task for many quantum computing applications [34, 52, 54, 56–58, 61, 74]. To formulate this problem in our framework, suppose the $2n$-bit input $x$ specifies one of the $4^n$ $n$-qubit Pauli operators $P_x \in \{I, X, Y, Z\}^{\otimes n}$, and suppose that $\mathcal{E}_\rho(|x\rangle\langle x|)$ prepares the unknown state $\rho$ and maps $P_x$ to the fixed observable $O$, which is then measured; hence

$$ f(x) = \mathrm{tr}\big(O\,\mathcal{E}_\rho(|x\rangle\langle x|)\big) = \mathrm{tr}(P_x \rho). \tag{10} $$

In this setting, according to Theorem 1, there is no large quantum advantage if our goal is to estimate the Pauli operator expectation values with a small average prediction error. However, an exponential quantum advantage is possible if we insist on accurately predicting every one of the $4^n$ Pauli observables.

First we show there is an efficient quantum ML model that achieves a small prediction error. Details are in Appendix E 2; here we just sketch the main ideas. The procedure for predicting $\mathrm{tr}(P_x \rho)$ has two stages. The goal of the first stage is to predict the absolute value $|\mathrm{tr}(P_x \rho)|$ for each $x$, and the goal of the second stage is to determine the sign of $\mathrm{tr}(P_x \rho)$. The key idea used in the first stage is that, although two different Pauli operators $P_x$ and $P_y$ may either commute or anticommute, the tensor products $P_x \otimes P_x$ and $P_y \otimes P_y$ are mutually commuting for all $x$ and $y$. Therefore, although it is not possible to measure anticommuting Pauli operators simultaneously using a single copy of the state $\rho$, it is possible to measure $P_x \otimes P_x$ simultaneously for all $x$ using two copies of $\rho$. Indeed, all $4^n$ expectation values $\mathrm{tr}((P_x \otimes P_x)(\rho \otimes \rho)) = \mathrm{tr}(P_x \rho)^2$ can be determined by measuring pairs of qubits in the Bell basis, which is highly efficient. This completes the first stage.

If $|\mathrm{tr}(P_x \rho)|$ is found to be small in the first stage, we may predict $h(x) = 0$ and be assured that the prediction error is small. Therefore, in the second stage, we need only determine the sign if $|\mathrm{tr}(P_x \rho)|$ was found to be reasonably large in the first stage. In that case we can perform a coherent measurement across several copies of $\rho$ which performs a majority vote and yields the correct value of the sign with high success probability. Because the measurement is strongly biased in favor of one of the two possible outcomes, it introduces only a very "gentle" disturbance of the pre-measurement state. Therefore, by performing many such measurements in succession on the same quantum memory register, we can determine the sign of $\mathrm{tr}(P_x \rho)$ for many different values of $x$. The second stage can also be made more amenable to near-term implementation using a heuristic that groups commuting observables [56, 57]; see Appendix E 2 d for further discussion. Each of the two stages requires only a small number of copies of $\rho$; a careful analysis yields the following theorem.

Theorem 2. The quantum ML model only needs $N_Q = O(\log(M/\delta)/\epsilon^4)$ copies of $\rho$ to predict expectation values of any $M$ Pauli observables to error $\epsilon$ with probability at least $1 - \delta$.

More details regarding the quantum ML model, as well as a rigorous proof, are provided in Appendix E 2. The sample complexity stated in Theorem 2 improves upon previously known shadow tomography protocols [4, 5, 26, 54] for the special case of predicting Pauli observables; see Appendix A. Because each access to $\mathcal{E}_\rho$ allows us to obtain one copy of $\rho$, we only need $N_Q = O(n)$ to predict expectation values of all $4^n$ Pauli observables up to a constant error.

For classical ML models, we prove the following fundamental lower bound; see Appendix E 4.

Theorem 3. Any classical ML model must use $N_C \geq 2^{\Omega(n)}$ copies of $\rho$ to predict expectation values of all Pauli observables up to a small error with a constant success probability.

This theorem holds even when the POVM measurements performed by the classical ML model depend adaptively on the previous POVM measurement outcomes. When combined with Theorem 2, Theorem 3 establishes an exponential gap separating classical ML models from fully quantum ML models. Table 2 provides a summary of the upper and lower
Model             Upp. bd.         Low. bd.
Quantum ML        $O(n)$           $\Omega(n)$
Classical ML      $O(n 2^n)$       $2^{\Omega(n)}$
Restricted CML    $O(2^{2n})$      $\Omega(2^{2n})$

Table 2: Sample complexity for predicting expectations of all $4^n$ Pauli observables (worst-case prediction error) in an $n$-qubit quantum state. Upp. bd. is the achievable sample complexity of a specific algorithm. Low. bd. is the lower bound for any algorithm. The classical ML upper bound can be achieved using classical shadows based on random Clifford measurements [54]. The rest of the bounds are obtained in Appendix E.

Figure 3: Numerical experiments: number of copies of an unknown $n$-qubit state needed for predicting expectation values of all $4^n$ Pauli observables, with constant worst-case prediction error. Mixed states: Quantum states of the form $(I + P)/2^n$, where $P$ is an $n$-qubit Pauli observable. Product states: Tensor products of single-qubit stabilizer states.
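The two algebraic facts behind the first stage of the quantum ML model in Section IV can be checked numerically on a small system. The sketch below is only a sanity check under stated assumptions: it uses $n = 2$ qubits and a randomly generated density matrix, and it verifies the identities by direct matrix algebra rather than simulating a Bell-basis measurement. The helper names `pauli_string` and `random_density_matrix` are hypothetical and introduced here for illustration.

```python
import itertools
import numpy as np

PAULIS = {
    "I": np.eye(2, dtype=complex),
    "X": np.array([[0, 1], [1, 0]], dtype=complex),
    "Y": np.array([[0, -1j], [1j, 0]], dtype=complex),
    "Z": np.array([[1, 0], [0, -1]], dtype=complex),
}


def pauli_string(label):
    """Tensor product of single-qubit Paulis, e.g. 'XZ' -> X (x) Z."""
    op = np.array([[1.0 + 0j]])
    for ch in label:
        op = np.kron(op, PAULIS[ch])
    return op


def random_density_matrix(dim, rng):
    """Random positive semidefinite matrix with unit trace (illustrative only)."""
    a = rng.normal(size=(dim, dim)) + 1j * rng.normal(size=(dim, dim))
    rho = a @ a.conj().T
    return rho / np.trace(rho).real


n = 2
rng = np.random.default_rng(1)
rho = random_density_matrix(2**n, rng)
rho2 = np.kron(rho, rho)  # two copies of the state

labels = ["".join(p) for p in itertools.product("IXYZ", repeat=n)]
# P_x (x) P_x acting on (copy 1) (x) (copy 2).
doubled = {lab: np.kron(pauli_string(lab), pauli_string(lab)) for lab in labels}

# Check tr((P (x) P)(rho (x) rho)) = tr(P rho)^2 for every Pauli string.
max_identity_gap = max(
    abs(np.trace(doubled[lab] @ rho2) - np.trace(pauli_string(lab) @ rho) ** 2)
    for lab in labels
)

# Check that all doubled operators commute, even when P_x and P_y anticommute.
max_commutator_norm = max(
    np.linalg.norm(doubled[a] @ doubled[b] - doubled[b] @ doubled[a])
    for a in labels
    for b in labels
)
print(max_identity_gap, max_commutator_norm)
```

The commutation check works because any sign picked up when swapping $P_x$ and $P_y$ appears once per copy and therefore squares to $+1$ on the doubled system.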
bounds on the sample complexity for predicting expectation values of Pauli observables.

V. NUMERICAL EXPERIMENTS

We support our theoretical findings with numerical experiments, focusing on the task of predicting the expectation values of all $4^n$ Pauli observables in an unknown $n$-qubit quantum state $\rho$, with a small worst-case prediction error. In this case, the function is $f(x) = \mathrm{tr}(O\,\mathcal{E}_\rho(|x\rangle\langle x|)) = \mathrm{tr}(P_x \rho)$, where $x \in \{I, X, Y, Z\}^n$ indexes the Pauli observables, and $\mathcal{E}_\rho$ prepares the unknown state $\rho$ then maps $P_x$ to the fixed observable $O$. This is the task we considered in Section IV. Note that average-case prediction of Pauli observables is a much easier task, because most of the $4^n$ expectation values are exponentially small in $n$.

We consider two classes of underlying states $\rho$: (i) Mixed states: $\rho = (I + P)/2^n$, where $P$ is a tensor product of $n$ Pauli operators. States in this class have rank $2^{n-1}$. (ii) Product states: $\rho = \bigotimes_{i=1}^{n} |s_i\rangle\langle s_i|$, where each $|s_i\rangle$ is one of the six possible single-qubit stabilizer states. We consider stabilizer states to ensure that classical simulation of the quantum ML model is tractable for reasonably large system size.

The numerical experiment in Figure 3 implements the best-known ML procedures. We can clearly see that there is an exponential separation between the number of copies of the state $\rho$ required for classical and quantum ML to predict expectation values when $\rho$ is in the class of mixed states. However, for the class of product states, the separation is much less pronounced. Restricted classical ML can only obtain outcomes $o_i \in \{\pm 1\}$ with $\mathbb{E}[o_i] = \mathrm{tr}(P_{x_i} \rho)$. Hence each copy of $\rho$ provides at most one bit of information, and therefore $O(n)$ copies are needed to predict expectation values of all $4^n$ Pauli observables. In contrast, standard classical ML can perform arbitrary POVM measurements on the state $\rho$, so each copy can provide up to $n$ bits of information. The separation between classical ML and quantum ML is marginal for product states.

VI. CONCLUSION AND OUTLOOK

We have studied the task of learning functions of the form given in Equation (1), using as a figure of merit the number of runs of $\mathcal{E}$. Our main result, Theorem 1, shows that, when the objective is achieving a specified average prediction error, a classical ML model can perform as well as a quantum ML model, using a comparable number of runs of $\mathcal{E}$. This result establishes a fundamental limit on quantum advantage in machine learning that holds for any quantum ML model [47, 53, 79].

From a different perspective, Theorem 1 means that the classical ML setting, in which a measurement is performed after each query to $\mathcal{E}$, can be surprisingly effective. The quantum ML setting, in which multiple queries to $\mathcal{E}$ can be included in a single coherent quantum circuit, is far more challenging and may be infeasible until far in the future. Therefore, finding that classical and quantum ML have comparable power (for average-case prediction) boosts our hopes that the combination of classical ML and near-term quantum algorithms [47, 52, 53, 76, 79] may fruitfully address challenging quantum problems in physics, chemistry, and materials science.

On the other hand, Theorems 2 and 3 rigorously establish that quantum ML can have an exponential advantage over classical ML for certain problems where the objective is achieving a specified worst-case prediction error. This exponential advantage of quantum ML over classical ML may be viewed as an exponential separation between coherent measurements (in which a measurement apparatus interacts coherently multiple times with a measured system, storing quantum data which is then processed by a quantum computer) and incoherent measurements (in which a POVM measurement is performed and the outcome recorded after each interaction between system and apparatus, and the classical measurement outcomes are then processed by a classical computer). Such a separation has been challenging to establish because incoherent measurements are difficult to analyze in the adaptive setting, where each measurement performed may depend on the outcomes of all previous measurements. Our proof technique overcomes this challenge, enabling us to identify tasks which allow substantial quantum advantage. An important future direction will be identifying further learning problems which allow substantial quantum advantage, pointing toward potential practical applications of quantum technology.

Acknowledgments:

The authors thank Victor Albert, Sitan Chen, Jerry Li, Seth Lloyd, Jarrod McClean, Spiros Michalakis, Yuan Su, and Thomas Vidick for valuable input and inspiring discussions. We would also like to thank anonymous reviewers for in-depth comments and suggestions. HH is supported by the J. Yang & Family Foundation. JP acknowledges funding from the U.S. Department of Energy Office of Science, Office of Advanced Scientific Computing Research, (DE-NA0003525, DE-SC0020290), and the National Science Foundation (PHY-1733907). The Institute for Quantum Information and Matter is an NSF Physics Frontiers Center.
[1] S. Aaronson. QMA/qpoly ⊆ PSPACE/poly: de-Merlinizing quantum protocols. In CCC, pages 13–pp, 2006.
[2] S. Aaronson. The learnability of quantum states. Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences, 463(2088):3089–3114, 2007.
[3] S. Aaronson. Read the fine print. Nat. Phys., 11(4):291–293, 2015.
[4] S. Aaronson. Shadow tomography of quantum states. In STOC, pages 325–338, 2018.
[5] S. Aaronson and G. N. Rothblum. Gentle measurement of quantum states and differential privacy. In STOC, pages 322–333, 2019.
[6] D. Aharonov, J. Cotler, and X.-L. Qi. Quantum algorithmic measurement. arXiv preprint arXiv:2101.04634, 2021.
[7] N. Alon, S. Ben-David, N. Cesa-Bianchi, and D. Haussler. Scale-sensitive dimensions, uniform convergence, and learnability. Journal of the ACM (JACM), 44(4):615–631, 1997.
[8] A. Ambainis and J. Emerson. Quantum t-designs: t-wise independence in the quantum world. In CCC, pages 129–140, 2007.
[9] H. Araki and E. H. Lieb. Entropy inequalities. In Inequalities, pages 47–57. Springer, 2002.
[10] J. M. Arrazola, A. Delgado, B. R. Bardhan, and S. Lloyd. Quantum-inspired algorithms in practice. arXiv preprint arXiv:1905.10415, 2019.
[11] S. Arunachalam and R. de Wolf. Optimal quantum sample complexity of learning algorithms. arXiv preprint arXiv:1607.00932, 2016.
[12] S. Arunachalam and R. de Wolf. Guest column: A survey of quantum learning theory. ACM SIGACT News, 48(2):41–67, 2017.
[13] S. Arunachalam, A. B. Grilo, and H. Yuen. Quantum statistical query learning. arXiv preprint arXiv:2002.08240, 2020.
[14] P. L. Bartlett and P. M. Long. Prediction, learning, uniform convergence, and scale-sensitive dimensions. Journal of Computer and System Sciences, 56(2):174–190, 1998.
[15] P. L. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. J Mach Learn Res, 3(Nov):463–482, 2002.
[16] R. Beals, H. Buhrman, R. Cleve, M. Mosca, and R. De Wolf. Quantum lower bounds by polynomials. J. ACM, 48(4):778–797, 2001.
[17] A. D. Becke. A new mixing of Hartree–Fock and local density-functional theories. J. Chem. Phys., 98(2):1372–1377, 1993.
[18] I. Bengtsson and K. Życzkowski. Geometry of quantum states: an introduction to quantum entanglement. Cambridge University Press, 2017.
[19] H. Bernien, S. Schwartz, A. Keesling, H. Levine, A. Omran, H. Pichler, S. Choi, A. S. Zibrov, M. Endres, M. Greiner, et al. Probing many-body dynamics on a 51-atom quantum simulator. Nature, 551(7682):579–584, 2017.
[20] J. Biamonte, P. Wittek, N. Pancotti, P. Rebentrost, N. Wiebe, and S. Lloyd. Quantum machine learning. Nature, 549(7671):195–202, 2017.
[21] A. Blumer, A. Ehrenfeucht, D. Haussler, and M. K. Warmuth. Learnability and the Vapnik-Chervonenkis dimension. J. ACM, 36(4):929–965, 1989.
[22] X. Bonet-Monroig, R. Babbush, and T. E. O'Brien. Nearly optimal measurement scheduling for partial tomography of quantum states. Phys. Rev. X, 10(3):031064, 2020.
[23] F. G. Brandão, A. Kalev, T. Li, C. Y.-Y. Lin, K. M. Svore, and X. Wu. Quantum SDP solvers: Large speed-ups, optimality, and applications to quantum learning. In ICALP, 2019.
[24] S. Bubeck, S. Chen, and J. Li. Entanglement is necessary for optimal quantum property testing. arXiv preprint arXiv:2004.07869, 2020.
[25] I. Buluta and F. Nori. Quantum simulators. Science, 326(5949):108–111, 2009.
[26] C. Bădescu and R. O'Donnell. Improved quantum data analysis. arXiv preprint arXiv:2011.10908, 2020.
[27] R. Car and M. Parrinello. Unified approach for molecular dynamics and density-functional theory. Phys. Rev. Lett., 55(22):2471, 1985.
[28] G. Carleo and M. Troyer. Solving the quantum many-body problem with artificial neural networks. Science, 355(6325):602–606, 2017.
[29] M. C. Caro. Binary classification with classical instances and quantum labels. arXiv preprint arXiv:2006.06005, 2020.
[30] J. Carrasquilla and R. G. Melko. Machine learning phases of matter. Nat. Phys., 13(5):431–434, 2017.
[31] K.-M. Chung and H.-H. Lin. Sample efficient algorithms for learning quantum channels in PAC model and the approximate state discrimination problem. arXiv preprint arXiv:1810.10938, 2018.
[32] I. Cong, S. Choi, and M. D. Lukin. Quantum convolutional neural networks. Nat. Phys., 15(12):1273–1278, 2019.
[33] J. Cotler, S. Choi, A. Lukin, H. Gharibyan, T. Grover, M. E. Tai, M. Rispoli, R. Schittko, P. M. Preiss, A. M. Kaufman, et al. Quantum virtual cooling. Phys. Rev. X, 9(3):031013, 2019.
[34] O. Crawford, B. van Straaten, D. Wang, T. Parks, E. Campbell, and S. Brierley. Efficient quantum measurement of Pauli operators in the presence of finite sampling error. arXiv preprint arXiv:1908.06942,
2020. N. C. Rubin, S. Boixo, K. B. Whaley, R. Babbush, [35] V. Dunjko and H. J. Briegel. Machine learning & ar- and J. R. McClean. Virtual distillation for quantum tificial intelligence in the quantum domain: a review error mitigation. arXiv preprint arXiv:2011.07064, of recent progress. Rep. Prog. Phys., 81(7):074001, 2020. 2018. [56] W. J. Huggins, J. McClean, N. Rubin, Z. Jiang, [36] A. Elben, R. Kueng, H.-Y. R. Huang, R. van Bij- N. Wiebe, K. B. Whaley, and R. Babbush. Effi- nen, C. Kokail, M. Dalmonte, P. Calabrese, B. Kraus, cient and noise resilient measurements for quantum J. Preskill, P. Zoller, and B. Vermersch. Mixed-state chemistry on near-term quantum computers. arXiv entanglement from local randomized measurements. preprint arXiv:1907.13117, 2019. Phys. Rev. Lett., 125:200501, Nov 2020. [57] A. F. Izmaylov, T.-C. Yen, R. A. Lang, and [37] T. J. Evans, R. Harper, and S. T. Flammia. Scal- V. Verteletskyi. Unitary partitioning approach to able bayesian hamiltonian learning. arXiv preprint the measurement problem in the variational quan- arXiv:1912.07636, 2019. tum eigensolver method. J. Chem. Theory Comput., [38] E. Farhi and H. Neven. Classification with quan- 16(1):190–195, 2019. tum neural networks on near term processors. arXiv [58] Z. Jiang, A. Kalev, W. Mruczkiewicz, and H. Neven. preprint arXiv:1802.06002, 2018. Optimal fermion-to-qubit mapping via ternary trees [39] S. T. Flammia, D. Gross, Y.-K. Liu, and J. Eisert. with applications to reduced quantum states learning. Quantum tomography via compressed sensing: error Quantum, 4:276, 2020. bounds, sample complexity and efficient estimators. [59] A. Kandala, A. Mezzacapo, K. Temme, M. Takita, New J. Phys., 14(9):095022, 2012. M. Brink, J. M. Chow, and J. M. Gambetta. [40] J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, Hardware-efficient variational quantum eigensolver and G. E. Dahl. Neural message passing for quantum for small molecules and quantum magnets. Nature, chemistry. 
arXiv preprint arXiv:1704.01212, 2017. 549(7671):242–246, 2017. [41] A. Gilyén, S. Lloyd, and E. Tang. Quantum- [60] M. J. Kearns and R. E. Schapire. Efficient inspired low-rank stochastic regression with logarith- distribution-free learning of probabilistic concepts. mic dependence on the dimension. arXiv preprint Journal of Computer and System Sciences, 48(3):464– arXiv:1811.04909, 2018. 497, 1994. [42] D. Gross, S. Nezami, and M. Walter. Schur-Weyl du- [61] C. Kokail, C. Maier, R. van Bijnen, T. Brydges, M. K. ality for the Clifford group with applications: Prop- Joshi, P. Jurcevic, C. A. Muschik, P. Silvi, R. Blatt, erty testing, a robust Hudson theorem, and de Finetti C. F. Roos, et al. Self-verifying variational quantum representations. arXiv preprint arXiv:1712.08628, simulation of lattice models. Nature, 569(7756):355– 2017. 360, 2019. [43] J. Haah, A. W. Harrow, Z. Ji, X. Wu, and [62] R. Kueng. Quantum and classical information pro- N. Yu. Sample-optimal tomography of quantum cessing with tensors (lecture notes), Spring 2019. states. IEEE Trans. Inf. Theory, 63(9):5628–5641, Caltech course notes: https://iqim.caltech.edu/ 2017. classes. [44] I. Hamamura and T. Imamichi. Efficient evaluation of [63] H. Levine, A. Keesling, A. Omran, H. Bernien, quantum observables using entangled measurements. S. Schwartz, A. S. Zibrov, M. Endres, M. Greiner, npj Quantum Inf., 6(1):1–8, 2020. V. Vuletić, and M. D. Lukin. High-fidelity control [45] R. Harper, W. Yu, and S. T. Flammia. Fast es- and entanglement of rydberg-atom qubits. Phys. Rev. timation of sparse quantum noise. arXiv preprint Lett., 121(12):123603, 2018. arXiv:2007.07901, 2020. [64] S. Lloyd, M. Mohseni, and P. Rebentrost. Quantum [46] A. W. Harrow, A. Hassidim, and S. Lloyd. Quantum algorithms for supervised and unsupervised machine algorithm for linear systems of equations. Phys. Rev. learning. arXiv preprint arXiv:1307.0411, 2013. Lett., 103(15):150502, 2009. [65] S. Lloyd, M. Mohseni, and P. Rebentrost. 
Quantum [47] V. Havlíček, A. D. Córcoles, K. Temme, A. W. Har- principal component analysis. Nat. Phys., 10(9):631– row, A. Kandala, J. M. Chow, and J. M. Gambetta. 633, 2014. Supervised learning with quantum-enhanced feature [66] W. Matthews, S. Wehner, and A. Winter. Distin- spaces. Nature, 567(7747):209–212, 2019. guishability of quantum states under restricted fami- [48] C. W. Helstrom. Quantum detection and estimation lies of measurements with an application to quantum theory. J. Statist. Phys., 1:231–252, 1969. data hiding. Comm. Math. Phys., 291(3):813–843, [49] A. S. Holevo. Bounds for the quantity of information 2009. transmitted by a quantum communication channel. [67] A. A. Melnikov, H. P. Nautrup, M. Krenn, V. Dun- Problemy Peredachi Informatsii, 9(3):3–11, 1973. jko, M. Tiersch, A. Zeilinger, and H. J. Briegel. Active [50] A. S. Holevo. Statistical decision theory for quantum learning machine learns to create new quantum exper- systems. J. Multivariate Anal., 3:337–394, 1973. iments. Proc. Natl. Acad. Sci. U.S.A., 115(6):1221– [51] R. Horodecki, P. Horodecki, M. Horodecki, and 1226, 2018. K. Horodecki. Quantum entanglement. Rev. Mod. [68] M. Mohri, A. Rostamizadeh, and A. Talwalkar. Foun- Phys., 81(2):865, 2009. dations of machine learning. The MIT Press, 2018. [52] H.-Y. Huang, K. Bharti, and P. Rebentrost. Near- [69] A. Montanaro. Learning stabilizer states by bell sam- term quantum algorithms for linear systems of equa- pling. arXiv preprint arXiv:1707.04012, 2017. tions. arXiv preprint arXiv:1909.07344, 2019. [70] A. Montanaro and R. de Wolf. A survey of quantum [53] H.-Y. Huang, M. Broughton, M. Mohseni, R. Bab- property testing. Theory Comput., pages 1–81, 2016. bush, S. Boixo, H. Neven, and J. R. McClean. Power [71] M. Y. Niu, A. M. Dai, L. Li, A. Odena, Z. Zhao, of data in quantum machine learning. arXiv preprint V. Smelyanskyi, H. Neven, and S. Boixo. Learnability arXiv:2011.01938, 2020. and complexity of quantum samples. 
arXiv preprint [54] H.-Y. Huang, R. Kueng, and J. Preskill. Predicting arXiv:2010.11983, 2020. many properties of a quantum system from very few [72] M. Paini and A. Kalev. An approximate description measurements. Nat. Phys., 16:1050––1057, 2020. of quantum states. arXiv preprint arXiv:1910.10543, [55] W. J. Huggins, S. McArdle, T. E. O’Brien, J. Lee, 2019. 8
[73] R. G. Parr. Density functional theory of atoms and crete distributions. arXiv preprint arXiv:2007.14451, molecules. In Horizons of quantum chemistry, pages 2020. 5–15. Springer, 1980. [85] E. Tang. Quantum-inspired classical algorithms for [74] A. Peruzzo, J. McClean, P. Shadbolt, M.-H. Yung, principal component analysis and supervised cluster- X.-Q. Zhou, P. J. Love, A. Aspuru-Guzik, and J. L. ing. arXiv preprint arXiv:1811.00414, 2018. O’brien. A variational eigenvalue solver on a photonic [86] E. Tang. A quantum-inspired classical algorithm for quantum processor. Nat. Commun., 5:4213, 2014. recommendation systems. In STOC, pages 217–228, [75] K. Poland, K. Beer, and T. J. Osborne. No free 2019. lunch for quantum machine learning. arXiv preprint [87] L. G. Valiant. A theory of the learnable. Commun. arXiv:2003.14103, 2020. ACM, 27(11):1134–1142, 1984. [76] J. Preskill. Quantum computing in the NISQ era and [88] E. P. Van Nieuwenburg, Y.-H. Liu, and S. D. Huber. beyond. Quantum, 2:79, 2018. Learning phase transitions by confusion. Nat. Phys., [77] P. Rebentrost, M. Mohseni, and S. Lloyd. Quan- 13(5):435–439, 2017. tum support vector machine for big data classifica- [89] V. Vapnik. The nature of statistical learning theory. tion. Phys. Rev. Lett., 113(13):130503, 2014. Springer science & business media, 2013. [78] A. Rocchetto, S. Aaronson, S. Severini, G. Carvacho, [90] R. Vershynin. High-dimensional probability: An D. Poderini, I. Agresti, M. Bentivegna, and F. Sciar- introduction with applications in data science, vol- rino. Experimental learning of quantum states. Sci- ume 47. Cambridge university press, 2018. ence advances, 5(3):eaau1946, 2019. [91] V. Verteletskyi, T.-C. Yen, and A. F. Izmaylov. Mea- [79] M. Schuld and N. Killoran. Quantum machine learn- surement optimization in the variational quantum ing in feature hilbert spaces. Phys. Rev. Lett., eigensolver using a minimum clique cover. J. Chem. 122(4):040504, 2019. Phys., 152(12):124114, 2020. 
[80] R. A. Servedio and S. J. Gortler. Equivalences and [92] S. R. White. Density-matrix algorithms for quantum separations between quantum and classical learnabil- renormalization groups. Phys. Rev. B, 48(14):10345, ity. SIAM J. Comput., 33(5):1067–1092, 2004. 1993. [81] O. Sharir, Y. Levine, N. Wies, G. Carleo, and [93] N. Wiebe, D. Braun, and S. Lloyd. Quan- A. Shashua. Deep autoregressive models for the ef- tum algorithm for data fitting. Phys. Rev. Lett., ficient variational simulation of many-body quantum 109(5):050505, 2012. systems. Phys. Rev. Lett., 124(2):020503, 2020. [94] C. Zhang. An improved lower bound on query com- [82] K. Sharma, M. Cerezo, Z. Holmes, L. Cincio, A. Sorn- plexity for quantum pac learning. Inf. Process. Lett., borger, and P. J. Coles. Reformulation of the no-free- 111(1):40–45, 2010. lunch theorem for entangled data sets. arXiv preprint [95] L. Zhao, C. A. Pérez-Delgado, and J. F. Fitzsi- arXiv:2007.04900, 2020. mons. Fast graph operations in quantum computa- [83] G. Struchalin, Y. A. Zagorovskii, E. Kovlakov, tion. Phys. Rev. A, 93(3):032314, 2016. S. Straupe, and S. Kulik. Experimental estimation [96] Z. Zhou, X. Li, and R. N. Zare. Optimizing chemi- of quantum state properties from classical shadows. cal reactions with deep reinforcement learning. ACS arXiv preprint arXiv:2008.05234, 2020. Cent. Sci., 3(12):1337–1344, 2017. [84] R. Sweke, J.-P. Seifert, D. Hangleiter, and J. Eisert. On the quantum versus classical learnability of dis- 9 Appendix
Roadmap: Appendix A provides additional context and discusses relevant existing work. Details regarding the numerical experiments can be found in Appendix B. The remaining portions are devoted to theory and mathematical proofs. Appendix C provides a thorough treatment of average prediction errors, including bounds on query complexity (the number of times the quantum process $\mathcal{E}$ is accessed), culminating in a proof of Theorem 1. Appendix D provides a stylized example demonstrating that the bound is tight. Finally, Appendix E provides sample complexity upper and lower bounds for predicting many Pauli expectation values with small worst-case error (leading to an exponential separation in query complexity).
Appendix A: Related works
a. Quantum PAC learning: It is instructive to relate our main result on achieving average-case prediction error (Theorem 1) to quantum probably approximately correct (PAC) learning [11–13, 29, 31, 80, 94]. The latter rigorously established the absence of information-theoretic quantum advantage for learning classical Boolean functions $h : \{0,1\}^n \to \{0,1\}$. This is a special case of predicting functions of the form $f(x) = \mathrm{tr}(O\,\mathcal{E}(|x\rangle\langle x|))$. To see this, we reversibly encode every $n$-bit Boolean function $h$ in a (unitary) CPTP map that acts on $(n+1)$ qubits:
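\begin{equation}
\mathcal{E}_h(\rho) = U_h \rho U_h^\dagger, \quad \text{where } U_h |x, a\rangle = |x, h(x) \oplus a\rangle \text{ for all } x \in \{0,1\}^n,\ a \in \{0,1\},
\tag{A1}
\end{equation}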
where $\oplus$ is addition in $\mathbb{Z}_2$. This is the quantum oracle for implementing the Boolean function $h$. Fix $O = I^{\otimes n} \otimes Z = Z_{n+1}$, a Pauli-$Z$ operator acting on the final qubit, to conclude $\mathrm{tr}(O\,\mathcal{E}_h(|x\rangle\langle x|)) = h(x)$. In quantum PAC learning, the quantum ML models (or quantum learners) are restricted to quantum samples of the form $\sum_x \sqrt{\mathcal{D}(x)}\,|x\rangle |h(x)\rangle$, where $\mathcal{D}(x)$ is the probability for sampling $x$ in the input distribution $\mathcal{D}$ and $h(x)$ is the Boolean function in question. Furthermore, quantum PAC learning focuses on the worst-case input distribution $\mathcal{D}^\star$. This leaves open the potential for large quantum advantages in more general settings. For instance, the quantum ML model could be allowed to access the CPTP map $\mathcal{E}_h$, or we may want to consider an input distribution with the largest classical-quantum separation. Theorem 1 settles these open questions by showing that no large quantum advantage in sample complexity (or query complexity) is possible even in the much more general setting of learning $f(x) = \mathrm{tr}(O\,\mathcal{E}(|x\rangle\langle x|))$. Note that direct access to $\mathcal{E}_h$ would also enable the construction of quantum PAC samples (if we assume the quantum ML model depends on the fixed input distribution $\mathcal{D}$). The quantum ML model simply inputs the pure state $\sum_x \sqrt{\mathcal{D}(x)}\,|x\rangle |0\rangle$ to $\mathcal{E}_h$ to obtain
\begin{equation}
U_h \sum_x \sqrt{\mathcal{D}(x)}\,|x\rangle |0\rangle = \sum_x \sqrt{\mathcal{D}(x)}\,|x\rangle |h(x)\rangle,
\tag{A2}
\end{equation}
which is the quantum sample considered in quantum PAC learning. For related works on PAC learning a distribution rather than a function, see [71, 84].

While existing results [11–13, 29, 31, 80, 94] have shown that no large quantum advantage in sample complexity is possible, we can still have substantial quantum speedups in computational complexity. In Ref. [80], for instance, a contrived learning problem is constructed based on factoring. This work showcases the possibility of a quantum advantage in computational complexity, even if no quantum advantage in sample complexity is possible.
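The oracle encoding of Eq. (A1) and the construction of a quantum PAC sample in Eq. (A2) can be checked numerically on a toy instance. The following is a minimal sketch, not part of the original analysis: the index layout (basis state $|x, a\rangle$ stored at position $2x + a$), the function names, and the choice of the two-bit parity function with a uniform input distribution are all assumptions made for illustration.

```python
import math

def apply_oracle(state, h, n):
    """Apply U_h |x, a> = |x, h(x) XOR a> to a state vector of length 2**(n+1).

    Assumed layout: the amplitude of |x, a> is stored at index 2*x + a.
    """
    out = [0.0] * len(state)
    for x in range(2 ** n):
        for a in (0, 1):
            out[2 * x + (h(x) ^ a)] += state[2 * x + a]
    return out

def pac_sample(dist, h, n):
    """Feed sum_x sqrt(D(x)) |x>|0> into U_h, as in Eq. (A2)."""
    state = [0.0] * (2 ** (n + 1))
    for x in range(2 ** n):
        state[2 * x] = math.sqrt(dist[x])  # amplitude on |x, 0>
    return apply_oracle(state, h, n)

# Hypothetical example: parity of two bits under the uniform distribution.
n = 2
parity = lambda x: bin(x).count("1") % 2
sample = pac_sample([0.25] * 4, parity, n)
```

After the oracle is applied, all amplitude sits on the basis states $|x, h(x)\rangle$, which is exactly the form of the quantum PAC sample.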
On the other hand, an exponential separation in sample complexity is possible if one considers worst-case approximation errors.

b. Quantum machine learning: Quantum computers have the potential to improve existing machine learning models based on classical computers. A series of works require that an exponential amount of data is stored in a quantum random access memory [20, 35, 46, 52, 64, 65, 77, 93]. Due to the quantum data structure, one can obtain exponential speed-ups for certain machine learning tasks. However, the act of storing the exponential amount of data will take exponential time. Furthermore, if we assume a similar data structure for classical machines, then the exponential advantage may vanish altogether [3, 10, 41, 85, 86]. Another recent line of works focuses on using quantum computers to represent a set of classically intractable functions [32, 38, 47, 79]. However, the computational power provided by data in a machine learning task can also make a classical ML model stronger [53] and may challenge some of these proposals in practice. When we consider learning an arbitrary unitary, due to the inherent complexity of the unitary group, the sample complexity is going to scale with the Hilbert space dimension (exponential in the system size) [75, 82].

c. Classical learning theory: For a general introduction to classical learning theory, see the comprehensive book on the foundations of machine learning [68]. The classical learning strategy is the same as learning probabilistic real-valued functions. Existing works on learning real-valued functions [7, 14], or probabilistic Boolean functions [60], mainly focus on the worst-case input distribution that maximizes sample complexity. Geometric quantities, such as the fat-shattering dimension [7], then provide a characterization of the sample complexity [7]. For example, [14] proves that a (worst-case) sample complexity upper bound is given by $\tilde{\mathcal{O}}(\mathrm{fat}(\epsilon/5)/\epsilon^2)$, where $\tilde{\mathcal{O}}(\cdot)$ suppresses logarithmic factors.
In contrast, our proof of Theorem 1 produces a sample complexity upper bound $\mathcal{O}(\log(|M^{p}_{4\epsilon}(\mathcal{F}_f)|)/\epsilon)$ that scales with $1/\epsilon$ instead of $1/\epsilon^2$. The geometric constant $\log(|M^{p}_{4\epsilon}(\mathcal{F}_f)|)$ is also different (it measures the cardinality of a maximal packing net that depends on the input distribution).

d. Shadow tomography: Shadow tomography is the task of simultaneously estimating the outcome probabilities associated with $M$ two-outcome measurements up to accuracy $\epsilon$: $p_i(\rho) = \mathrm{tr}(E_i \rho)$, where each $E_i$ is a positive semi-definite matrix with operator norm at most one [4, 5, 23, 26]. The best existing result is given by [26]. They showed that $N = \mathcal{O}(\log^2(M)\log(d)/\epsilon^4)$ copies of the unknown state suffice to achieve this task. Their protocol is based on an improved quantum threshold search: finding an observable $E_i$ with the expectation value $\mathrm{tr}(E_i \rho)$ exceeding a certain threshold. Then, they combine this with online learning of quantum states following Aaronson’s original protocol [4]. A more experimentally friendly shadow tomography protocol has been proposed in [54]. It only yields competitive scaling complexities for a restricted set of observables, but can be implemented on state-of-the-art quantum platforms [36, 83].

e. PAC learning quantum systems: A precursor to shadow tomography is PAC learning of quantum states [2, 78], where the goal is to accurately predict expectation values of different observables in an unknown quantum system $\rho$ up to a small average error. This fits nicely into the scope of Theorem 1: The optimal fully quantum ML model that can perform quantum data analysis on many copies stored in the quantum memory will not yield a large advantage in sample complexity over ML models that make predictions based solely on classical measurement data from randomized measurements on single copies of $\rho$. This separation between learning from classical measurement data and learning from the quantum states coherently has not been discussed in [2]. The result [2] could be seen as establishing an upper bound on the maximal packing net for the set of CPTP maps $\mathcal{F}$.
This then translates into a sample complexity upper bound needed to achieve good prediction performance.

f. Measuring expectation values of Pauli observables: Due to the importance of measuring Pauli observables in near-term applications of quantum computers, a series of methods [22, 34, 44, 56–58, 91] have been proposed to reduce the number of measurements needed to estimate Pauli expectation values. All of them are based on one basic, yet powerful, observation: Commuting observables can be measured simultaneously. For example, quantum chemistry applications are often contingent on measuring $\mathcal{O}(n^4)$ Pauli observables [74]. The aforementioned Pauli estimation protocols group these observables into $\mathcal{O}(n^3)$ or even only $\mathcal{O}(n)$ commuting groups. In turn, $\mathcal{O}(n^3)$ or $\mathcal{O}(n)$ copies of the underlying state suffice to obtain expectation values for all $\mathcal{O}(n^4)$ relevant Pauli observables by exploiting the ability to simultaneously measure commuting observables. On the other hand, the technique proposed in Appendix E gets by with even fewer state preparations. A total of $\mathcal{O}(\log(n^4)) = \mathcal{O}(\log(n))$ copies suffice. Restriction to few-body Pauli observables can yield additional improvements. Several protocols are known for this special case, see e.g. [22, 33, 37, 54, 58, 72].

g. Incoherent versus coherent measurements: The exponential advantage of quantum ML over classical ML established by our work may be viewed as an exponential separation between coherent measurements (in which a measurement apparatus interacts coherently multiple times with a measured system, storing quantum data which is then processed by a quantum computer) and incoherent measurements (in which a POVM measurement is performed and the outcome recorded after each interaction between system and apparatus, and the classical measurement outcomes are then processed by a classical computer).
Existing work has shown an advantage of coherent measurements over independent incoherent measurements, a special case in which the POVM measurements do not depend on the results of previous measurements; see e.g., [43] on quantum state tomography and [54] on shadow tomography. However, few prior results limit the power of incoherent measurements in the adaptive setting, in which each measurement performed may depend on the outcomes of all the previous measurements. A prior result we were aware of before obtaining our result was [24], showing a mild polynomial advantage of coherent over incoherent measurements for the task of distinguishing whether an unknown quantum state is close to the completely mixed state or not. Our work established an exponential separation between incoherent and coherent measurements in sample complexity (the number of copies of a quantum state needed to perform the task) for shadow tomography. After our work was complete, [6] also presented a detailed analysis of the power of coherent and incoherent measurements, finding an exponential separation between incoherent and coherent measurements for the task of distinguishing between different types of quantum channels.
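As a small illustration of the observation in paragraph f that commuting Pauli observables can be measured simultaneously, the sketch below tests commutation of Pauli strings and sorts them into commuting groups. The greedy first-fit grouping is a simple heuristic chosen here for brevity; it is not the specific protocol of any of the cited works, and the function names and the example list are hypothetical.

```python
def commute(p, q):
    """Two Pauli strings commute iff they clash (different non-identity
    letters at the same position) on an even number of positions."""
    clashes = sum(1 for a, b in zip(p, q) if a != "I" and b != "I" and a != b)
    return clashes % 2 == 0

def greedy_commuting_groups(paulis):
    """First-fit grouping: put each string into the first group whose
    members all commute with it, or open a new group."""
    groups = []
    for p in paulis:
        for g in groups:
            if all(commute(p, q) for q in g):
                g.append(p)
                break
        else:
            groups.append([p])
    return groups

# Hypothetical two-qubit example.
groups = greedy_commuting_groups(["XX", "YY", "ZZ", "XI", "IZ", "ZI"])
```

Each resulting group can in principle be estimated from the same measurement setting, so the number of groups, rather than the number of observables, governs the number of distinct settings in this kind of scheme.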
Appendix B: Details of numerical experiments
1. Mixed states
For the case of mixed states given by $\rho = (I + P_{x^*})/2^n$, we have $f(x) = \mathrm{tr}(P_x \rho) = \delta_{x,x^*}$ for all $x \in \{I,X,Y,Z\}^n$. That is, $f(x)$ is a point function: $f(x^*) = 1$ for exactly one $x^*$, and all other Pauli strings evaluate to zero.

a. Restricted classical ML: We consider a restricted classical ML model that implements an exhaustive search over all Pauli observables $P_x$ with $x \in \{I,X,Y,Z\}^n$. For each $x$, we repeatedly measure the observable $P_x$ and check whether the outcome $-1$ ever occurs. If this is the case, we know $f(x) = 0$ with certainty (recall that $f(x) \in \{0,1\}$ is a point function). After looping through all observables, we will be left with a single input $x^*$ that obeys $f(x^*) = 1$. There are a total of $4^n$ inputs and, whenever $x \neq x^*$, the expected number of measurements required to obtain the outcome $-1$ is $1/\Pr_x[-1] = 2$. In contrast, for $x = x^*$, the outcome $-1$ can never occur. This results in a sample complexity of $\mathcal{O}(4^n)$, a scaling that is confirmed by our numerical experiments and matches the lower bound $\Omega(4^n)$.

b. Classical ML: For classical ML, we implement property prediction with classical shadows based on random Clifford measurements [54]. The sample complexity to predict $M$ Pauli observables up to small constant accuracy is known to be $\mathcal{O}(\max_x \mathrm{tr}(P_x^2)\log(M))$. Using $\mathrm{tr}(P_x^2) = \mathrm{tr}(I^{\otimes n}) = 2^n$ and $M = 4^n$, this upper bound simplifies to $\mathcal{O}(n 2^n)$. The numerical experiments confirm this theoretical prediction.

c. Quantum ML: For quantum ML, we use the procedure introduced in Appendix E 2. We first perform several repetitions of two-copy Bell basis measurements to estimate absolute values $|\mathrm{tr}(P_x \rho)|^2$ for all $4^n$ possible inputs $x$. This allows us to immediately identify $x^*$. One could solve for $x^*$ by performing Gaussian elimination in $GF(2)$. A similar strategy has also been used for learning quantum channels [45]. The required sample complexity is $\mathcal{O}(\log(4^n)) = \mathcal{O}(n)$ and the numerical experiments confirm this linear scaling.
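The point-function structure of $f(x) = \mathrm{tr}(P_x \rho)$ for $\rho = (I + P_{x^*})/2^n$ can be verified directly for a small system. The sketch below is an illustrative check under stated assumptions, not the code used for the numerical experiments: it builds the matrices explicitly with a hand-written Kronecker product, and the helper names and the target string "XZ" are hypothetical. Note that the all-identity string trivially returns $\mathrm{tr}(\rho) = 1$, so it is best read separately from the non-identity strings.

```python
from itertools import product

PAULI = {
    "I": [[1, 0], [0, 1]],
    "X": [[0, 1], [1, 0]],
    "Y": [[0, -1j], [1j, 0]],
    "Z": [[1, 0], [0, -1]],
}

def kron(a, b):
    """Kronecker product of two matrices stored as nested lists."""
    return [[x * y for x in ra for y in rb] for ra in a for rb in b]

def pauli_string(label):
    m = [[1]]
    for ch in label:
        m = kron(m, PAULI[ch])
    return m

def trace_of_product(a, b):
    n = len(a)
    return sum(a[i][k] * b[k][i] for i in range(n) for k in range(n))

def point_function_table(target):
    """Return {x: tr(P_x rho)} for rho = (I + P_target) / 2^n."""
    n = len(target)
    dim = 2 ** n
    p_star = pauli_string(target)
    rho = [[((1 if i == j else 0) + p_star[i][j]) / dim for j in range(dim)]
           for i in range(dim)]
    table = {}
    for letters in product("IXYZ", repeat=n):
        label = "".join(letters)
        table[label] = trace_of_product(pauli_string(label), rho).real
    return table

table = point_function_table("XZ")  # hypothetical hidden string x* = XZ
```

For $n = 2$ the table has 16 entries; the entry for the hidden string equals one and every other non-identity entry vanishes, which is the structure the three learning strategies above exploit.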
2. Product states
We consider the unknown $n$-qubit state to be a tensor product of $n$ single-qubit stabilizer states (there are only six choices: $|0\rangle, |1\rangle, |+\rangle, |-\rangle, |y,+\rangle, |y,-\rangle$). We only need to obtain measurement outcomes for single-qubit Pauli observables to completely determine such product states.

a. Restricted classical ML: Recall that the goal of this task is to predict $f(x) = \mathrm{tr}(P_x \rho)$ for an unknown quantum state $\rho$. The restricted classical ML can collect measurement data of the form $\{(x_i, o_i)\}$, where $x_i \in \{I,X,Y,Z\}^n$ and $o_i \in \{\pm 1\}$ is the measurement outcome when we measure $P_{x_i}$ on $\rho$. We simply collect measurement outcomes for the $3n$ single-qubit Pauli observables by choosing the appropriate $x_i$. For each qubit, we sample $\pm 1$-outcomes in the $X$-, $Y$- and $Z$-basis. Once we find that two of the three Pauli observables have each produced both outcomes $\pm 1$, we can determine the single-qubit state for the particular qubit. For two of the three bases, the underlying distribution is uniform, while deterministic outcomes are produced in the third basis. In turn, we expect to require $6n$ Pauli measurements to unambiguously characterize the underlying stabilizer state. This scaling is confirmed by the numerical experiments.

b. Classical ML: We associate classical ML models with reconstruction procedures that can perform arbitrary POVM measurements on individual copies of the unknown state $\rho$. Here, we consider a sequence of POVMs where we measure first in the all-$X$ basis, second in the all-$Y$ basis, and then in the all-$Z$ basis (and repeat). Similar to the restricted classical ML, we also check if two of the three possible single-qubit observables $X, Y, Z$ have each resulted in both outcomes $\pm 1$. Once two of the three single-qubit observables have produced both outcomes $\pm 1$, we can perfectly identify the corresponding single-qubit stabilizer state. But, to complete the assignment, we need to be able to do so for all $n$ qubits.
This incurs an additional logarithmic factor in the expected number of measurements. We expect to require $\mathcal{O}(\log(n))$ measurement repetitions to unambiguously determine the underlying product state. Numerical experiments confirm this scaling behavior.

c. Quantum ML: Quantum ML models can perform quantum data processing on multiple copies of quantum states. For product states, we use the following procedure. In each repetition, we perform two-copy Bell basis measurements on $\rho \otimes \rho$ to simultaneously measure $X \otimes X$, $Y \otimes Y$, $Z \otimes Z$ on each of the $n$ qubits in $\rho$. We can determine the eigenbasis ($X$-basis, $Y$-basis, or $Z$-basis) of each qubit once we find that two of the observables $X \otimes X$, $Y \otimes Y$, $Z \otimes Z$ have each resulted in both $\pm 1$ outcomes. For two of the three observables, the underlying distribution is uniform in $\pm 1$, while deterministic outcomes are produced in the third observable. Since we know the stabilizer basis of each qubit after a few two-copy measurements, we simply measure the $n$ qubits in the corresponding stabilizer bases on the final copy to recover the full state. We refer to [42, 69, 70, 95] for results on efficient procedures for testing and learning stabilizer states in general.
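The logarithmic growth in the number of repetitions for the classical ML strategy can be reproduced with a short Monte Carlo sketch. This is an illustrative model under simplifying assumptions and not the code behind the reported experiments: only the two uniformly random bases of each qubit are simulated (the deterministic basis never affects the stopping rule), one round stands for one pass through the all-$X$, all-$Y$ and all-$Z$ measurements, and the function names, trial count and seed are hypothetical.

```python
import random

def rounds_until_identified(n, rng):
    """Rounds of (all-X, all-Y, all-Z) measurements until every qubit is identified.

    Each qubit has two bases with uniformly random +-1 outcomes; it is
    identified once both of these bases have shown both outcomes.
    """
    # seen[q][b] is the set of outcomes observed so far in random basis b of qubit q.
    seen = [[set(), set()] for _ in range(n)]
    rounds = 0
    while any(len(s) < 2 for qubit in seen for s in qubit):
        rounds += 1
        for qubit in seen:
            for s in qubit:
                s.add(rng.choice((-1, 1)))
    return rounds

def mean_rounds(n, trials=200, seed=0):
    """Average stopping time over independent trials."""
    rng = random.Random(seed)
    return sum(rounds_until_identified(n, rng) for _ in range(trials)) / trials
```

Calling `mean_rounds(n)` for increasing `n` (for instance 1, 16, 256) gives averages that grow slowly with `n`, in line with the maximum of $2n$ independent geometric-type waiting times scaling like $\log(n)$.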
Appendix C: Proof of Theorem 1
This section contains a thorough treatment of average prediction errors. We consider related setups for the classical and quantum learning settings. The learning problem is defined by a set of CPTP maps $\mathcal{F}$, an input distribution $\mathcal{D}$, and an observable $O$ with $\|O\| \le 1$. Each CPTP map $\mathcal{E} \in \mathcal{F}$ maps an $n$-qubit quantum state to an $m$-qubit state. This collection defines a function
\begin{equation}
f_{\mathcal{E}}(x) = \mathrm{tr}(O\,\mathcal{E}(|x\rangle\langle x|)) : \{0,1\}^n \to \mathbb{R}.
\tag{C1}
\end{equation}
The goal is to learn a function $f : \{0,1\}^n \to \mathbb{R}$ such that, with high probability,
\begin{equation}
\mathbb{E}_{x \sim \mathcal{D}} |f(x) - f_{\mathcal{E}}(x)|^2 \text{ is small.}
\tag{C2}
\end{equation}
A bit of additional context is appropriate here: we are studying the existence of learning algorithms under a fixed learning problem defined by input distribution $\mathcal{D}$, observable $O$, and set of CPTP maps $\mathcal{F}$. In turn, the actual learning algorithms may and, in general, will depend on these mathematical objects.
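To make the objects in Eqs. (C1) and (C2) concrete, the following sketch evaluates the target function for a toy unitary channel and computes the average squared prediction error of a candidate predictor. It is a minimal illustration under stated assumptions: the channel is a single-qubit $Y$-rotation, the observable is restricted to be diagonal in the computational basis (here $O = Z$), and all names and numerical values are hypothetical.

```python
import math

def ry(theta):
    """Single-qubit rotation about the Y axis (a real 2x2 unitary)."""
    c, s = math.cos(theta / 2), math.sin(theta / 2)
    return [[c, -s], [s, c]]

def target_function(unitary, diag_observable):
    """f(x) = tr(O U |x><x| U^dagger) for an observable O that is diagonal
    in the computational basis, given as its list of diagonal entries."""
    dim = len(unitary)
    return [
        sum(diag_observable[y] * abs(unitary[y][x]) ** 2 for y in range(dim))
        for x in range(dim)
    ]

def average_squared_error(f, g, dist):
    """E_{x ~ D} |f(x) - g(x)|^2 for functions stored as lists of values."""
    return sum(p * (a - b) ** 2 for p, a, b in zip(dist, f, g))

f_true = target_function(ry(math.pi / 3), [1, -1])  # O = Z on one qubit
guess = [0.0, 0.0]                                  # a trivial predictor
err = average_squared_error(guess, f_true, [0.5, 0.5])
```

Here the target function takes the values $\pm\cos(\pi/3) = \pm 1/2$, so the trivial predictor incurs an average squared error of $1/4$ under the uniform input distribution.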
One of our main technical contributions – Theorem 1 – highlights that a substantial quantum advantage is impossible for this setting (small average-case prediction error). This is in stark contrast to the setting of achieving small worst-case prediction error. The proof consists of two parts. Section C 1 establishes a lower bound for the query complexity of any quantum ML model. Subsequently, Section C 2 provides an upper bound for the query complexity achieved by certain classical ML models. Finally, a combination of these two results establishes Theorem 1, see Section C 3.
1. Information-theoretic lower bound for quantum machine learning models
The quantum machine learning model consists of a learning phase and a prediction phase. In the learning phase, the quantum ML model accesses the quantum experiment characterized by the CPTP map $\mathcal{E}$ a total of $N_Q$ times to learn a model. We consider the quantum ML model to be a mixed-state quantum computation algorithm (a generalization of unitary quantum computation). The starting point is an initial state $\rho_0$ on any number of qubits. Subsequently, arbitrary quantum operations $\mathcal{C}_t$ (CPTP maps) are interleaved with a total of $N_Q$ invocations $\mathcal{E} \otimes \mathcal{I}$ of the unknown (black box) CPTP map and produce a final state
\begin{equation}
\rho_{\mathcal{E}} = \mathcal{C}_{N_Q} (\mathcal{E} \otimes \mathcal{I}) \mathcal{C}_{N_Q - 1} \cdots \mathcal{C}_1 (\mathcal{E} \otimes \mathcal{I})(\rho_0).
\tag{C3}
\end{equation}
In this model, we can assume without loss of generality that $\mathcal{E}$ always acts on the first $n$ qubits, because the quantum operations $\mathcal{C}_t$ are unrestricted. In particular, they could contain SWAP operations that permute the qubits. The final state $\rho_{\mathcal{E}}$ is the quantum memory that stores the prediction model learned from the CPTP map $\mathcal{E}$ using the quantum ML algorithm. Obtaining $\rho_{\mathcal{E}}$ concludes the quantum learning phase. In the prediction phase, we assume that new inputs are provided as part of a sequence
\begin{equation}
x_1, x_2, x_3, \ldots \in \{0,1\}^n.
\tag{C4}
\end{equation}
For each sequence member $x_i$, the quantum ML model accesses the input $x_i$, as well as the current quantum memory. It produces an outcome by performing a POVM measurement on the quantum memory $\rho_{\mathcal{E}}$. We emphasize that this can, and in general will, affect the quantum memory nontrivially. The quantum ML outputs $h_Q(x_i)$ depend on the entire sequence $x_1, \ldots, x_i$, and different sequence orderings will produce different predictions. For example, when $n = 2$, the following ordering may result in the prediction
\begin{equation}
x_1 = 00,\ x_2 = 01,\ x_3 = 10,\ x_4 = 11 \;\to\; h_Q(00) = -0.3,\ h_Q(01) = 0.5,\ h_Q(10) = 0.2,\ h_Q(11) = -0.7,
\tag{C5}
\end{equation}
but a different ordering may result in a slightly different prediction, such as
\begin{equation}
x_1 = 11,\ x_2 = 01,\ x_3 = 10,\ x_4 = 00 \;\to\; h_Q(00) = -0.2,\ h_Q(01) = 0.5,\ h_Q(10) = 0.2,\ h_Q(11) = -0.6.
\tag{C6}
\end{equation}
Also, note that $h_Q(x_i)$ can be randomized because a quantum measurement is performed to produce the prediction outcome. The ordering does not affect the theorem we want to prove. In the following, we will fix the input ordering to be an arbitrary ordering. For example, we can use the input ordering such that the quantum ML model has the smallest prediction error. After fixing an input ordering, we can treat the entire prediction phase (taking a sequence of inputs $x_1, x_2, \ldots$ and producing $h_Q(x_1), h_Q(x_2), \ldots$) as an enormous POVM measurement on the output state $\rho_{\mathcal{E}}$ obtained from the learning phase. Each outcome $a$ from the enormous POVM measurement on the output state $\rho_{\mathcal{E}}$ corresponds to a function $h_{Q,a}(x) : \{0,1\}^n \to \mathbb{R}$. By Naimark’s dilation theorem, every POVM measurement is a projective measurement on a larger Hilbert space. Since the quantum memory that the quantum ML model can operate on contains an arbitrary number of qubits, we can use Naimark’s dilation theorem to restrict the enormous POVM measurement to a projective measurement $\{P_a\}_a$. Hence, for any CPTP map $\mathcal{E} \in \mathcal{F}$, when we ask the quantum ML model to produce the prediction for an ordering of inputs $x_1, x_2, \ldots$, the output values $h_Q(x_1), h_Q(x_2), \ldots$ will be given by
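\begin{equation}
h_{Q,a}(x) \text{ with probability } \mathrm{tr}(P_a \rho_{\mathcal{E}}) = \mathrm{tr}\big(P_a\, \mathcal{C}_{N_Q} (\mathcal{E} \otimes \mathcal{I}) \mathcal{C}_{N_Q - 1} \cdots \mathcal{C}_2 (\mathcal{E} \otimes \mathcal{I}) \mathcal{C}_1 (\mathcal{E} \otimes \mathcal{I}) \mathcal{C}_0 (\rho_0)\big),
\tag{C7}
\end{equation}
for a projective measurement $\{P_a\}_a$ with $\sum_a P_a = I$. Finally, we will assume that the produced function $h_Q(x)$ achieves small prediction error
\begin{equation}
\mathbb{E}_{x \sim \mathcal{D}} |h_Q(x) - \mathrm{tr}(O\,\mathcal{E}(|x\rangle\langle x|))|^2 \le \epsilon \quad \text{with probability at least } 2/3,
\tag{C8}
\end{equation}
for any CPTP map $\mathcal{E} \in \mathcal{F}$. This assumption asserts that
\begin{equation}
\sum_{a=1}^{A} \mathrm{tr}(P_a \rho_{\mathcal{E}})\, \mathbb{1}\Big[ \mathbb{E}_{x \sim \mathcal{D}} |h_{Q,a}(x) - \mathrm{tr}(O\,\mathcal{E}(|x\rangle\langle x|))|^2 \le \epsilon \Big] \ge 2/3,
\tag{C9}
\end{equation}
where $\mathbb{1}[z]$ denotes the indicator function of event $z$. That is, $\mathbb{1}[z] = 1$ if $z$ is true and $\mathbb{1}[z] = 0$ otherwise.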
a. Maximal packing net
We emphasize that Rel. (C9) must be valid for any $\mathcal{E} \in \mathcal{F}$. Because we only need to output a function $h_Q(x)$ that approximates $f(x) = \mathrm{tr}(O\,\mathcal{E}(|x\rangle\langle x|))$ on average, the task will not be hard when there are only a few qualitatively different CPTP maps in $\mathcal{F}$. However, the problem could become harder when $\mathcal{F}$ contains a large number of very different CPTP maps. The task is now to transform this requirement into a stringent lower bound on $N_Q$, the number of black-box uses of the unknown CPTP map $\mathcal{E} \otimes \mathcal{I}$ within the quantum computation (C3). As a starting point, we equip the set of target functions $\mathcal{F}_f = \{f_{\mathcal{E}}(x) = \mathrm{tr}(O\,\mathcal{E}(|x\rangle\langle x|)) \mid \mathcal{E} \in \mathcal{F}\}$ with a packing net. Packing nets are discrete subsets whose elements are guaranteed to have a certain minimal pairwise distance (think of spheres that must not overlap with each other). We choose points (functions) $f_{\mathcal{E}_i} \in \mathcal{F}_f$ and demand
\begin{equation}
\mathbb{E}_{x \sim \mathcal{D}} |f_{\mathcal{E}_i}(x) - f_{\mathcal{E}_j}(x)|^2 > 4\epsilon \quad \text{whenever } i \neq j.
\tag{C10}
\end{equation}
We denote the resulting packing net of $\mathcal{F}_f$ by $M^{p}_{4\epsilon}(\mathcal{F}_f)$ and note that every such set has finitely many elements ($\mathcal{F}_f$ is a compact set). We also assume that $M^{p}_{4\epsilon}(\mathcal{F}_f)$ is maximal in the sense that no other $4\epsilon$-packing net can contain more points (functions). It is possible to utilize packing nets to derive a query complexity lower bound for the quantum machine learning model. In fact, we will present two different proof strategies. The first proof is inspired by [39, 43, 54] and analyzes a communication protocol. The second proof uses a proof technique that depends on an analysis of polynomials similar to [16]. While it is somewhat weaker than the information-theoretic bound in the first proof, we include the derivation for completeness as we believe that it may be insightful for the interested reader.
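The packing condition (C10) is easy to explore on a finite family of functions. The sketch below builds a $4\epsilon$-packing greedily: it keeps a function only if it is more than $4\epsilon$ away from everything kept so far. This is a hedged illustration rather than the construction used in the proof; a greedy pass yields a packing that cannot be extended, which need not be a packing of largest cardinality as assumed in the text, and its result can depend on the order of the candidates. The function names, the toy family and the value of $\epsilon$ are hypothetical.

```python
def sq_distance(f, g, dist):
    """E_{x ~ D} |f(x) - g(x)|^2 for functions stored as lists of values."""
    return sum(p * (a - b) ** 2 for p, a, b in zip(dist, f, g))

def greedy_packing(functions, dist, eps):
    """Keep a function only if it is more than 4*eps away from all kept ones."""
    net = []
    for f in functions:
        if all(sq_distance(f, g, dist) > 4 * eps for g in net):
            net.append(f)
    return net

# Hypothetical family: four point functions on four inputs, then the zero function.
dist = [0.25] * 4
family = [[1 if x == k else 0 for x in range(4)] for k in range(4)] + [[0, 0, 0, 0]]
net = greedy_packing(family, dist, eps=0.1)
```

With $\epsilon = 0.1$ the four point functions are pairwise at distance $0.5 > 4\epsilon$ and are all kept, while the zero function is at distance $0.25 \le 4\epsilon$ from each of them and is rejected.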
b. Proof strategy I: mutual information analysis
Let us define a communication protocol between two parties, say Alice and Bob. They use the packing net $M^p_{4\epsilon}(\mathcal{F}_f)$ as a dictionary to communicate randomly selected classical messages. More precisely, Alice samples an integer $X$ uniformly at random from $\{1, 2, \ldots, |M^p_{4\epsilon}(\mathcal{F}_f)|\}$ and chooses the corresponding CPTP map $\mathcal{E}_X$, whose function $f_{\mathcal{E}_X}$ is the $X$-th element of $M^p_{4\epsilon}(\mathcal{F}_f)$. Whenever Bob wants to access the unknown CPTP map $\mathcal{E}_X$, he asks Alice to apply it. Bob then executes the quantum machine learning model (C3) to obtain a prediction model $h_{Q,a}(x)$, where $a$ parameterizes the prediction model. Subsequently, Bob solves the optimization problem
\begin{equation}
\tilde{X} = \arg\min_{X' = 1, \ldots, |M^p_{4\epsilon}(\mathcal{F}_f)|} \mathbb{E}_{x \sim \mathcal{D}} \left| h_{Q,a}(x) - \mathrm{tr}\left(O \mathcal{E}_{X'}(|x\rangle\langle x|)\right) \right|^2 \tag{C11}
\end{equation}
to obtain an integer $\tilde{X}$. This decoding procedure seems adequate, provided that the prediction model $h_{Q,a}$ approximately reproduces the true underlying function. More precisely, assumption (C8) asserts
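\begin{equation}
\mathbb{E}_{x \sim \mathcal{D}} \left| h_{Q,a}(x) - \mathrm{tr}\left(O \mathcal{E}_X(|x\rangle\langle x|)\right) \right|^2 \leq \epsilon \quad \text{with probability at least } 2/3. \tag{C12}
\end{equation}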
Here is where the choice of dictionary matters: $M^p_{4\epsilon}(\mathcal{F}_f)$ is a packing net, see Equation (C10). For $X' \neq X$, and whenever the bound in (C12) holds, the reverse triangle inequality necessarily implies
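\begin{align}
\mathbb{E}_{x \sim \mathcal{D}} \left| h_{Q,a}(x) - f_{\mathcal{E}_{X'}}(x) \right|^2 &\geq \left( \left( \mathbb{E}_{x \sim \mathcal{D}} \left| f_{\mathcal{E}_X}(x) - f_{\mathcal{E}_{X'}}(x) \right|^2 \right)^{1/2} - \left( \mathbb{E}_{x \sim \mathcal{D}} \left| h_{Q,a}(x) - f_{\mathcal{E}_X}(x) \right|^2 \right)^{1/2} \right)^2 \tag{C13} \\
&> \left( 2\sqrt{\epsilon} - \sqrt{\epsilon} \right)^2 = \epsilon. \tag{C14}
\end{align}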
This allows us to conclude that Bob’s decoding strategy (C11) succeeds perfectly if
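\begin{equation}
\mathbb{E}_{x \sim \mathcal{D}} \left| h_{Q,a}(x) - \mathrm{tr}\left(O \mathcal{E}_X(|x\rangle\langle x|)\right) \right|^2 \leq \epsilon. \tag{C15}
\end{equation}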
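The decoding step (C11) and the separation argument (C13)-(C14) can be illustrated with a small numerical sketch. It assumes, purely for illustration, that the dictionary functions are tabulated on a finite grid with a weight vector standing in for $\mathcal{D}$; the helper name `decode` and the toy dictionary are hypothetical and are not taken from the paper.

```python
import numpy as np

def decode(h, net_functions, weights):
    """Bob's decoder: index of the dictionary function closest to h in squared L2(D) distance."""
    errors = [np.sum(weights * np.abs(h - f) ** 2) for f in net_functions]
    return int(np.argmin(errors))

# Hypothetical dictionary: constant functions 0.0, 0.2, ..., 1.0 on 8 grid points.
# Pairwise squared distances are at least 0.04, which exceeds 4*eps = 0.02.
grid_size = 8
weights = np.full(grid_size, 1.0 / grid_size)
eps = 0.005
net_functions = np.outer(0.2 * np.arange(6), np.ones(grid_size))

# A prediction model within squared error eps of the true function, as in (C15):
# shift the true function by 0.9*sqrt(eps), giving squared error 0.81*eps.
shift = 0.9 * np.sqrt(eps)
decoded = []
for true_index, f_true in enumerate(net_functions):
    h = f_true + shift
    decoded.append(decode(h, net_functions, weights))

print("decoded indices:", decoded)
```

Because every wrong dictionary entry is more than $\epsilon$ away from such an $h$ while the correct entry is within $\epsilon$, the minimizer is the transmitted index in each case.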
In turn, assumption (C8) ensures $\tilde{X} = X$ (perfect decoding) with probability at least 2/3.
Now, we use the fact that Alice samples her message $X$ uniformly at random from a total of $|M^p_{4\epsilon}(\mathcal{F}_f)|$ integers. Because $\tilde{X} = X$ with probability at least 2/3, Fano's inequality implies that
\begin{equation}
H(X|\tilde{X}) \leq H(1/3) + \log(|M^p_{4\epsilon}(\mathcal{F}_f)|)/3, \tag{C16}
\end{equation}
where H(x) = −x log x − (1 − x) log(1 − x) is the binary entropy. This gives a lower bound on the mutual information between sent and decoded message, namely
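\begin{equation}
I(X : \tilde{X}) = H(X) - H(X|\tilde{X}) \geq \frac{2}{3} \log(|M^p_{4\epsilon}(\mathcal{F}_f)|) - H(1/3) = \Omega\left( \log(|M^p_{4\epsilon}(\mathcal{F}_f)|) \right). \tag{C17}
\end{equation}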
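As a quick numerical sanity check of (C16) and (C17), the sketch below evaluates the right-hand side of (C17) in nats for a few dictionary sizes. The function names are hypothetical, and the `error_prob` parameter is an added convenience (assumed to be at most 1/2, where the binary entropy is increasing); the text only uses the value 1/3.

```python
import math

def binary_entropy(p):
    """Binary entropy H(p) = -p log p - (1 - p) log(1 - p), in nats."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log(p) - (1.0 - p) * math.log(1.0 - p)

def mutual_information_lower_bound(net_size, error_prob=1.0 / 3.0):
    """Right-hand side of (C17): (1 - error_prob) * log(net_size) - H(error_prob).

    Assumption: error_prob <= 1/2; the default 1/3 matches the text.
    """
    return (1.0 - error_prob) * math.log(net_size) - binary_entropy(error_prob)

for net_size in (2, 2 ** 10, 2 ** 20):
    bound = mutual_information_lower_bound(net_size)
    print(f"|M| = {net_size:>8d}: I(X : X~) >= {bound:.3f} nats")
```

For very small dictionaries the bound is negative and therefore vacuous; it becomes informative once $\log |M^p_{4\epsilon}(\mathcal{F}_f)|$ dominates the constant $H(1/3)$.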
Next, note that $\tilde{X}$ is obtained by classically processing a measurement outcome $a$ of the quantum state $\rho_{\mathcal{E}_X}$. The data processing inequality and Holevo's theorem [9, 18, 49, 51] then imply
\begin{equation}
I(X : \tilde{X}) \leq I(X : a) \leq \chi(X : \rho_{\mathcal{E}_X}). \tag{C18}
\end{equation}
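The Holevo $\chi$ quantity between the classical random variable $X$ and the quantum state $\rho_{\mathcal{E}_X}$ is
\begin{equation}
\chi(X : \rho_{\mathcal{E}_X}) = S\left( \mathbb{E}_X \, \rho_{\mathcal{E}_X} \right) - \mathbb{E}_X \, S\left( \rho_{\mathcal{E}_X} \right), \tag{C19}
\end{equation}
where $S(\rho) = \mathrm{tr}(-\rho \log \rho)$ is the von Neumann entropy. Throughout this work, $\log$ denotes the natural logarithm (base $e$).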
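The definition (C19) is straightforward to evaluate numerically for small ensembles. The following NumPy sketch does so in nats; the function names and the single-qubit toy ensembles are illustrative assumptions and do not correspond to the states produced by Bob's computation.

```python
import numpy as np

def von_neumann_entropy(rho):
    """S(rho) = -tr(rho log rho) in nats, computed from the eigenvalues of rho."""
    evals = np.linalg.eigvalsh(rho)
    evals = evals[evals > 1e-12]  # drop numerically zero eigenvalues (0 log 0 = 0)
    return float(-np.sum(evals * np.log(evals)))

def holevo_chi(probs, states):
    """Holevo quantity: S(average state) minus the average entropy of the states."""
    average_state = sum(p * rho for p, rho in zip(probs, states))
    average_entropy = sum(p * von_neumann_entropy(rho) for p, rho in zip(probs, states))
    return von_neumann_entropy(average_state) - average_entropy

# Toy single-qubit ensembles (hypothetical examples).
ket0 = np.array([1.0, 0.0])
ket1 = np.array([0.0, 1.0])
ket_plus = np.array([1.0, 1.0]) / np.sqrt(2.0)
proj = lambda v: np.outer(v, v.conj())

chi_orthogonal = holevo_chi([0.5, 0.5], [proj(ket0), proj(ket1)])
chi_identical = holevo_chi([0.5, 0.5], [proj(ket0), proj(ket0)])
chi_overlapping = holevo_chi([0.5, 0.5], [proj(ket0), proj(ket_plus)])
print(chi_orthogonal, chi_identical, chi_overlapping)
```

Orthogonal states carry a full bit ($\log 2$ nats), identical states carry nothing, and overlapping states fall strictly in between, which matches the role of $\chi$ as a cap on the accessible mutual information in (C18).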
Recall that Bob produces $\rho_{\mathcal{E}_X}$ by utilizing a total of $N_Q$ channel copies obtained from Alice. We can use the specific layout (C3) of Bob's quantum computation to produce an upper bound on the Holevo $\chi$ quantity:
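\begin{equation}
\chi(X : \rho_{\mathcal{E}_X}) \leq O(m N_Q). \tag{C20}
\end{equation}
This bound follows from induction over a sample-resolved variant of Bob's quantum computation. For $t = 0, 1, \ldots, N_Q$, we will show that
\begin{equation}
\rho^t_{\mathcal{E}_X} = \mathcal{C}_t (\mathcal{E}_X \otimes \mathcal{I}) \mathcal{C}_{t-1} \cdots \mathcal{C}_1 (\mathcal{E}_X \otimes \mathcal{I}) \mathcal{C}_0 (\rho_0) \quad \text{obeys} \quad \chi(X : \rho^t_{\mathcal{E}_X}) \leq (2 \log 2) m t. \tag{C21}
\end{equation}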
Bound (C20) then follows from recognizing that setting $t = N_Q$ reproduces Bob's complete computation, see Equation (C3). The base case ($t = 0$) is simple, because $\rho^0_{\mathcal{E}_X} = \mathcal{C}_0(\rho_0)$ does not depend on $X$ at all. This ensures