arXiv:2001.10477v3 [quant-ph] 29 Oct 2020

Statistical Limits of Supervised Quantum Learning

Carlo Ciliberto,1 Andrea Rocchetto,2,3 Alessandro Rudi,4 and Leonard Wossnig5,6

1 Department of Electrical and Electronic Engineering, Imperial College London, London SW7 2BT, United Kingdom
2 Department of Computer Science, University of Texas at Austin, Austin, TX 78712, USA
3 Kavli Institute for Theoretical Physics, University of California, Santa Barbara, CA 93106, USA
4 INRIA - Sierra project team, Paris 75012, France
5 Department of Computer Science, University College London, London WC1E 6EA, United Kingdom
6 Rahko Limited, London N4 3JP, United Kingdom

Within the framework of statistical learning theory it is possible to bound the minimum number of samples required by a learner to reach a target accuracy. We show that if the bound on the accuracy is taken into account, quantum machine learning algorithms for supervised learning (for which statistical guarantees are available) cannot achieve polylogarithmic runtimes in the input dimension. We conclude that, when no further assumptions on the problem are made, quantum machine learning algorithms for supervised learning can have at most polynomial speedups over efficient classical algorithms, even in cases where quantum access to the data is naturally available.

A wide class of quantum algorithms for supervised learning problems (where the goal is to infer a mapping given examples of an input-output relation) exploit fast quantum linear algebra subroutines to achieve runtimes that are exponentially faster than their classical counterparts [1, 2]. Examples of these algorithms are quantum support vector machines [3], quantum linear regression [4, 5], and quantum least squares [6, 7].

A careful analysis of these algorithms identified a number of caveats that limit their practical applicability, such as the need for a strong form of quantum access to the input data, restrictions on structural properties of the data matrix (such as condition number or sparsity), and modes of access to the output [8]. Furthermore, if one assumes that it is efficient to (classically) sample elements of the training data in a way proportional to their norm, then it is possible to show that classical algorithms are only polynomially slower (albeit the scaling of the quantum algorithms can be considerably better) [9–13].

In this paper we continue to investigate the limitations of quantum algorithms for supervised learning problems. Our analysis focuses on the dependency on the size of the data set that is introduced when considering the statistical guarantees of the estimators. The key elements of our work are a series of well known results in statistical learning theory that show how the accuracy parameter of a supervised learning problem scales inverse polynomially with the number of samples in the training set. We leverage on these insights to show that quantum learning algorithms must have at least polynomial runtime in the dimension of the training data and therefore cannot achieve exponential speedups over classical machine learning algorithms. We remark that our results do not rule out exponential advantages for learning problems where no efficient classical algorithms are known. In fact, in this regime, there exist learning problems for which quantum algorithms have a superpolynomial advantage [14, 15].

Our results are independent of the modes of access to the training data, that is, even if the data set is naturally stored in a quantum structure, quantum machine learning algorithms can have at most a polynomial advantage over their classical variants.

Finally, we note that our results do not assume any prior knowledge on the function to be learned. This allows us to make statements on virtually every possible learning algorithm, including neural networks. Using stronger assumptions on the target function it is possible to improve the dependency of the accuracy in the number of samples (consider here the limit case where the function is known: zero samples can determine the function with maximum accuracy).

Statistical Learning Theory. The field of statistical learning theory investigates how to quantify the statistical resources required to solve a learning problem [16]. In this work, we consider supervised learning settings where the goal is to find a model that fits well a set of input-output training examples but that, more importantly, guarantees good prediction performance on new observations. This latter property, also known as generalisation capability of the learned model, is a key aspect separating machine learning from the standard optimisation literature. Indeed, while data fitting is often approached as an optimisation problem in practice, the focus of machine learning is to design statistical estimators able to 'fit' well future examples.

More formally, let $\rho$ be a distribution over $X \times Y$, with $X$ a domain (or input) set and $Y$ a label (or output) set. The goal of supervised learning is to produce a hypothesis $f : X \to Y$ such that the so-called expected risk or expected error

\[ \mathcal{E}(f) = \mathbb{E}_{\rho}\big[\ell(y, f(x))\big] \tag{1} \]

is small with respect to a suitable loss function $\ell : Y \times Y \to \mathbb{R}$ measuring prediction errors. However, in practice, the target distribution $\rho$ is unknown and only accessible by means of a finite training set $S_n = \{(x_i, y_i)\}_{i=1}^{n}$ of $n$ i.i.d. points sampled from it.

Depending on whether the label set $Y$ is dense or discrete, the task is called regression (dense) or classification (discrete). Typical loss functions are the quadratic loss $\ell_{sq}(f(x), y) = (f(x) - y)^2$ over $Y = \mathbb{R}$ for regression and the 0-1 loss

$\ell_{0-1}(f(x), y) = \mathbb{1}_{f(x) \neq y}$ over $Y = \{-1, 1\}$ for classification.

Different frameworks have different prescriptions on how to choose the hypothesis $f$. The Empirical Risk Minimisation (ERM) framework prescribes choosing a hypothesis that minimises the empirical risk

\[ \inf_{f \in \mathcal{H}} \hat{\mathcal{E}}(f), \qquad \hat{\mathcal{E}}(f) = \frac{1}{n} \sum_{(x_i, y_i) \in S_n} \ell(y_i, f(x_i)), \tag{2} \]

over a suitable hypotheses space $\mathcal{H}$. Under weak assumptions on $\mathcal{H}$ (for instance a bounded subset of a Hilbert space [16]), it is possible to guarantee the existence of a minimiser for (2) that we denote $\hat{f} = \arg\min_{f \in \mathcal{H}} \hat{\mathcal{E}}(f)$.

The difference between risk and empirical risk is called generalisation error and plays a central role in statistical learning theory. Indeed, when (1) admits a minimiser in $\mathcal{H}$, we have

\[ \mathcal{E}(\hat{f}) - \inf_{f \in \mathcal{H}} \mathcal{E}(f) \le 2 \sup_{f \in \mathcal{H}} \big| \hat{\mathcal{E}}(f) - \mathcal{E}(f) \big|. \tag{3} \]

In other words, the excess risk incurred by the empirical risk minimiser is controlled by the worst generalisation error over $\mathcal{H}$. A fundamental result in statistical learning theory [16–18], often referred to in the literature as the fundamental theorem of statistical learning, is that for every $n \in \mathbb{N}$, $\delta \in (0, 1)$, and every distribution $\rho$, the following holds with probability larger than $1 - \delta$:

\[ \sup_{f \in \mathcal{H}} \big| \hat{\mathcal{E}}(f) - \mathcal{E}(f) \big| \le \Theta\!\left( \sqrt{\frac{c(\mathcal{H}) + \log(1/\delta)}{n}} \right), \tag{4} \]

where $c(\mathcal{H})$ is a measure of the complexity of $\mathcal{H}$ (such as the VC dimension, covering numbers, or Rademacher complexity, to name a few [16, 19]). Intuitively, the dependency on $c(\mathcal{H})$ in (4) models the phenomenon known as overfitting, in which a large hypothesis space incurs low training (empirical) error but performs poorly on the true risk. This problem can be addressed with so-called regularisation techniques, which essentially limit the expressive power of the learned estimator in order to avoid overfitting the training dataset.

Different regularisation strategies have been proposed in the literature (see [17, 20, 21] for an introduction to the main ideas), and one of the well-established approaches which directly imposes constraints on the hypotheses class of candidate predictors is Tikhonov regularisation. Regularisation ideas have led to popular machine learning approaches which are widely used in practice, such as Regularised Least Squares [19], Gaussian Process (GP) Regression and Classification [22], [20], and Support Vector Machines (SVM) [17]. All these algorithms can be studied within the framework of kernel methods [23].

From a computational perspective, these approaches compute a solution for the learning problem by optimising over the constrained objective, which typically consists of a sequence of standard linear algebra operations such as matrix multiplication and inversion. For most classical algorithms, such as GP or SVM, the computational time is therefore $O(n^3)$, which is similar to the time it requires to invert a square matrix that has size equal to the number $n$ of examples in the training set. Notably, this can be improved depending on the sparsity and the conditioning of the specific optimisation problem.

To reduce the computational cost, instead of considering the optimisation problem as a separate process from the statistical one, more recent methods hinge on the intuition that reducing the computational burden of the learning algorithm can be interpreted as a form of regularisation on its own. For instance, early stopping approaches are now widely used in practice, and perform only a limited number of steps of an iterative optimisation algorithm, to avoid overfitting the training set. They thereby entail fewer operations, while provably maintaining the same generalisation error of approaches such as Tikhonov regularisation [21]. More specifically, prototypical results (such as [21]) show that the number of iterations required is of the order of $1/\lambda$, where $\lambda$ is the ideal regularisation parameter that one would use for ERM. Therefore, if in the worst case scenario $\lambda = O(1/\sqrt{n})$, early stopping would attain (up to constants) the same generalisation error of regularised ERM by performing only $\sqrt{n}$ iterations.

A different approach, also known as divide and conquer, is based on the idea of distributing portions of the training data onto separate machines, each solving a smaller learning problem, and then combining individual predictors into a joint one. This computation hence benefits from both the parallelisation and the reduced dimension of distributed datasets while similarly maintaining statistical guarantees [24].

A third approach that has recently received significant attention from the machine learning community, along with the quantum community, is based on random sub-sampling, a form of dimensionality reduction. Depending on how such sampling is performed, different methods have been proposed, the most well-known being random features [25] and Nyström approaches [26, 27]. Here the computational advantage is clearly given by the smaller dimensionality of the hypotheses space, and it has also recently been shown that it is possible to obtain equivalent generalisation error to classical methods in these settings [28].

For all these methods, training times can typically be reduced from the $O(n^3)$ of standard approaches to $\tilde{O}(n^2)$ or $\tilde{O}(nz)$, where $z$ is the number of non-zero entries, while keeping the statistical performance of the learned estimator essentially unaltered.

Quantum learning algorithms. Linear algebra subroutines are a central computational element of learning algorithms. A large class of quantum algorithms for supervised learning problems claim exponential speed-ups compared to classical algorithms by making use of fast quantum linear algebra subroutines [3–7, 29, 30]. One widely used algorithm is the quantum linear system solver [31] (also known as HHL after the three authors Harrow, Hassidim, and Lloyd). The algorithm takes as input a quantum encoding of the vector $b \in \mathbb{R}^n$ and

an $s$-sparse matrix $A \in \mathbb{R}^{n \times n}$, with $\|A\| \le 1$, and outputs an approximation $|\tilde{w}\rangle$ of the solution $|w\rangle = A^{-1} b$ of the linear system such that

\[ \big\| \, |\tilde{w}\rangle - |w\rangle \, \big\| \le \gamma \tag{5} \]

for an error parameter $\gamma > 0$. The current best implementation of the algorithm runs in time [7]

\[ O\big( \|A\|_F \, \kappa \, \mathrm{polylog}(\kappa, n, 1/\gamma) \big), \tag{6} \]

where $\|A\|_F$ is the Frobenius norm of $A$ and $\kappa$ its condition number. Note that the HHL algorithm requires access to the data matrix $A \in \mathbb{R}^{n \times d}$ in $O(\mathrm{polylog}(nd))$ time. All the quantum learning algorithms we discuss in this paper inherit this assumption. Recently, it was proven that such strong oracular assumptions (when the data matrix is low-rank) also lead to exponentially faster classical algorithms [9, 10, 12, 13]. We recommend [2, 8] for more detailed discussions of the limits of quantum learning algorithms based on fast linear algebra subroutines.

Before proceeding to the statistical analysis of quantum learning algorithms we review some quantum algorithms for the least squares problem. These will serve as the main examples in our analysis.

Quantum least squares. Least squares is an algorithm for minimising the empirical risk, with respect to the squared loss, for the hypothesis class of linear functions. More specifically, let $X = \mathbb{R}^d$ and $Y = \mathbb{R}$, and let $\mathcal{H} := \{ f : X \to Y \mid \exists\, w \in \mathbb{R}^d : f(x) = w^T x \}$ be the hypothesis class of linear functions. The empirical risk is

\[ \hat{\mathcal{E}}(f) := \frac{1}{n} \sum_{i=1}^{n} \big( w^T x_i - y_i \big)^2. \tag{7} \]

We can minimise the empirical risk by setting its gradient to zero. Using $X := \sum_i x_i x_i^T$ and $b := \sum_i y_i x_i$ one can write a closed form solution to the least squares problem as $w = X^{-1} b$.

Several quantum algorithms for least squares (or, more generally, linear regression problems) have been proposed [4, 6, 7, 29, 30]. A common feature is that they use a fast quantum linear system algorithm to find a quantum encoding $|w\rangle$ of the solution vector $w = X^{-1} b$. The fastest known algorithm in the class [7], which improves the dependency on the error from polynomial to logarithmic, solves the (regularised) least squares or linear regression problem in time

\[ O\big( \|X\|_F \, \kappa \, \mathrm{polylog}(n, \kappa, 1/\gamma) \big), \tag{8} \]

where $\kappa$ is the condition number of $X$ and $\gamma > 0$ is an approximation parameter. As for every other quantum algorithm discussed in this paper, the quantum least squares solver requires a quantum-accessible data structure. The dependency on the Frobenius norm implies that it is possible to obtain a speedup only when $X$ is low-rank (but non-sparse). Due to approximation errors, the output of the algorithm is not $|w\rangle$ but a quantum state $|\tilde{w}\rangle$, such that $\| \, |\tilde{w}\rangle - |w\rangle \, \| \le \gamma$.

It is possible to get rid of the dependency on the Frobenius norm using the sample based Hamiltonian simulation method [32, 33]. Leveraging this technique, [5] proposed a least squares algorithm whose scaling does not depend on the Frobenius norm but requires a higher number of copies (with respect to [7]) of the input density matrix. Note that, because the algorithm in [5] is posed in the query model, i.e. the computational complexity is given in number of calls to the oracle which returns the data already encoded in the form of a quantum state, it is not possible to make a direct comparison between the two algorithms. The computational complexity of the algorithm given in [5] is

\[ O\big( \kappa^2 \gamma^{-3} \, \mathrm{polylog}(n) \big), \tag{9} \]

and the dependency on the error is polynomial.

Quantum speed-ups and statistical bounds. In this section we analyse the speed-up claims of quantum machine learning algorithms using the framework of statistical learning theory. Our main point is that if one considers the $\Theta(n^{-1/2})$ scaling of the generalisation error, see (4), quantum learning algorithms cannot achieve polylogarithmic runtime in $n$.

The starting point of our discussion is the following standard error decomposition. Consider a hypothesis $f$. We want to bound how far the generalisation error of $f$ is from the best possible generalisation error; this is known as the Bayes risk and is indicated by $\mathcal{E}^* := \inf_{f \in \mathcal{F}} \mathcal{E}(f)$, where $\mathcal{F}$ denotes the set of all measurable functions $f : X \to Y$. We want to decompose this general error into different components and for this reason we introduce $\mathcal{E}_{\mathcal{H}} := \inf_{f \in \mathcal{H}} \mathcal{E}(f)$, that is, the best risk attainable by a function in the hypothesis space $\mathcal{H}$. In order to simplify our discussion let us assume that $\mathcal{E}_{\mathcal{H}}$ always admits a minimiser $f_{\mathcal{H}} \in \mathcal{H}$ (it is possible to lift this assumption using the theory of regularisation). Recalling that $\hat{\mathcal{E}}(\hat{f}) := \inf_{f \in \mathcal{H}} \hat{\mathcal{E}}(f)$, we can decompose the total error as:

\[ \mathcal{E}(f) - \mathcal{E}^* = \underbrace{\mathcal{E}(f) - \mathcal{E}(\hat{f})}_{\text{Optimisation error}} + \underbrace{\mathcal{E}(\hat{f}) - \mathcal{E}_{\mathcal{H}}}_{\text{Estimation error}} + \underbrace{\mathcal{E}_{\mathcal{H}} - \mathcal{E}^*}_{\text{Irreducible error}} \tag{10} \]
\[ = \xi + \Theta(1/\sqrt{n}) + \mu. \tag{11} \]

The first term in (10) is the optimisation error and measures how good the optimisation that generated $f$ is with respect to the ideal minimisation of the empirical risk. This error is related to the approximation error of the algorithm. The second term is the estimation error and models the error that we make by estimating the true risk using samples from the distribution $\rho$. This is the generalisation bound we discussed in (4). The third term is the irreducible error and measures how well the hypothesis space models the problem. It is an irreducible source of error that we indicate with the letter $\mu$. If the irreducible error is zero then we say that $\mathcal{H}$ is universal. For simplicity, we assume throughout the paper that $\mu = 0$.

From the error decomposition in (10) we see that in order to have an algorithm with optimal statistical performance we must make sure that the optimisation error is not larger than the estimation error. Therefore the optimisation error must scale at most as the best estimation error. If it does, we say that the optimisation error matches the bound of the estimation error.

In order to make the notion of matching the bound more concrete, let us consider again the case of least squares. The closed form solution $w = X^{-1} b$ requires $O(n^3)$ time to be computed and attains essentially zero optimisation error. Because the total error is dominated by the $1/\sqrt{n}$ term of the estimation error, one may wonder about the convenience of paying a cost of order $O(n^3)$ to achieve zero optimisation error. A careful analysis shows that this is indeed not a convenient choice and that it is possible to design algorithms that are less accurate but converge faster to estimators that, albeit not attaining zero optimisation error, achieve an error that matches the bound. This is the approach taken by early stopping, divide and conquer, and random sub-sampling methods. For many quantum algorithms, such as some of the quantum linear regression and least squares algorithms we discussed in the previous section (e.g. [3, 5]), the time complexity depends inverse polynomially on the error and the matching procedure has important consequences. In the next section we discuss these implications and show that, in order to obtain an optimisation error that scales at most as the best estimation error, one should expect to pay a computational price which is polynomial in $n$.

For quantum algorithms with polylogarithmic error dependency, such as [7], the optimisation error is lower than the estimation error and therefore there are no bounds to be matched. In this case, we show that quantum algorithms cannot achieve polylogarithmic runtime in the dimension of the training set, based on an argument that analyses the error dependency introduced via the finite sampling process that is required to extract a classical output from the algorithm. This will be discussed in a later section.

We begin by discussing the dependency on the error and then proceed to discuss the dependency on the measurement errors. We summarise the results of our analysis in Table I.

TABLE I. Summary of time complexities for training and testing of different classical and quantum algorithms when statistical guarantees are taken into account. We omit polylog(n, d) dependencies for the quantum algorithms. We assume that the generalisation error scales as Θ(1/√n) and count the effects of measurement errors. The acronyms in the table refer to: support vector machines (SVM), kernel ridge regression (KRR), quantum kernel least squares (QKLS), quantum kernel linear regression (QKLR), and quantum support vector machines (QSVM). Note that for quantum algorithms the state obtained after training cannot be maintained or copied and the algorithm must be retrained after each test round. This brings a factor proportional to the train time in the test time of quantum algorithms. Because the condition number may also depend on n and for quantum algorithms this dependency may be worse, the overall scaling of the quantum algorithms may be slower than the classical.

              Algorithm                  Train time   Test time
  Classical   SVM / KRR                  n^3          n
              KRR [34–38]                n^2          n
              Divide and conquer [24]    n^2          n
              Nyström [27, 28]           n^2          √n
              FALKON [39]                n√n          √n
  Quantum     QKLS / QKLR [7]            √n           n√n
              QSVM [3]                   n√n          n^2√n

Error dependency of the quantum algorithms. In this section we show that in order to have a total error (see (10)) that scales as $1/\sqrt{n}$ we must introduce a polynomial $n$-dependency in the quantum algorithm. For simplicity, we present our argument by discussing the case of quantum least squares algorithms with inverse polynomial dependency on the error [4, 5, 29]. Our results easily generalise to all kernel methods.

For a $\gamma$ error guarantee on the final output state, the quantum algorithms we consider have a time complexity that scales as $O(\kappa^{c} \gamma^{-\beta} \, \mathrm{polylog}(n))$ for some $\beta, c > 0$. For example, $\beta = 3$ in [5] and $\beta = 4$ in [40].

Since for the quantum algorithm the data matrix needs either to be Hermitian or encoded in a larger Hermitian matrix such that the dimensionality of the matrix is $n \times d$ for $n$ data points in $\mathbb{R}^d$, we assume here for simplicity that the data is given by an $n \times n$ Hermitian matrix, i.e., $n$ points in $\mathbb{R}^n$.

In order to give a precise bound to the optimisation error term in (10) in terms of the approximation error of the quantum algorithm we consider the following decomposition between the ideal minimiser of the empirical risk $\hat{f}$ and the approximate minimiser $\hat{f}_\gamma$, output of the learning algorithm:

\[ \mathcal{E}(\hat{f}_\gamma) - \mathcal{E}(\hat{f}) = \underbrace{\mathcal{E}(\hat{f}_\gamma) - \hat{\mathcal{E}}(\hat{f}_\gamma)}_{\text{Generalisation error}} + \underbrace{\hat{\mathcal{E}}(\hat{f}_\gamma) - \hat{\mathcal{E}}(\hat{f})}_{\text{Algorithmic error}} + \underbrace{\hat{\mathcal{E}}(\hat{f}) - \mathcal{E}(\hat{f})}_{\text{Generalisation error}} \tag{12} \]
\[ = \Theta(n^{-1/2}) + \underbrace{\hat{\mathcal{E}}(\hat{f}_\gamma) - \hat{\mathcal{E}}(\hat{f})}_{\text{Algorithmic error}}, \tag{13} \]

where the first and third contributions result from the generalisation error bounds and the second is the approximation error of the quantum algorithm. In order to achieve the best statistical performance the algorithmic error must scale at worst as the worst statistical error, that is $\hat{\mathcal{E}}(\hat{f}_\gamma) - \hat{\mathcal{E}}(\hat{f}) = O(n^{-1/2})$.

Let us analyse the algorithmic error term for the linear regression and least squares problems. Assuming that the output of the quantum algorithm is a state $|\tilde{w}\rangle$ while the exact minimiser of the empirical risk is $|w\rangle$, with $\| \, |\tilde{w}\rangle - |w\rangle \, \| \le$

$\gamma$, we find that (assuming $|X|$ and $|Y|$ are bounded)

\[ \big| \hat{\mathcal{E}}(\hat{f}_\gamma) - \hat{\mathcal{E}}(\hat{f}) \big| \le \frac{1}{n} \sum_{i=1}^{n} \Big| \big( \tilde{w}^T x_i - y_i \big)^2 - \big( w^T x_i - y_i \big)^2 \Big| \tag{14} \]
\[ \le \frac{L}{n} \sum_{i=1}^{n} \big| (\tilde{w} - w)^T x_i \big| \tag{15} \]
\[ \le \frac{L}{n} \sum_{i=1}^{n} \| \tilde{w} - w \| \, \| x_i \| \le k \cdot \gamma, \tag{16} \]

where $k > 0$ is a constant and the inequality comes from Cauchy-Schwarz and the fact that, because $|X|$ and $|Y|$ are bounded, we have that, for the square loss $\ell_{sq}$, the following inequality holds: $| \ell_{sq}(f(x_1), y_1) - \ell_{sq}(f(x_2), y_2) | \le L \, | (f(x_1) - y_1) - (f(x_2) - y_2) |$ for some $L > 0$.

In order to have an algorithm that achieves the best possible statistical accuracy, we need the algorithmic error to scale at worst as the statistical error. This can be obtained by setting $\gamma = n^{-1/2}$. In this case, the time complexity of quantum least squares becomes

\[ O\big( \kappa^{c} \, n^{\beta/2} \log(n) \big), \tag{17} \]

for some constant $c$.

Measurement errors in quantum algorithms. So far we have ignored the error introduced by the measurement process used to compute a classical estimate of the output of the quantum algorithm. In practice, this corresponds to the estimation of expected values of quantum operators. With a classical statistical analysis of the errors, and assuming the measurements are statistically independent, it is possible to show, using the central limit theorem, that the estimation error for a quantum expected value scales as $1/\sqrt{m}$, where $m$ is the number of measurements [41]. This is known as the standard quantum limit or the shot-noise limit. Using techniques developed within the field of quantum metrology it is often possible to overcome this limit, using the same physical resources and the addition of quantum effects such as entanglement, and obtain a precision that scales as $1/m$. It is possible to show that this is the ultimate limit to measurement precision and follows directly from the Heisenberg uncertainty principle [41, 42].

In this section we analyse the contribution of the measurement error to the time complexity of quantum learning algorithms. Let us consider again the case of quantum least squares. The (quantum) output of the algorithm is the state $|\tilde{w}\rangle$, an approximation (due to algorithmic errors) of the ideal output $|w\rangle$. Using techniques such as quantum state tomography we can produce a classical estimate $\hat{w}$ of the vector $\tilde{w}$ with accuracy

\[ \| \tilde{w} - \hat{w} \| \le \tau = \Omega(1/m), \tag{18} \]

where $m$ is the number of measurements performed for the estimation of the expected values on $|\tilde{w}\rangle$.

Let $y$ be the ideal prediction. We have two sources of error, the algorithmic error and the error coming from the estimation process:

\[ | y - \hat{y} | = | w^T x - \hat{w}^T x | \tag{19} \]
\[ \le \big( \| w - \tilde{w} \| + \tau \big) \| x \| \tag{20} \]
\[ \le (\gamma + \tau) \| x \|, \tag{21} \]

where we used Cauchy-Schwarz and $\| w - \tilde{w} \| \le \gamma$.

By virtue of (12), if we want an algorithm that attains the best statistical accuracy for the number of samples contained in the training set, we need to make sure that the contribution coming from the measurement error scales at most as the worst possible generalisation error. Recalling that the generalisation error scales as $\Theta(1/\sqrt{n})$ we have that $\tau = O(1/\sqrt{n})$. Since $\tau = \Omega(1/m)$ by (18), this requires $1/m = O(1/\sqrt{n})$, from which it follows that $m = \Omega(\sqrt{n})$. This lower bound on the number of measurements required to extract a classical estimate of the output effectively sets an $\Omega(\sqrt{n})$ lower bound on the time complexity of all supervised quantum machine learning algorithms.

If we consider this lower bound, classical algorithms can have time complexities matching those of the quantum algorithms. For a more detailed comparison of the runtime of popular classical and quantum algorithms for supervised learning problems see Table I.

Conclusions. Quantum machine learning algorithms promise to be exponentially faster than classical methods. In this paper, we use standard results from statistical learning theory to rule out quantum machine learning algorithms with polylogarithmic time complexity in the input dimensions. Considering that almost any current and practically used machine learning algorithm has polynomial runtime, our results warn against the possibility of superpolynomial advantages for supervised quantum machine learning. We remark two limitations of our analysis. First, our results do not rule out exponential advantages over classical algorithms with superpolynomial runtime. Second, we do not make assumptions on the hypothesis space; using prior knowledge it is possible to get error rates that converge faster than $1/\sqrt{n}$.

Our argument leverages the fact that the statistical error of the algorithm has a provable polynomial dependence on the number of samples in the training set. Since the statistical error and the approximation error of the algorithm are additive, in order to achieve the best possible error rate, the asymptotic scaling of the statistical error must match that of the approximation error. This matching forces the approximation error of quantum algorithms to scale polynomially with the number of samples. This effectively rules out superpolynomial quantum speedups for algorithms that have polynomial dependence on the error.

For algorithms where the dependency on the error is logarithmic, this argument does not apply. In this case, we show that the sampling error coming from the measurement process also adds to the total error and this introduces a polynomial dependency on the number of samples that rules out the superpolynomial speedup.

Notably, our results hold even assuming that quantum algorithms can access a quantum data structure at no cost. In this respect, we prove a stronger 'no-go' result for quantum learning than the one proved by Tang in [9]. Indeed, the latter relies on a classical data structure that mimics a quantum data structure but is unrealistic in practice.

As future directions, it is worth mentioning that it may be possible to strengthen our results by analysing the $n$ dependency of the condition number. Previous results in this direction are discussed in [19, 43].

The authors would like to thank Aram Harrow, Sathya Subramanian, and Maria Schuld for helpful feedback on an earlier draft of the article, Shantanav Chakraborty and Stacey Jeffery for discussions on the Frobenius norm dependency of their quantum least squares algorithm, and Alessandro Davide Ialongo for helpful comments. This research was supported in part by the National Science Foundation under Grant No. NSF PHY-1748958 and by the Heising-Simons Foundation. A.R. is supported by the Simons Foundation through It from Qubit: Simons Collaboration on Quantum Fields, Gravity, and Information. L.W. is supported by a Google PhD Fellowship.

[1] J. Biamonte, P. Wittek, N. Pancotti, P. Rebentrost, N. Wiebe, and S. Lloyd, Nature 549, 195 (2017).
[2] C. Ciliberto, M. Herbster, A. D. Ialongo, M. Pontil, A. Rocchetto, S. Severini, and L. Wossnig, Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences 474, 20170551 (2018).
[3] P. Rebentrost, M. Mohseni, and S. Lloyd, Physical Review Letters 113, 130503 (2014).
[4] N. Wiebe, D. Braun, and S. Lloyd, Physical Review Letters 109, 050505 (2012).
[5] M. Schuld, I. Sinayskiy, and F. Petruccione, Physical Review A 94, 022342 (2016).
[6] I. Kerenidis and A. Prakash, arXiv preprint arXiv:1704.04992 (2017).
[7] S. Chakraborty, A. Gilyén, and S. Jeffery, arXiv preprint arXiv:1804.01973 (2018).
[8] S. Aaronson, Nature Physics 11, 291 (2015).
[9] E. Tang, arXiv preprint arXiv:1807.04271 (2018).
[10] N.-H. Chia, H.-H. Lin, and C. Wang, arXiv preprint arXiv:1811.04852 (2018).
[11] N.-H. Chia, T. Li, H.-H. Lin, and C. Wang, arXiv preprint arXiv:1901.03254 (2019).
[12] A. Gilyén, S. Lloyd, and E. Tang, arXiv preprint arXiv:1811.04909 (2018).
[13] N.-H. Chia, A. Gilyén, T. Li, H.-H. Lin, E. Tang, and C. Wang, arXiv preprint arXiv:1910.06151 (2019).
[14] A. B. Grilo, I. Kerenidis, and T. Zijlstra, Physical Review A 99, 032314 (2019).
[15] V. Kanade, A. Rocchetto, and S. Severini, Quantum Information and Computation (2019).
[16] S. Shalev-Shwartz and S. Ben-David, Understanding Machine Learning: From Theory to Algorithms (Cambridge University Press, 2014).
[17] V. N. Vapnik and V. Vapnik, Statistical Learning Theory, Vol. 1 (Wiley New York, 1998).
[18] A. Blumer, A. Ehrenfeucht, D. Haussler, and M. K. Warmuth, Journal of the ACM (JACM) 36, 929 (1989).
[19] F. Cucker and S. Smale, Bulletin of the American Mathematical Society 39, 1 (2002).
[20] C. M. Bishop, Pattern Recognition and Machine Learning (Springer, 2006).
[21] F. Bauer, S. Pereverzev, and L. Rosasco, Journal of Complexity 23, 52 (2007).
[22] C. Rasmussen and C. Williams, Gaussian Processes for Machine Learning (2006).
[23] J. Shawe-Taylor, N. Cristianini, et al., Kernel Methods for Pattern Analysis (Cambridge University Press, 2004).
[24] Y. Zhang, J. Duchi, and M. Wainwright, in Conference on Learning Theory (2013) pp. 592–617.
[25] A. Rahimi, B. Recht, et al., in NIPS, Vol. 3 (2007) p. 5.
[26] A. J. Smola and B. Schölkopf (Morgan Kaufmann, 2000) pp. 911–918.
[27] C. K. Williams and M. Seeger, in Advances in Neural Information Processing Systems (2001) pp. 682–688.
[28] A. Rudi, R. Camoriano, and L. Rosasco, in Advances in Neural Information Processing Systems (2015) pp. 1657–1665.
[29] G. Wang, Physical Review A 96, 012335 (2017).
[30] D.-B. Zhang, S.-L. Zhu, and Z. Wang, arXiv preprint arXiv:1808.09607 (2018).
[31] A. W. Harrow, A. Hassidim, and S. Lloyd, Physical Review Letters 103, 150502 (2009).
[32] S. Lloyd, M. Mohseni, and P. Rebentrost, Nature Physics 10, 631 (2014).
[33] S. Kimmel, C. Y.-Y. Lin, G. H. Low, M. Ozols, and T. J. Yoder, npj Quantum Information 3, 13 (2017).
[34] Y. Yang, M. Pilanci, M. J. Wainwright, et al., The Annals of Statistics 45, 991 (2017).
[35] S. Ma and M. Belkin, in Advances in Neural Information Processing Systems (2017) pp. 3778–3787.
[36] A. Gonen, F. Orabona, and S. Shalev-Shwartz, in International Conference on Machine Learning (2016) pp. 1397–1405.
[37] H. Avron, K. L. Clarkson, and D. P. Woodruff, SIAM Journal on Matrix Analysis and Applications 38, 1116 (2017).
[38] G. E. Fasshauer and M. J. McCourt, SIAM Journal on Scientific Computing 34, A737 (2012).
[39] A. Rudi, L. Carratino, and L. Rosasco, in Advances in Neural Information Processing Systems (2017) pp. 3888–3898.
[40] T. Li, S. Chakrabarti, and X. Wu, in Proceedings of the 36th International Conference on Machine Learning, Vol. 97, edited by K. Chaudhuri and R. Salakhutdinov (PMLR, Long Beach, California, USA, 2019) pp. 3815–3824.
[41] V. Giovannetti, S. Lloyd, and L. Maccone, Science 306, 1330 (2004).
[42] V. Giovannetti, S. Lloyd, and L. Maccone, Physical Review Letters 96, 010401 (2006).
[43] H. Hochstadt, Integral Equations, Vol. 91 (John Wiley & Sons, 2011).
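
Supplementary sketch. The matching argument of the sections on error dependency and measurement errors can be made concrete with a few lines of arithmetic. The Python sketch below is illustrative only and is not part of the original analysis: the function names (`matched_gamma`, `quantum_runtime`, `min_measurements`) are hypothetical, the polylog(n) factor is replaced by a single log(n) as in (17), and the defaults κ = 1 and c = 1 are placeholder assumptions rather than values taken from any cited algorithm. It evaluates the runtime κ^c γ^(-β) log(n) at the matched accuracy γ = n^(-1/2), and the smallest number of measurements m for which a Heisenberg-limited precision 1/m is no larger than 1/√n.

```python
import math


def matched_gamma(n: int) -> float:
    """Accuracy gamma = n^(-1/2) that matches the Theta(1/sqrt(n)) estimation error."""
    return n ** -0.5


def quantum_runtime(n: int, beta: float, kappa: float = 1.0, c: float = 1.0) -> float:
    """Evaluate kappa^c * gamma^(-beta) * log(n) at gamma = n^(-1/2), cf. (17).

    Assumptions: polylog(n) is modelled as log(n); kappa and c are
    illustrative placeholders and constants hidden by O(.) are ignored.
    """
    gamma = matched_gamma(n)
    return kappa ** c * gamma ** (-beta) * math.log(n)


def min_measurements(n: int) -> int:
    """Smallest integer m with 1/m <= 1/sqrt(n), i.e. m >= sqrt(n)."""
    m = math.isqrt(n)  # floor of the square root, exact for integers
    return m if m * m == n else m + 1


if __name__ == "__main__":
    # beta = 3 is the error exponent quoted for [5]; the runtime grows as n^(3/2) log(n).
    for n in (10**2, 10**4, 10**6):
        print(n, quantum_runtime(n, beta=3), min_measurements(n))
```

Under these assumptions the runtime grows as n^(β/2) up to the logarithmic factor, and the measurement count grows as √n, which is the polynomial dependence on the training set size that the argument predicts.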