Statistical Limits of Supervised Quantum Learning

Carlo Ciliberto,1 Andrea Rocchetto,2,3 Alessandro Rudi,4 and Leonard Wossnig5,6

1 Department of Electrical and Electronic Engineering, Imperial College London, London SW7 2BT, United Kingdom
2 Department of Computer Science, University of Texas at Austin, Austin, TX 78712, USA
3 Kavli Institute for Theoretical Physics, University of California, Santa Barbara, CA 93106, USA
4 INRIA - Sierra project team, Paris 75012, France
5 Department of Computer Science, University College London, London WC1E 6EA, United Kingdom
6 Rahko Limited, N4 3JP London, United Kingdom

arXiv:2001.10477v3 [quant-ph] 29 Oct 2020

Within the framework of statistical learning theory it is possible to bound the minimum number of samples required by a learner to reach a target accuracy. We show that if the bound on the accuracy is taken into account, quantum machine learning algorithms for supervised learning for which statistical guarantees are available cannot achieve polylogarithmic runtimes in the input dimension. We conclude that, when no further assumptions on the problem are made, quantum machine learning algorithms for supervised learning can have at most polynomial speedups over efficient classical algorithms, even in cases where quantum access to the data is naturally available.

A wide class of quantum algorithms for supervised learning problems (where the goal is to infer a mapping given examples of an input-output relation) exploit fast quantum linear algebra subroutines to achieve runtimes that are exponentially faster than their classical counterparts [1, 2]. Examples of these algorithms are quantum support vector machines [3], quantum linear regression [4, 5], and quantum least squares [6, 7].

A careful analysis of these algorithms identified a number of caveats that limit their practical applicability, such as the need for a strong form of quantum access to the input data, restrictions on structural properties of the data matrix (such as condition number or sparsity), and modes of access to the output [8]. Furthermore, if one assumes that it is efficient to (classically) sample elements of the training data in a way proportional to their norm, then it is possible to show that classical algorithms are only polynomially slower (albeit the scaling of the quantum algorithms can be considerably better) [9–13].

In this paper we continue to investigate the limitations of quantum algorithms for supervised learning problems. Our analysis focuses on the dependency on the size of the data set that is introduced when considering the statistical guarantees of the estimators. The key elements of our work are a series of well-known results in statistical learning theory that show how the accuracy parameter of a supervised learning problem scales inverse polynomially with the number of samples in the training set. We leverage these insights to show that quantum learning algorithms must have at least polynomial runtime in the dimension of the training data and therefore cannot achieve exponential speedups over classical polynomial-time machine learning algorithms. We remark that our results do not rule out exponential advantages for learning problems where no efficient classical algorithms are known. In fact, in this regime, there exist learning problems for which quantum algorithms have a superpolynomial advantage [14, 15].

Our results are independent of the modes of access to the training data, that is, even if the data set is naturally stored in a quantum structure, quantum machine learning algorithms can have at most polynomial advantage over their classical variants.

Finally, we note that our results do not assume any prior knowledge on the function to be learned. This allows us to make statements on virtually every possible learning algorithm, including neural networks. Using stronger assumptions on the target function it is possible to improve the dependency of the accuracy on the number of samples (consider the limit case where the function is known: here zero samples can determine the function with maximum accuracy).

Statistical Learning Theory. The field of statistical learning theory investigates how to quantify the statistical resources required to solve a learning problem [16]. In this work, we consider supervised learning settings where the goal is to find a model that fits a set of input-output training examples well but that, more importantly, guarantees good prediction performance on new observations. This latter property, also known as the generalisation capability of the learned model, is a key aspect separating machine learning from the standard optimisation literature. Indeed, while data fitting is often approached as an optimisation problem in practice, the focus of machine learning is to design statistical estimators able to 'fit' future examples well.

More formally, let $\rho$ be a distribution over $\mathcal{X} \times \mathcal{Y}$, with $\mathcal{X}$ a so-called domain (or input) set and $\mathcal{Y}$ a label (or output) set. The goal of supervised learning is to produce a hypothesis $f : \mathcal{X} \to \mathcal{Y}$ such that the expected risk or expected error

$$\mathcal{E}(f) := \mathbb{E}_{\rho}\, \ell(y, f(x)) \tag{1}$$

is small with respect to a suitable loss function $\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$ measuring prediction errors. However, in practice, the target distribution $\rho$ is unknown and only accessible by means of a finite training set $S_n = \{(x_i, y_i),\ i = 1, \dots, n\}$ of i.i.d. points sampled from it.

Depending on whether the label set $\mathcal{Y}$ is dense or discrete the task is called regression (dense) or classification (discrete). Typical loss functions are the quadratic loss $\ell_{\mathrm{sq}}(f(x), y) = (f(x) - y)^2$ over $\mathcal{Y} = \mathbb{R}$ for regression and the 0-1 loss $\ell_{0-1}(f(x), y) = \mathbb{1}_{f(x) \neq y}$ over $\mathcal{Y} = \{-1, 1\}$ for classification.

Different machine learning frameworks have different prescriptions on how to choose the hypothesis $f$. The Empirical Risk Minimisation (ERM) framework prescribes choosing a hypothesis that minimises the empirical risk

$$\inf_{f \in \mathcal{H}} \hat{\mathcal{E}}(f), \qquad \hat{\mathcal{E}}(f) = \frac{1}{n} \sum_{(x_i, y_i) \in S_n} \ell(y_i, f(x_i)), \tag{2}$$

over a suitable hypotheses space $\mathcal{H}$. Under weak assumptions on $\mathcal{H}$ (for instance a bounded subset of a Hilbert space [16]), it is possible to guarantee the existence of a minimiser for (2) that we denote $\hat{f} = \arg\min_{f \in \mathcal{H}} \hat{\mathcal{E}}(f)$.

The difference between risk and empirical risk is called the generalisation error and plays a central role in statistical learning theory. Indeed, when (1) admits a minimiser in $\mathcal{H}$, we have

$$\mathcal{E}(\hat{f}) - \inf_{f \in \mathcal{H}} \mathcal{E}(f) \le 2 \sup_{f \in \mathcal{H}} \left| \hat{\mathcal{E}}(f) - \mathcal{E}(f) \right|. \tag{3}$$

In other words, the excess risk incurred by the empirical risk minimiser is controlled by the worst generalisation error over $\mathcal{H}$.

A fundamental result in statistical learning theory [16–18], often referred to in the literature as the fundamental theorem of statistical learning, is that for every $n \in \mathbb{N}$, $\delta \in (0, 1)$, and every distribution $\rho$, the following holds with probability larger than $1 - \delta$:

$$\sup_{f \in \mathcal{H}} \left| \hat{\mathcal{E}}(f) - \mathcal{E}(f) \right| \le \Theta\!\left( \sqrt{\frac{c(\mathcal{H}) + \log(1/\delta)}{n}} \right), \tag{4}$$

where $c(\mathcal{H})$ is a measure of the complexity of $\mathcal{H}$ (such as the VC dimension, covering numbers, or Rademacher complexity, to name a few [16, 19]). Intuitively, the dependency on $c(\mathcal{H})$ in (4) models the phenomenon known as overfitting, in which a large hypothesis space incurs low training (empirical) error but performs poorly on the true risk. This problem can be addressed with so-called regularisation techniques, which essentially limit the expressive power of the learned estimator in order to avoid overfitting the training dataset.

[...] or SVM, the computational time is therefore $O(n^3)$, which is similar to the time it requires to invert a square matrix that has size equal to the number $n$ of examples in the training set. Notably this can be improved depending on the sparsity and the conditioning of the specific optimisation problem.

To reduce the computational cost, instead of considering the optimisation problem as a separate process from the statistical one, more recent methods hinge on the intuition that reducing the computational burden of the learning algorithm can be interpreted as a form of regularisation on its own. For instance, early stopping approaches are now widely used in practice, and perform only a limited number of steps of an iterative optimisation algorithm, to avoid overfitting the training set. They thereby entail fewer operations, while provably maintaining the same generalisation error as approaches such as Tikhonov regularisation [21]. More specifically, prototypical results (such as [21]) show that the number of iterations required is of the order of $1/\lambda$, where $\lambda$ is the ideal regularisation parameter that one would use for ERM. Therefore, if in the worst case scenario $\lambda = O(1/\sqrt{n})$, early stopping would attain (up to constants) the same generalisation error as regularised ERM by performing only $\sqrt{n}$ iterations.

A different approach, also known as divide and conquer, is based on the idea of distributing portions of the training data onto separate machines, each solving a smaller learning problem, and then combining individual predictors into a joint one. This computation hence benefits from both the parallelisation and the reduced dimension of distributed datasets while similarly maintaining statistical guarantees [24].

A third approach that has recently received significant attention from the machine learning community, along with the quantum computing community, is based on random subsampling, a form of dimensionality reduction. Depending on how such sampling is performed, different methods have been proposed, the most well-known being random features [25] and Nyström approaches [26, 27]. Here the computational advantage is clearly given by the smaller dimensionality of the hypotheses space, and it has also recently been shown that it is possible to obtain equivalent generalisation error to classical [...]
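As a rough numerical illustration of how the uniform deviation bound (4) ties the target accuracy to the number of samples, the sketch below evaluates the right-hand side of (4) and inverts it to find the smallest training-set size that meets a given accuracy. It is only a sketch under stated assumptions: the constant hidden by the $\Theta$ notation is set to 1 by default, the complexity value is an arbitrary placeholder, and the function names `uniform_deviation_bound` and `samples_for_accuracy` are hypothetical helpers introduced here, not part of the paper.

```python
import math


def uniform_deviation_bound(complexity: float, n: int, delta: float,
                            const: float = 1.0) -> float:
    """Right-hand side of bound (4): const * sqrt((c(H) + log(1/delta)) / n).

    `const` stands in for the constant hidden by the Theta notation
    (assumed to be 1 here purely for illustration).
    """
    if n <= 0:
        raise ValueError("n must be a positive integer")
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie in (0, 1)")
    return const * math.sqrt((complexity + math.log(1.0 / delta)) / n)


def samples_for_accuracy(complexity: float, epsilon: float, delta: float,
                         const: float = 1.0) -> int:
    """Smallest n for which the bound above is at most epsilon.

    Solving const * sqrt((c + log(1/delta)) / n) <= epsilon for n gives
    n >= const^2 * (c + log(1/delta)) / epsilon^2.
    """
    if epsilon <= 0.0:
        raise ValueError("epsilon must be positive")
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie in (0, 1)")
    return math.ceil(const ** 2 * (complexity + math.log(1.0 / delta)) / epsilon ** 2)


if __name__ == "__main__":
    c_H = 10.0    # placeholder complexity value (assumption)
    delta = 0.05  # failure probability
    for eps in (0.2, 0.1, 0.05, 0.025):
        n = samples_for_accuracy(c_H, eps, delta)
        print(f"epsilon={eps:<6} n>={n:<7} "
              f"bound at n: {uniform_deviation_bound(c_H, n, delta):.4f}")
```

Under these assumptions, halving the target accuracy roughly quadruples the required number of samples, which is the inverse-polynomial scaling between accuracy and training-set size that the argument relies on.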