Quantum Machine Learning For Classical Data
Leonard P. Wossnig
A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy of University College London.
Department of Computer Science, University College London
arXiv:2105.03684v2 [quant-ph], 12 May 2021
May 13, 2021

I, Leonard P. Wossnig, confirm that the work presented in this thesis is my own. Where information has been derived from other sources, I confirm that this has been indicated in the work.

Abstract
In this dissertation, we study the intersection of quantum computing and supervised machine learning algorithms, which means that we investigate quantum algorithms for supervised machine learning that operate on classical data. This area of research falls under the umbrella of quantum machine learning, a research area of computer science which has recently received wide attention. In particular, we investigate to what extent quantum computers can be used to accelerate supervised machine learning algorithms. The aim of this is to develop a clear understanding of the promises and limitations of the current state of the art of quantum algorithms for supervised machine learning, but also to define directions for future research in this exciting field. We start by looking at supervised quantum machine learning (QML) algorithms through the lens of statistical learning theory. In this framework, we derive novel bounds on the computational complexities of a large set of supervised QML algorithms under the requirement of optimal learning rates. Next, we give a new bound for Hamiltonian simulation of dense Hamiltonians, a major subroutine of most known supervised QML algorithms, and then derive a classical algorithm with nearly the same complexity. We then draw the parallels to recent 'quantum-inspired' results and explain the implications of these results for quantum machine learning applications. Looking for areas which might bear larger advantages for QML algorithms, we finally propose a novel algorithm for quantum Boltzmann machines, and argue that quantum algorithms for quantum data are one of the most promising applications for QML, with a potentially exponential advantage over classical approaches.

Acknowledgements
I want to thank foremost my supervisor and friend Simone Severini, who has always given me the freedom to pursue any direction I found interesting and promising, and has served me as a guide through most of my PhD. Next, I want to thank Aram Harrow, my secondary advisor, who has always been readily available to answer my questions and discuss a variety of research topics with me. I also want to thank Carlo Ciliberto, Nathan Wiebe, and Patrick Rebentrost, who have worked closely with me and have also taught me most of the mathematical tricks and methods upon which my thesis is built. I furthermore want to thank all my collaborators throughout the years. These are in particular Chunhao Wang, Andrea Rocchetto, Marcello Benedetti, Alessandro Rudi, Raban Iten, Mark Herbster, Massimiliano Pontil, Maria Schuld, Zhikuan Zhao, Anupam Prakash, Shuxiang Cao, Hongxiang Chen, Shashanka Ubaru, Haim Avron, and Ivan Rungger. Almost in an equal contribution, I also want to thank Fernando Brandão, Youssef Mroueh, Guang Hao Low, Robin Kothari, Yuan Su, Tongyang Li, Ewin Tang, Kanav Setia, Matthias Troyer, and Damian Steiger for many helpful discussions, feedback, and enlightening explanations. I am particularly grateful to Edward Grant, Miriam Cha, and Ian Horobin, who made it possible for me to write this thesis. I want to acknowledge UCL for giving me the opportunity to pursue this PhD thesis, and acknowledge the kind support of a Royal Society Research Grant and the Google PhD Fellowship, which gave me the freedom to work on these interesting topics.
Portions of the work that are included in this thesis were completed while I was visiting the Institut Henri Poincaré of the Sorbonne University in Paris. I particularly want to thank Riam Kim-McLeod for the support and help with the editing of the thesis. I finally want to thank my family for the continued love and support.

Impact Statement
Quantum machine learning bears promise for many areas, ranging from the healthcare to the financial industry. In today's world, where data is available in abundance, only novel algorithms and approaches enable us to make reliable predictions that can enhance our life, productivity, or wealth. While Moore's law is coming to an end, novel computational paradigms are sought after to enable a further growth of processing power. Quantum computing has become one of the prominent candidates, and is maturing rapidly. This thesis develops and studies this novel computational paradigm in light of existing classical solutions and thereby develops a path towards quantum algorithms that can outperform classical approaches.

Contents
1 Introduction and Overview 12
  1.1 Synopsis of the thesis 15
  1.2 Summary of our contributions 16
  1.3 Statement of authorship 20

2 Notation And Mathematical Preliminaries 23
  2.1 Notation 23
  2.2 Matrix functional analysis 24

3 Statistical Learning Theory 29
  3.1 Review of key results in Learning Theory 31
    3.1.1 Supervised Learning 32
    3.1.2 Empirical risk minimization and learning rates 33
    3.1.3 Regularisation and modern approaches 37
  3.2 Review of supervised quantum machine learning algorithms 39
    3.2.1 Recap: Quantum Linear Regression and Least Squares 40
    3.2.2 Recap: Quantum Support Vector Machine 43
  3.3 Analysis of quantum machine learning algorithms 44
    3.3.1 Bound on the optimisation error 46
    3.3.2 Bounds on the sampling error 49
    3.3.3 Bounds on the condition number 50
  3.4 Analysis of supervised QML algorithms 57
  3.5 Conclusion 58

4 Randomised Numerical Linear Algebra 60
  4.1 Introduction 61
  4.2 Memory models and memory access 62
    4.2.1 The pass efficient model 62
    4.2.2 Quantum random access memory 64
    4.2.3 Quantum inspired memory structures 67
  4.3 Basic matrix multiplication 69
  4.4 Hamiltonian Simulation 74
    4.4.1 Introduction 75
    4.4.2 Related work 80
    4.4.3 Applications 85
    4.4.4 Hamiltonian Simulation for dense matrices 87
    4.4.5 Hamiltonian Simulation with the Nyström method 102
    4.4.6 Beyond the Nyström method 126
    4.4.7 Conclusion 126

5 Promising avenues for QML 130
  5.1 Generative quantum machine learning 131
    5.1.1 Related work 132
  5.2 Boltzmann and quantum Boltzmann machines 133
  5.3 Training quantum Boltzmann machines 135
    5.3.1 Variational training for restricted Hamiltonians 137
    5.3.2 Gradient based training for general Hamiltonians 143
  5.4 Conclusion 147

6 Conclusions 150

Appendices 156

A Appendix 1: Quantum Subroutines 156
  A.1 Amplitude estimation 156
  A.2 The Hadamard test 157

B Appendix 2: Deferred proofs 159
  B.1 Derivation of the variational bound 159
  B.2 Gradient estimation 161
    B.2.1 Operationalizing the gradient based training 163
  B.3 Approach 2: Divided Differences 170
    B.3.1 Operationalising 187

Bibliography 192

List of Figures
1.1 Different fields of study in Quantum Machine Learning. The different areas are related to the choice of algorithm, i.e., whether it is executed on a quantum or classical computer, and the choice of the target problem, i.e., whether it operates on quantum or classical data. 13

3.1 Summary of time complexities for training and testing of different classical and quantum algorithms when statistical guarantees are taken into account. We omit polylog(n,d) dependencies for the quantum algorithms. We assume ε = Θ(1/√n) and count the effects of measurement errors. The acronyms in the table refer to: least squares support vector machines (LS-SVM), kernel ridge regression (KRR), quantum kernel least squares (QKLS), quantum kernel linear regression (QKLR), and quantum support vector machines (QSVM). Note that for quantum algorithms the state obtained after training cannot be maintained or copied and the algorithm must be retrained after each test round. This brings a factor proportional to the training time into the test time of quantum algorithms. Because the condition number may also depend on n, and for quantum algorithms this dependency may be worse, the overall scaling of the quantum algorithms may be slower than the classical. 59

4.1 An example of the data structure that allows for efficient state preparation using a logarithmic number of conditional rotations. 66

4.2 An example of the classical (dynamic) data structure that enables efficient sample and query access for the example of H ∈ C^{2×4}. 69

4.3 Comparing our result O(t√d ‖H‖ polylog(t, d, ‖H‖, 1/ε)) with other quantum and classical algorithms for different models. Since the qRAM model is stronger than the sparse-access model and the classical sampling and query access model, we consider the advantage of our algorithm against others when they are directly applied to the qRAM model. 86

5.1 Comparison of previous training algorithms for quantum Boltzmann machines. The models have a varying cost function (objective), contain (are able to be trained with) hidden units, and have different input data (classical or quantum). 132

Chapter 1
Introduction and Overview
In the last twenty years, due to increased computational power and the availability of vast amounts of data, machine learning (ML) has seen immense success, with applications ranging from computer vision [1] to playing complex games such as the Atari series [2] or the traditional game of Go [3]. However, over the past few years, challenges have surfaced that threaten to end this revolution. The two main challenges are the increasingly overwhelming size of the available data sets, and the end of Moore's law [4]. While novel developments in hardware architectures, such as graphics processing units (GPUs) or tensor processing units (TPUs), enable orders-of-magnitude performance improvements compared to central processing units (CPUs), they cannot significantly improve the performance any longer, as they too are reaching their physical limitations. They therefore do not offer a structural solution to the challenges posed, and new solutions are required.
On the other hand, a new technology is slowly reaching maturity. Quantum computing, a form of computation that makes use of quantum-mechanical phenomena such as superposition and entanglement, has been predicted to overcome these limitations on classical hardware. Quantum algorithms, i.e., algorithms that can be executed on a quantum computer, have been investigated since the 1980s, and have recently received increasing interest all around the world.
One area that has received particular attention is quantum machine learning (see e.g. [5]), the combination of quantum mechanics and machine learning. Quantum machine learning can generally be divided into four distinct areas of research:
[Figure 1.1: a 2 × 2 grid with quadrants CC, CQ, QC, and QQ, arranged by algorithm/computer (classical or quantum) versus problem/data (classical or quantum).]
Figure 1.1: Different fields of study in Quantum Machine Learning. The different areas are related to the choice of algorithm, i.e., whether it is executed on a quantum or classical computer, and the choice of the target problem, i.e., whether it operates on quantum or classical data.
• Machine learning algorithms that are executed on classical computers (CPUs, GPUs, TPUs) and applied to classical data, the sector CC in Fig. 1.1,
• Machine learning algorithms that are executed on quantum computers (QPUs) and applied to classical data, the sector QC in Fig. 1.1,
• Machine learning algorithms that are executed on classical computers and applied to quantum data, the sector CQ in Fig. 1.1, and
• Machine learning algorithms that are executed on quantum computers (QPUs) and applied to quantum data, the sector QQ in Fig. 1.1.
The greatest attention has been paid to quantum algorithms that perform supervised machine learning on classical data. The main objective in designing such quantum algorithms is to achieve a computational advantage over any classical algorithm for the same task. This area is the main focus of this thesis, and for brevity we will refer to it simply as QML throughout. Although it has become an increasingly popular area of research, supervised QML has in recent years faced two major challenges.
Firstly, initial research was aimed at designing supervised ML algorithms that have a superior performance, where performance was measured solely in terms of the computational complexity (i.e., theoretical speed) of the algorithm with respect to the input size (the problem dimension) and the desired error (or accuracy). When we speak about the error in this context, we generally refer to the distance (for example in norm) between the solution that we obtain through the algorithm and the true solution. For example, for most algorithms we obtain guarantees that if we run them for time O(ε^{-α}), then the solution will be ε-close. Assuming that the solution is a vector x, and our algorithm produces the vector x̃, this would then imply that ‖x̃ − x‖ ≤ ε. Note that different algorithms measure this error differently, but in general we use the spectral or Frobenius norm as the measure of error throughout this thesis. However, a challenge to this view is that it is today common wisdom in the classical ML community that faster computation is not necessarily the solution to most practical problems in machine learning. Indeed, more data and the ability to extrapolate to unseen data are the most essential components for achieving good performance. Mathematically, the ability of algorithms to extrapolate to unseen data, together with other statistical properties, has been formalised in the field of statistical learning theory. The technical term for this ability is the generalisation error of an algorithm, and it is possible to obtain bounds on this error for many common machine learning algorithms. The first research question of this thesis is therefore the following:
Research Question 1 (QML under the lens of SLT). Under the common assump- tions of statistical learning theory, what is the performance of supervised quantum machine learning algorithms?
Most quantum algorithms are described in the so-called query model. Here, the overall computational complexity, i.e., the number of steps an algorithm requires to run to completion, is given in terms of the number of calls (uses) of a different algorithm, which is called the oracle. Indeed, most known quantum machine learning algorithms heavily rely on such oracles in order to achieve a 'quantum advantage'. A second challenge is therefore posed by the question of how much of the quantum advantage stems from the oracles and how much from the algorithms themselves. Concretely, most supervised QML methods assume the existence of a fast data preparation oracle or procedure, with complexity O(log(n)) in the input dimension n: a quantum equivalent of a random access memory, the qRAM [6, 7]. A closer analysis of these algorithms indeed implies that much of their power stems from this oracle, which indicates that classical algorithms can potentially achieve similar computational complexities if they are given access to such a device. The second research question of this thesis is therefore the following:
Research Question 2 (QML under the lens of RandNLA). Under the assumption of efficient sampling processes for the data for both classical and quantum algorithms, what is the comparative advantage of the latter?
The goal of this thesis is therefore to investigate the above two challenges to supervised QML, and the resulting research questions. We thereby aim to assess the advantages and disadvantages of such algorithms from the perspective of statistical learning theory and randomised numerical linear algebra.
1.1 Synopsis of the thesis

The thesis is structured as follows. We begin in Chapter 2 with the notation and mathematical preliminaries of the thesis. In Chapter 3, we discuss the first challenge and the associated Research Question 1, namely the ability of quantum algorithms to generalise to data that is not present in the training set. Limitations arise through the fundamental requirement of the quantum measurement, and the error-dependency of most supervised quantum algorithms. We additionally discuss the condition number, an additional dependency in many supervised QML algorithms. In Chapter 4 we discuss the second challenge, i.e., Research Question 2 regarding input models. For this we first discuss how to access data with a quantum computer and how these access models compare to classical approaches, in particular to randomised classical algorithms. This also includes a brief discussion about the feasibility of such memory models as well as their potential advantages and disadvantages. We then propose a new algorithm for Hamiltonian simulation [8], a common subroutine of many supervised QML algorithms. Next, we propose a randomised classical algorithm for the same purpose, and then use this to discuss and compare the requirements of both algorithms, and how they relate to each other. Based on these insights, we next link our results to subsequent 'dequantization' results. This allows us to finally also make claims regarding the limitations of QML algorithms that are based on such memory models (or more generally such oracles). In Chapter 5, we argue that generative quantum machine learning algorithms that are able to generate quantum data (QQ in Fig. 1.1) are one of the most promising approaches for future QML research, and we propose a novel algorithm for the training of Quantum Boltzmann Machines (QBMs) [9] with visible and hidden nodes. We note that quantum data is here defined as data that is generated by an arbitrarily-sized quantum circuit. We derive error bounds and computational complexities for the training of QBMs based on two different training algorithms, which vary based on the model assumptions. Finally, in Chapter 6, we summarise our insights and discuss the main results of the thesis, ending with a proposal for further research.
1.2 Summary of our contributions

Our responses to the above-posed research questions are summarised below. For Research Question 1, we obtain the following results:
• Taking into account optimal learning rates and the generalisation ability, we show that supervised quantum machine learning algorithms fail to achieve exponential speedups.
• Our analysis is based on the polynomial error-dependency and the repeated measurements that are required to obtain a classical output. We require that the (optimisation) error of the algorithm matches the scaling of the statistical (estimation) error that is inherent to all data-based algorithms and is of O(1/√n), for n being the number of samples in the data set. We observe that the performance of the algorithms in question can in practice be worse than that of the fastest classical solution.
• Such concerns are important for the design of quantum algorithms but are not relevant for classical ones. The computational complexity of the latter typically scales poly-logarithmically in the error ε and therefore only introduces additional O(log(n)) terms. We acknowledge, however, that this is not generally true and that it is possible to trade off speed against accuracy, as is done, for example, in early stopping. Here, one accepts a lower accuracy in order to obtain an algorithm that converges faster to the solution with the best statistical error.
• One additional challenge for many quantum algorithms is the problem of preconditioning, which most state-of-the-art classical algorithms such as FALCON [10] do take into account.
• While previous results [11] claimed that preconditioning is possible in the quantum case, this turns out not to be true in general, as, e.g., Harrow and La Placa [12] showed. Our bounds on the condition number indicate that ill-conditioned QML algorithms are even more prone to be outperformed by classical counterparts.
For Research Question 2, we obtain the following results:
• We first show that data access oracles, such as quantum random access memory [6], can be used to construct faster quantum algorithms for dense Hamiltonian simulation, and obtain an algorithm whose performance (runtime) depends on the square root of the dimension n of the input Hamiltonian H ∈ C^{n×n} and linearly on the spectral norm, i.e., O(√n ‖H‖). For an s-sparse Hamiltonian, this reduces to time O(√s ‖H‖), where s is the maximum number of non-zero elements in the rows of H.
• We next show how to derive a classical algorithm for the same task, based on a classical Monte-Carlo sampling method called the Nyström method. We show that for a sparse Hamiltonian H ∈ C^{n×n}, there exists an algorithm that, with probability 1 − δ, approximates any chosen amplitude of the state e^{iHt} ψ, for an efficiently describable state ψ, in time

O( (sq + t^9 ‖H‖^4 ‖H‖_F^7 / ε^4) log(n) + log^2(1/δ) ),

where ε determines the quality of the approximation, ‖·‖_F is the Frobenius norm, and q is the number of non-zero elements in ψ.
• While our algorithm is not generally efficient due to the dependency on the Frobenius norm, it still removes the explicit dependency on the system dimension n, up to logarithmic factors. We therefore obtain an algorithm that only depends on the rank, sparsity, and spectral norm of the problem, since ‖H‖_F ≤ √r ‖H‖, for r being the rank of H. For low-rank matrices, we therefore obtain a potentially much faster algorithm, and for the case q = s = r = O(polylog(n)) our algorithm becomes efficiently executable. Note that we here generally assume that n grows exponentially in the number of qubits in the system, i.e., we only obtain an efficient algorithm if we do not have an explicit n dependency.
• This result is interesting in two different ways. Firstly, it is the first known result that applies so-called sub-sampling methods to simulating (general) Hamiltonians on a classical computer. Secondly, and more importantly in the context of this thesis, our result indicates that the advantage of so-called quantum machine learning algorithms may not be as big as promised. QML algorithms such as quantum principal component analysis [13] or quantum support vector machines [14] were claimed to be efficient for sparse or low-rank input data, and our classical algorithm for Hamiltonian simulation is efficient under similar conditions. As a corollary, we indeed show that we can efficiently simulate exp(iρt) for a density matrix ρ, if we can efficiently sample from the rows and columns of ρ according to a certain probability distribution.
• While we did not manage to extend our results to quantum machine learning algorithms, shortly after we posted our result to the arXiv, Ewin Tang used similar methods to 'dequantize' [15] the quantum recommendation systems algorithm by Kerenidis and Prakash [16]. While our algorithm requires the input matrix to be efficiently row-computable, Tang designed a classical memory model that allows us to sample the rows or columns of the input matrix according to their norms. Under closer inspection, these requirements are fundamentally equivalent. Furthermore, the ability to sample efficiently from this distribution is indeed similar to the ability of a quantum random access memory to prepare quantum states of the rows and columns of an input matrix. Dequantization then refers to a classical algorithm that can perform the same task as a chosen quantum algorithm with an at most polynomial slowdown. In the subsequent few months, many other algorithms were published that achieved similar results for many other quantum machine learning algorithms, including quantum PCA [17], the Quantum Linear Systems Algorithm [18, 19], and Quantum Semi-Definite Programming [20]. Most of these algorithms were unified recently in the framework of quantum singular value transformations [21].
A conclusion from the above results and the answers to the research questions is that the hope for exponential advantages of quantum algorithms for machine learning over their classical counterparts might be misplaced. While polynomial advantages may still be feasible, and most results currently indicate a gap, understanding the limitations of classical and quantum algorithms is still an open research question. As many of the algorithms we investigate in this thesis are of a theoretical nature, we want to mention that the ultimate performance test for any algorithm is a benchmark, and only such benchmarks will give the final answers to questions of real advantage. However, this will only be possible once sufficiently large and accurate quantum computers are available, and it will therefore not be possible in the near future. As our and subsequent results indicate, quantum machine learning algorithms for classical data to date appear to allow for at most polynomial speedups, if any.
We therefore turn to an area where classical machine learning algorithms might generally be inefficient: modelling of quantum distributions, and therefore quantum algorithms for quantum data. We propose a method for fully quantum generative training of quantum Boltzmann machines that, in contrast to prior art, have both visible and hidden units. We base our training on the quantum relative entropy objective function and find efficient algorithms for training based on gradient estimations, under the assumption that Gibbs state preparation for the model Hamiltonian is efficient. One interesting feature of these results is that such generative models can in principle also be used for state preparation, and might therefore be a useful tool to overcome qRAM-related issues.
1.3 Statement of authorship
This thesis is based on the following research articles. Notably, for many articles the order of authors is alphabetical, which is standard in theoretical computer science.
1. Danial Dervovic, Mark Herbster, Peter Mountney, Simone Severini, Naïri Usher, and Leonard Wossnig. Quantum linear systems algorithms: a primer. arXiv preprint arXiv:1802.08227, 2018
2. Carlo Ciliberto, Mark Herbster, Alessandro Davide Ialongo, Massimiliano Pontil, Andrea Rocchetto, Simone Severini, and Leonard Wossnig. Quantum machine learning: a classical perspective. Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences, 474(2209):20170551, 2018
3. Chunhao Wang and Leonard Wossnig. A quantum algorithm for simulating non-sparse hamiltonians. Quantum Information & Computation, 20(7&8):597–615, 2020

4. Alessandro Rudi, Leonard Wossnig, Carlo Ciliberto, Andrea Rocchetto, Massimiliano Pontil, and Simone Severini. Approximating hamiltonian dynamics with the Nyström method. Quantum, 4:234, 2020

5. Carlo Ciliberto, Andrea Rocchetto, Alessandro Rudi, and Leonard Wossnig. Fast quantum learning with statistical guarantees. arXiv preprint arXiv:2001.10477, 2020

6. Nathan Wiebe and Leonard Wossnig. Generative training of quantum boltzmann machines with hidden units. arXiv preprint arXiv:1905.09902, 2019
I have additionally co-authored the following articles that are not included in this thesis:
1. Hongxiang Chen, Leonard Wossnig, Simone Severini, Hartmut Neven, and Masoud Mohseni. Universal discriminative quantum neural networks. arXiv preprint arXiv:1805.08654, 2018
2. Marcello Benedetti, Edward Grant, Leonard Wossnig, and Simone Severini. Adversarial quantum circuit learning for pure state approximation. New Journal of Physics, 21(4):043023, 2019
3. Edward Grant, Leonard Wossnig, Mateusz Ostaszewski, and Marcello Benedetti. An initialization strategy for addressing barren plateaus in parametrized quantum circuits. Quantum, 3:214, 2019
4. Shuxiang Cao, Leonard Wossnig, Brian Vlastakis, Peter Leek, and Edward Grant. Cost-function embedding and dataset encoding for machine learning with parametrized quantum circuits. Physical Review A, 101(5):052309, 2020
5. I Rungger, N Fitzpatrick, H Chen, CH Alderete, H Apel, A Cowtan, A Patterson, D Munoz Ramo, Y Zhu, NH Nguyen, et al. Dynamical mean field theory algorithm and experiment on quantum computers. arXiv preprint arXiv:1910.04735, 2019

6. Andrew Patterson, Hongxiang Chen, Leonard Wossnig, Simone Severini, Dan Browne, and Ivan Rungger. Quantum state discrimination using noisy quantum neural networks. arXiv preprint arXiv:1911.00352, 2019

7. Jules Tilly, Glenn Jones, Hongxiang Chen, Leonard Wossnig, and Edward Grant. Computation of molecular excited states on ibmq using a discriminative variational quantum eigensolver. arXiv preprint arXiv:2001.04941, 2020

Chapter 2
Notation And Mathematical Preliminaries
2.1 Notation
We denote vectors with lower-case letters. For a vector x ∈ C^n, let x_i denote the i-th element of x. A vector is sparse if most of its entries are 0. For an integer k, let [k] denote the set {1, ..., k}.
For a matrix A ∈ C^{m×n}, let A^j := A_{:,j}, j ∈ [n], denote the j-th column vector of A, A_i := A_{i,:}, i ∈ [m], the i-th row vector of A, and A_{ij} := A(i, j) the (i, j)-th element. We denote by A_{i:j} the sub-matrix of A that contains the rows from i to j. The supremum is denoted as sup and the infimum as inf. For a measure space (X, Σ, µ) and a measurable function f, the set of essential upper bounds of f is defined as U_f^{ess} := {l ∈ R : µ(f^{-1}(l, ∞)) = 0}, i.e., l is an essential upper bound if the measurable set f^{-1}(l, ∞) has measure zero, that is, if f(x) ≤ l for almost all x ∈ X. The essential supremum is then defined as ess sup f := inf U_f^{ess}. We let the span of a set S = {v_i}_{i=1}^k ⊆ C^n be defined by span{S} := {x ∈ C^n | ∃ {α_i}_{i=1}^k ⊆ C with x = ∑_{i=1}^k α_i v_i}. The set is linearly independent if ∑_i α_i v_i = 0 holds only when α_i = 0 for all i. The range of A ∈ C^{m×n} is defined by range(A) = {y ∈ C^m : y = Ax for some x ∈ C^n} = span(A^1, ..., A^n); equivalently, the range of A is the set of all linear combinations of the columns of A. The nullspace null(A) (or kernel ker(A)) is the set of vectors that A maps to zero, null(A) = {x ∈ C^n : Ax = 0}.
The rank of a matrix A ∈ C^{m×n}, rank(A), is the dimension of range(A) and is equal to the number of linearly independent columns of A. Since this is equal to rank(A^H), A^H being the complex conjugate transpose of A, it also equals the number of linearly independent rows of A, and it satisfies rank(A) ≤ min{m, n}. The trace of a matrix is the sum of its diagonal elements, Tr(A) = ∑_i A_{ii}. The support of a vector, supp(v), is the set of indices i such that v_i ≠ 0, and we call its size the sparsity of the vector. For a matrix we denote by sparsity the number of non-zero entries, while row or column sparsity refers to the number of non-zero entries per row or column. A symmetric matrix A is positive semidefinite (PSD) if all its eigenvalues are non-negative. For a PSD matrix A we write A ⪰ 0. Similarly, A ⪰ B is the partial ordering which is equivalent to A − B ⪰ 0.
We use the following standard norms: the Frobenius norm ‖A‖_F = ( ∑_{i=1}^m ∑_{j=1}^n A_{ij}^* A_{ij} )^{1/2}, and the spectral norm ‖A‖ = sup_{x ∈ C^n, x ≠ 0} |Ax| / |x|. Note that ‖A‖_F^2 = Tr(A^H A) = Tr(A A^H). Both norms are submultiplicative and unitarily invariant, and they are related to each other as ‖A‖ ≤ ‖A‖_F ≤ √n ‖A‖. The singular value decomposition of A is A = UΣV^H, where U, V are unitary matrices and U^H denotes the complex conjugate transpose, also called the Hermitian conjugate, of U. Throughout the thesis we use x^H or A^H for the Hermitian conjugate, as well as for the transpose of real matrices and vectors. We denote the pseudo-inverse of a matrix A with singular value decomposition A = UΣV^H as A^+ := VΣ^+U^H.
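As a quick numerical illustration of this notation (a minimal NumPy sketch on a random complex matrix, included only as an example), the norm relation ‖A‖ ≤ ‖A‖_F ≤ √n ‖A‖, the identity ‖A‖_F^2 = Tr(A^H A), and the pseudo-inverse A^+ = VΣ^+U^H can all be checked directly:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3)) + 1j * rng.standard_normal((4, 3))

# Frobenius and spectral norms, and the relation ||A|| <= ||A||_F <= sqrt(n)||A||.
fro = np.linalg.norm(A, 'fro')
spec = np.linalg.norm(A, 2)
assert spec <= fro <= np.sqrt(A.shape[1]) * spec + 1e-12

# ||A||_F^2 = Tr(A^H A).
assert np.isclose(fro**2, np.trace(A.conj().T @ A).real)

# Rank = number of non-zero singular values; pseudo-inverse A+ = V Sigma^+ U^H.
U, S, Vh = np.linalg.svd(A, full_matrices=False)
assert np.linalg.matrix_rank(A) == np.sum(S > 1e-12)
A_pinv = Vh.conj().T @ np.diag(1.0 / S) @ U.conj().T
assert np.allclose(A_pinv, np.linalg.pinv(A))
```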
2.2 Matrix functional analysis
While computing the gradient of the average log-likelihood is a straightforward task when training ordinary Boltzmann machines, finding the gradient of the quantum relative entropy is much harder. The reason for this is that in general
[∂_θ H(θ), H(θ)] ≠ 0. This means that the ordinary rules that are commonly used in calculus for finding the derivative no longer hold. One important example that we will use repeatedly is Duhamel's formula:
∂_θ e^{H(θ)} = ∫_0^1 ds e^{H(θ)s} ∂_θ H(θ) e^{H(θ)(1−s)}.   (2.1)
This formula can be easily proven by expanding the operator exponential in a Trotter-Suzuki expansion with r time-slices, differentiating the result and then taking the limit as r → ∞. However, the relative complexity of this expression compared to what would be expected from the product rule serves as an important reminder that computing the gradient is not a trivial exercise. A similar formula also exists for the logarithm, as shown further below. Similarly, because we are working with functions of matrices here, we also need to work with a notion of monotonicity. We will see that for some of our approximations to hold we will also need a notion of concavity (in order to use Jensen's inequality). These notions are defined below.
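Before turning to these definitions, Duhamel's formula (2.1) can also be checked numerically. The following sketch (assuming SciPy is available, and approximating the integral by a simple Riemann sum) compares the right-hand side of (2.1) with a finite-difference derivative, and shows that the naive product rule fails:

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(1)


def random_hermitian(n):
    M = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
    return (M + M.conj().T) / 2


H0, V = random_hermitian(4), random_hermitian(4)
H = lambda theta: H0 + theta * V          # dH/dtheta = V, and [V, H] != 0 in general
theta = 0.3

# Right-hand side of Duhamel's formula (2.1), via a Riemann sum over s in [0, 1].
s_grid = np.linspace(0, 1, 2001)
integrand = np.array([expm(H(theta) * s) @ V @ expm(H(theta) * (1 - s)) for s in s_grid])
duhamel = np.trapz(integrand, s_grid, axis=0)

# Left-hand side via a central finite difference of exp(H(theta)).
eps = 1e-5
fd = (expm(H(theta + eps)) - expm(H(theta - eps))) / (2 * eps)

print(np.linalg.norm(duhamel - fd))              # small (~1e-6): Duhamel matches
print(np.linalg.norm(V @ expm(H(theta)) - fd))   # order 1: the naive product rule does not
```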
Definition 1 (Operator monotonicity). A function f is operator monotone with respect to the semidefinite order if 0 ⪯ A ⪯ B, for two symmetric positive definite operators, implies f(A) ⪯ f(B). A function is operator concave w.r.t. the semidefinite order if c f(A) + (1 − c) f(B) ⪯ f(cA + (1 − c)B), for all positive definite A, B and c ∈ [0, 1].
We now derive or review some preliminary equations that we will need in order to obtain a useful bound on the gradients in the main work.
Claim 1. Let A(θ) be a linear operator which depends on the parameters θ. Then

∂/∂θ A(θ)^{-1} = −A^{-1} (∂A/∂θ) A^{-1}.   (2.2)
Proof. The proof follows straightforwardly by differentiating the identity AA^{-1} = I:

∂I/∂θ = 0 = ∂/∂θ (AA^{-1}) = (∂A/∂θ) A^{-1} + A (∂A^{-1}/∂θ).
Reordering the terms completes the proof. This can equally be proven using the Gâteaux derivative.
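A short numerical check of (2.2), for an illustrative parameter-dependent matrix A(θ) = A_0 + θB (an assumption made only for this example), reads:

```python
import numpy as np

rng = np.random.default_rng(2)
A0 = rng.standard_normal((4, 4)) + 4 * np.eye(4)   # well-conditioned base matrix
B = rng.standard_normal((4, 4))
A = lambda t: A0 + t * B                           # dA/dtheta = B

theta, eps = 0.1, 1e-6
inv = np.linalg.inv
fd = (inv(A(theta + eps)) - inv(A(theta - eps))) / (2 * eps)   # finite difference
formula = -inv(A(theta)) @ B @ inv(A(theta))                   # right-hand side of (2.2)
print(np.linalg.norm(fd - formula))   # ~1e-9
```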
In the following we will furthermore rely on this well-known inequality:
Lemma 1 (Von Neumann Trace Inequality). Let A ∈ C^{n×n} and B ∈ C^{n×n} with singular values {σ_i(A)}_{i=1}^n and {σ_i(B)}_{i=1}^n respectively, such that σ_i(·) ≤ σ_j(·) if i ≤ j. It then holds that

|Tr(AB)| ≤ ∑_{i=1}^n σ_i(A) σ_i(B).   (2.3)
Note that from this we immediately obtain

|Tr(AB)| ≤ ∑_{i=1}^n σ_i(A) σ_i(B) ≤ σ_max(B) ∑_i σ_i(A) = ‖B‖ ∑_i σ_i(A).   (2.4)
This is particularly useful if A is Hermitian and PSD, since then ∑_i σ_i(A) = Tr(A), which implies |Tr(AB)| ≤ ‖B‖ Tr(A).
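Both (2.3) and (2.4) are easy to verify numerically on random matrices; the sketch below is only an illustration of the statements above:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5
A = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
B = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))

sA = np.linalg.svd(A, compute_uv=False)   # singular values, sorted consistently
sB = np.linalg.svd(B, compute_uv=False)

lhs = abs(np.trace(A @ B))
assert lhs <= np.sum(sA * sB) + 1e-10          # Eq. (2.3)
assert lhs <= sB[0] * np.sum(sA) + 1e-10       # Eq. (2.4), since sigma_max(B) = ||B||

# For a Hermitian PSD matrix, the sum of singular values equals the trace.
P = A.conj().T @ A
assert abs(np.trace(P @ B)) <= sB[0] * np.trace(P).real + 1e-10
```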
Since we are dealing with operators, the common chain rule of differentiation does not hold in general. Indeed, the chain rule applies only in the special case where the derivative of the operator commutes with the operator itself. Since we encounter a term of the form log σ(θ), we cannot assume that [σ, σ′] = 0, where σ′ := σ^{(1)} is the derivative w.r.t. θ. For this case we need the following identity, similar to Duhamel's formula in the derivation of the gradient for the purely-visible-units Boltzmann machine.
Lemma 2 (Derivative of the matrix logarithm [34]).

d/dt log A(t) = ∫_0^1 ds [sA + (1 − s)I]^{-1} (dA/dt) [sA + (1 − s)I]^{-1}.   (2.5)
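Before giving the proof, the identity (2.5) can be sanity-checked numerically; the sketch below (assuming SciPy's logm, a positive definite A(t), and a Riemann sum for the integral) compares both sides:

```python
import numpy as np
from scipy.linalg import logm

rng = np.random.default_rng(4)
n = 4
B = rng.standard_normal((n, n)); B = (B + B.T) / 2
A0 = 3 * np.eye(n) + 0.1 * B                      # positive definite, no negative real eigenvalues
A = lambda t: A0 + t * B                          # dA/dt = B
t, eps = 0.2, 1e-5
I = np.eye(n)

# Right-hand side of Eq. (2.5), integrated with a Riemann sum over s.
s_grid = np.linspace(0, 1, 2001)
vals = np.array([np.linalg.inv(s * A(t) + (1 - s) * I) @ B @ np.linalg.inv(s * A(t) + (1 - s) * I)
                 for s in s_grid])
rhs = np.trapz(vals, s_grid, axis=0)

# Left-hand side via a central finite difference of the matrix logarithm.
lhs = (logm(A(t + eps)) - logm(A(t - eps))) / (2 * eps)
print(np.linalg.norm(lhs - rhs))   # small (~1e-6)
```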
For completeness we here include a proof of the above identity.
Proof. We use the integral representation of the logarithm [35] for a complex, invertible, n × n matrix A = A(t) with no real negative eigenvalues,

log A = (A − I) ∫_0^1 ds [s(A − I) + I]^{-1}.   (2.6)
From this we obtain the derivative

d/dt log A = (dA/dt) ∫_0^1 ds [s(A − I) + I]^{-1} + (A − I) ∫_0^1 ds d/dt [s(A − I) + I]^{-1}.
Applying (2.2) to the second term on the right hand side yields

d/dt log A = (dA/dt) ∫_0^1 ds [s(A − I) + I]^{-1} − (A − I) ∫_0^1 ds [s(A − I) + I]^{-1} s (dA/dt) [s(A − I) + I]^{-1},

which can be rewritten as

d/dt log A = ∫_0^1 ds [s(A − I) + I][s(A − I) + I]^{-1} (dA/dt) [s(A − I) + I]^{-1}   (2.7)
− (A − I) ∫_0^1 ds [s(A − I) + I]^{-1} s (dA/dt) [s(A − I) + I]^{-1},   (2.8)

by inserting the identity I = [s(A − I) + I][s(A − I) + I]^{-1} in the first integral and reordering commuting terms (i.e., s and (A − I)). Notice that we can hence combine the two integrals: subtracting s(A − I) from [s(A − I) + I] inside the integrand leaves exactly the identity, which yields (2.5) as desired.

Chapter 3
Statistical Learning Theory
This chapter applies insights from statistical learning theory to answer the following question:
Research Question 1 (QML under the lens of SLT). Under the common assump- tions of statistical learning theory, what is the performance of supervised quantum machine learning algorithms?
The main idea in this chapter is to leverage the framework of statistical learning theory to understand how the minimum number of samples required by a learner to reach a target generalisation accuracy influences the overall performance of quantum algorithms. By taking into account well known bounds on this accuracy, we can show that quantum machine learning algorithms for supervised machine learning are unable to achieve polylogarithmic runtimes in the input dimension. Notably, the results presented here hold only for supervised quantum machine learning algorithms for which statistical guarantees are available. Our results show that, without further assumptions on the problem, known quantum machine learning algorithms for supervised learning achieve only moderate polynomial speedups over efficient classical algorithms, if any. We note that the quantum machine learning algorithms that we analyse here are all based on fast quantum linear algebra subroutines [36, 37]. These in particular include quantum support vector machines [14], quantum linear regression, and quantum least squares [38, 39, 40, 41, 37, 42].
Notably, the origin of the ‘slow down’ of quantum algorithms under the above consideration is twofold.
First, most of the known quantum machine learning algorithms have at least an inversely linear scaling in the optimisation or approximation error of the algorithm, i.e., they require a computational time O(ε^{-α}) (for some real positive α) to return a solution x̃ which is ε-close in some norm to the true solution x of the problem. For example, in the case of the linear systems algorithm, we obtain a quantum state |x̃⟩ such that ‖|x̃⟩ − |x⟩‖ ≤ ε, where |x⟩ is the state that encodes the exact solution to the linear system, |x⟩ = |A^{-1}b⟩. This is in stark contrast to classical algorithms, which typically have a logarithmic dependency on the error, i.e., running the algorithm for time O(log(1/ε)) yields a solution with error ε.
Second, a crucial bottleneck in many quantum algorithms is the requirement to sample at the end of the computation. This generally introduces another error, since we need to repeatedly measure the resulting quantum state in order to obtain the underlying classical result.
We note that we mainly leverage these two sources of error in the following analysis, but the extension to further include noise in the computation is straightforward. Indeed, noise in the computation (a critical issue in the current generation of quantum computers) could immediately be taken into account by simply adding a linear factor in the error decomposition that we will encounter in Eq. 3.22. This will be a further limiting factor for near-term devices, as such errors need to be suppressed sufficiently in order to obtain good general bounds.
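The cost of the repeated measurement can be made concrete with a purely classical toy calculation: estimating a measurement probability p to additive error ε requires on the order of 1/ε² repetitions (a Hoeffding-type argument). The snippet below only illustrates this scaling and is not a simulation of any particular quantum algorithm:

```python
import numpy as np

rng = np.random.default_rng(5)
p = 0.3                                   # "true" measurement probability to be estimated
for eps in [0.1, 0.03, 0.01]:
    shots = int(np.ceil(1 / eps**2))      # shot count scaling ~ 1/eps^2
    estimates = rng.binomial(shots, p, size=1000) / shots
    # The empirical standard deviation of the estimator scales like eps.
    print(eps, shots, round(estimates.std(), 4))
```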
While previous research has already identified a number of caveats such as the data access [43] or restrictive structural properties of the input data, which limit the practicality of these algorithms, our insights are entirely based on statistical analysis.
As we will discuss in more detail in Chapter 4, under the assumption that we can efficiently sample rows and columns of the input matrix (i.e., the input data) according to a certain distribution, classical algorithms can be shown to be nearly as efficient as these quantum machine learning algorithms [17]. We note that the quantum machine learning algorithms typically still retain a polynomial advantage in scaling compared to the classical ones.
This chapter is organised as follows. First, in Section 3.1, we will review existing results from statistical learning theory in order to allow the reader to follow the subsequent argument. Section 3.3.1 takes into account the error that is introduced by the algorithm itself. In Section 3.3.2, we then use this insight to bound the error that is induced through the sampling process in quantum mechanics. An additional dependency is typically given by the condition number of the problem, and we hence derive additional bounds for it in Section 3.3.3. Finally, in Section 3.4, we accumulate these insights and use them to analyse a range of existing supervised quantum machine learning algorithms.
3.1 Review of key results in Learning Theory
Statistical learning theory aims to statistically quantify the resources that are required to solve a learning problem [44]. Although multiple types of learning settings exist, depending on the access to data and the associated error of a prediction, here we primarily focus on supervised learning. In supervised learning, the goal is to find a function that fits a set of input-output training examples and, more importantly, also guarantees a good fit on data points that were not used during the training process. The ability to extrapolate to data points that were previously not observed is also known as the generalisation ability of the model. This is indeed the major difference between machine learning and standard optimisation processes. Although this problem can be cast into the framework of optimisation (by optimising a certain problem instance), setting up this instance to achieve a maximum generalisation performance is indeed one of the main objectives of machine learning.

After the review of the classical part, i.e., the important points of consideration for any learning algorithm and the assumptions taken, we also analyse existing quantum algorithms and their computational speedups within the scope of statistical learning theory.

3.1.1 Supervised Learning

We now set the stage for the analysis by defining the framework of supervised learning. Let X and Y be probability spaces with distribution ρ on X × Y from which we sample data points x ∈ X and the corresponding labels y ∈ Y. We refer to X and Y as the input set and output set respectively. Let ℓ : Y × Y → R be the loss function measuring the discrepancy between any two points in the output space, which is a point-wise error measure. There exists a wide range of suitable loss functions, and choosing the appropriate one in practice is of great importance. Typical error functions are the least-squares error ℓ_sq(f(x), y) := (f(x) − y)^2 over Y = R for regression (generally for dense Y), or the 0−1 loss ℓ_{0−1}(f(x), y) := δ_{f(x),y} over Y = {−1, 1} for classification (generally for discrete Y). The least squares loss will also be used frequently throughout this chapter. For any hypothesis space H of measurable functions f : X → Y, f ∈ H, the goal of supervised learning is then to minimise the expected risk or expected error E(f) := E_ρ[ℓ(y, f(x))], i.e.,
inf_{f ∈ H} E(f),   E(f) = ∫_{X×Y} ℓ(f(x), y) dρ(x, y).   (3.1)
We hence want to minimise the expected prediction error for a hypothesis f : X → Y, which is the average error with respect to the probability distribution ρ. The space of all functions for which the expected risk is well defined is called the target space and is typically denoted by F; if the loss function is measurable, this is the space of all measurable functions. For many loss functions we cannot in practice achieve the infimum; however, it is still possible to derive a minimiser. In order to be able to efficiently find a solution to Eq. 3.1, rather than searching over the entirety of F, we restrict the search to a smaller hypothesis space H, which can still be infinite. In summary, a learning problem is defined by the following three components:
1. A probability space X ×Y with a Borel probability measure ρ.
2. A measurable loss function ℓ : Y × Y → [0, ∞).
3. A hypothesis space H from which we choose our hypothesis f ∈ H.
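As a toy instantiation of these three components (an illustrative sketch only, with a hypothetical linear data-generating model, Gaussian inputs, and Gaussian label noise), consider linear hypotheses with the least squares loss:

```python
import numpy as np

rng = np.random.default_rng(6)
d, n = 3, 500
w_true = rng.standard_normal(d)

# Data space X x Y: x ~ N(0, I_d), y = w_true^T x + Gaussian label noise.
X = rng.standard_normal((n, d))
y = X @ w_true + 0.1 * rng.standard_normal(n)

# Hypothesis space: linear functions f(x) = w^T x; loss: least squares.
def empirical_risk(w, X, y):
    return np.mean((X @ w - y) ** 2)

print("empirical risk of the zero hypothesis:", empirical_risk(np.zeros(d), X, y))
print("empirical risk of the true weights:   ", empirical_risk(w_true, X, y))  # ~ noise variance 0.01
```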
The data or input space X can be a vector space (linear space) such as X = R^d, d ∈ N, or a structured space, and the output space Y can also take a variety of forms, as we mentioned above. In practice, the underlying distribution ρ is unknown and we can only access it through a finite number of observations. This finite set of samples, S_n = {(x_i, y_i)}_{i=1}^n, x_i ∈ X, y_i ∈ Y, is called the training set, and we generally assume that the samples are drawn identically and independently according to ρ. This holds in many practical cases, but it should be noted that it is not generally true. Indeed, for time series, for example, subsequent samples are typically highly correlated, and the following analysis hence does not immediately apply. The assumption of independence can however be relaxed to the assumption that the data points depend only weakly on each other, via so-called mixing conditions. Under such assumptions, most of the following results for the independent case still hold with only slight adaptations. The results throughout this chapter rely on the following assumptions.
Assumption 1. The probability distribution on the data space X × Y can be factorized into a marginal distribution ρ_X on X and a conditional distribution ρ(·|x) on Y.
We remark that the probability distribution ρ can account for a wide range of uncertainty in the data. The results therefore hold for a range of noise types and for partial information.
Assumption 2. The probability distribution ρ is known only through a finite set of n samples S_n = {(x_i, y_i)}_{i=1}^n, x_i ∈ X, y_i ∈ Y, which are sampled i.i.d. according to the Borel probability measure ρ on the data space X × Y.
3.1.2 Empirical risk minimization and learning rates
Under the above assumptions, the goal of a learning algorithm is then to choose a suitable hypothesis f_n : X → Y, f_n ∈ H, as a proxy for the minimiser of the expected risk, based on the data set S_n. Empirical Risk Minimization (ERM) approaches this problem by choosing a hypothesis that minimises the empirical risk,
inf_{f ∈ H} E_n(f),   E_n(f) := (1/n) ∑_{(x_i, y_i) ∈ S_n} ℓ(f(x_i), y_i),   (Empirical Risk) (3.2)
given the i.i.d. drawn data points {(x_i, y_i)}_{i=1}^n ∼ ρ^n. Note that

E_{{(x_i, y_i)}_{i=1}^n ∼ ρ^n}[E_n(f)] = (1/n) ∑_{i=1}^n E_{(x_i, y_i) ∼ ρ}[ℓ(f(x_i), y_i)] = E(f),

i.e., the expectation of the empirical risk is the expected risk, which implies that we can indeed use the empirical risk as a proxy for the expected risk in expectation.
While we would like to use the empirical risk as a proxy for the true risk, we also need to ensure that, by minimising the empirical risk, we actually find a valid solution to the underlying problem, i.e., that the f_n we find approaches the true solution f_* which minimises the expected risk. This requirement is termed consistency. To define this mathematically, let
f_n := argmin_{f ∈ H} E_n(f)   (3.3)

be the minimiser of the empirical risk, which exists under weak assumptions on H. Then, the overall goal of a learning algorithm is to minimise the excess risk,
E(f_n) − inf_{f ∈ H} E(f),   (Excess risk) (3.4)

while ensuring that f_n is consistent for a particular distribution ρ, i.e., that
lim_{n→∞} [ E(f_n) − inf_{f ∈ H} E(f) ] = 0.   (Consistency) (3.5)
Since S_n is a randomly sampled subset, we can analyse this behaviour in expectation, i.e.,

lim_{n→∞} E[ E(f_n) − inf_{f ∈ H} E(f) ] = 0,   (3.6)

or in probability, i.e.,
lim_{n→∞} Pr_{ρ^n}[ E(f_n) − inf_{f ∈ H} E(f) > ε ] = 0   (3.7)

for all ε > 0. If the above requirement holds for all distributions ρ on the data space, then we say that the algorithm is universally consistent. However, consistency is not enough in practice, since the convergence of the risk of the empirical risk minimiser to the minimal risk could be impracticably slow. One of the most important questions in a learning setting is therefore how fast this convergence happens, which is captured by the so-called learning rate, i.e., the rate of decay of the excess risk. We assume that this decays polynomially with respect to n, as for example
E[ E(f_n) − inf_{f ∈ F} E(f) ] = O(n^{−α}).   (3.8)
This speed of course has practical relevance and hence allows us to compare different algorithms. The sample complexity n must depend on the error we want to achieve, and in practice we can therefore define it as follows: for a distribution ρ and all δ, ε > 0, there exists an n(δ, ε) such that

Pr_{ρ^n}[ E(f_{n(ε,δ)}) − inf_{f ∈ F} E(f) ≤ ε ] ≥ 1 − δ.   (3.9)
This n(δ, ε) is called the sample complexity. One challenge of studying our algorithm's performance through the excess risk is that we can generally not evaluate it directly. We therefore need to find a way to estimate or bound the excess risk solely based on the hypothesis f_n and the empirical risk E_n. Let us for this assume the existence of a minimiser
f_* := argmin_{f ∈ H} E(f)   (3.10)

over a suitable hypothesis space H.
We can then decompose the excess risk as
E ( fn) − E ( f∗) = E ( fn) − En( fn) + En( fn) − En( f∗) + En( f∗) − E ( f∗). (3.11)
Observing that E_n(f_n) − E_n(f_*) ≤ 0, we immediately see that we can bound this quantity as
E(f_n) − E(f_*) = E(f_n) − inf_{f ∈ H} E(f) ≤ 2 sup_{f ∈ H} |E(f) − E_n(f)|,   (3.12)

which implies that we can bound the excess risk in terms of the so-called generalisation error E(f) − E_n(f). We therefore see that we can study the convergence of the excess risk in terms of the generalisation error. Controlling the generalisation error is one of the main objectives of statistical learning theory. A fundamental result, which is often referred to as the fundamental theorem of statistical learning, is the following:
Theorem 1 (Fundamental Theorem of Statistical Learning Theory [45, 46, 44]). Let H be a suitably chosen hypothesis space of functions f : X → Y, let X × Y be a probability space with a Borel probability measure ρ and a measurable loss function ℓ : Y × Y → [0, ∞), and let the risk and the empirical risk be defined as in Eq. 3.1 and Eq. 3.2 respectively. Then, for every n ∈ N, δ ∈ (0, 1), and distribution ρ, with probability 1 − δ it holds that

sup_{f ∈ H} |E_n(f) − E(f)| ≤ Θ( sqrt( (c(H) + log(1/δ)) / n ) ),   (3.13)

where c(H) is a measure of the complexity of H (such as the VC dimension, covering numbers, or the Rademacher complexity [47, 44]).
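The O(1/√n) behaviour of such bounds can be illustrated with a small Monte-Carlo experiment for a single fixed hypothesis (a sketch only; the theorem controls the supremum over H, which the experiment below does not):

```python
import numpy as np

rng = np.random.default_rng(7)
d, sigma = 5, 0.1
w_true = rng.standard_normal(d)
f = rng.standard_normal(d)                # a fixed hypothesis f(x) = f^T x


def risk_gap(n, trials=200):
    """Average |E_n(f) - E(f)| over random training sets of size n."""
    expected = np.sum((f - w_true) ** 2) + sigma ** 2   # closed form for Gaussian x and noise
    gaps = []
    for _ in range(trials):
        X = rng.standard_normal((n, d))
        y = X @ w_true + sigma * rng.standard_normal(n)
        gaps.append(abs(np.mean((X @ f - y) ** 2) - expected))
    return np.mean(gaps)


for n in [100, 400, 1600, 6400]:
    print(n, round(risk_gap(n), 4))   # roughly halves when n is quadrupled, i.e. ~ 1/sqrt(n)
```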
We note that lower bounds for the convergence of the empirical risk to the (expected) risk are much harder to obtain in general, and generally require fixing the underlying distribution of the data. We indeed need, for the following proof, a lower bound on this complexity, which is not generally available. However, for the specific problems we are treating here this holds true and lower bounds do indeed exist. For simplicity, we will rely on the above given bound, as the other factors that occur in these bounds only play a minor role here.
3.1.3 Regularisation and modern approaches
The complexity of the hypothesis space, c(H) in Eq. 3.13, relates to the phenomenon of overfitting, where a large hypothesis space results in a low training error on the empirical risk but performs poorly on the true risk.
In the literature, this problem is addressed with so-called regularisation techniques, which limit the size of the hypothesis space and thereby its complexity, in order to avoid overfitting the training dataset.
A number of different regularisation strategies have been proposed in the literature [45, 48, ?], including the well-established Tikhonov regularisation, which directly imposes constraints on the class of candidate predictors.
From a computational perspective, Tikhonov regularisation and other similar approaches compute a solution for the learning problem by optimising a constrained objective (i.e., the empirical risk with an additional regularisation term). The solution is obtained by a sequence of standard linear algebra operations such as matrix multiplication and inversion. Since standard matrix inversion takes O(n³) time for an n × n matrix, most of these solutions, such as those for GPs or SVMs, can be found in O(n³) computational time for n data points. Notably, improvements to this runtime exist based on exploiting sparsity, or trading an approximation error against a lower computational time. Additionally, the time to solution typically depends on the conditioning of the matrix, which can be improved by using preconditioning methods. Regularisation is today widely used, and has led to many popular machine learning algorithms such as Regularised Least Squares [47], Gaussian Process (GP) Regression and Classification [49], Logistic Regression [48], and Support Vector Machines (SVM) [45]. All the above mentioned algorithms fall under the same umbrella of kernel methods [50].
To further reduce the computational cost, modern methods leverage the fact that regularisation can be applied implicitly through incomplete optimisation or other forms of approximation. These ideas have been widely applied in the so-called early stopping approaches which are today routinely used in practice. In early stopping, one only performs a limited number of steps of an iterative optimisation algorithm, typically in gradient based optimisation. It can indeed be shown for convex functions that this process avoids overfitting the training set (i.e., maintains an optimal learning rate), while the computational time is drastically reduced. All such regularisation approaches hence require fewer operations, while maintaining a similar or the same generalisation performance as approaches such as Tikhonov regularisation [?], in some cases provably.
Other approaches include the divide-and-conquer approach [51] and Monte-Carlo sampling (also called sub-sampling) approaches. While the former is based on the idea of distributing partitions of the initial training data, training different predictors on the smaller problem instances, and then combining the individual predictors into a joint one, the latter achieves a form of dimensionality reduction by sampling a subset of the data in a specific manner. The most well-known sub-sampling methods are random features [52] and so-called Nyström approaches [53, 54].
In both cases, computation benefits from parallelisation and the reduced dimension of the datasets, while similarly maintaining statistical guarantees (e.g., [55]).
For all the above mentioned training methods, the computational times can typically be reduced from the O(n³) of standard approaches to Õ(n²) or Õ(nnz), where nnz is the number of non-zero entries in the input data (matrix), while maintaining optimal statistical performance.
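As an illustration of the sub-sampling idea, the following sketch builds a Nyström approximation K ≈ C W^+ C^T of a Gaussian kernel matrix from m ≪ n uniformly sampled landmark columns (uniform sampling is an assumption made here for simplicity; the cited works use more refined sampling schemes and analyses):

```python
import numpy as np

rng = np.random.default_rng(8)
n, d, m = 2000, 5, 100                       # n data points, m sampled landmarks (m << n)
X = rng.standard_normal((n, d))

# Gaussian (RBF) kernel matrix K_ij = exp(-||x_i - x_j||^2 / 2).
sq = np.sum(X**2, axis=1)
K = np.exp(-(sq[:, None] + sq[None, :] - 2 * X @ X.T) / 2)

# Nystrom approximation: sample m columns uniformly and use K ~ C W^+ C^T.
S = rng.choice(n, size=m, replace=False)
C = K[:, S]                                  # n x m
W = K[np.ix_(S, S)]                          # m x m
K_nystrom = C @ np.linalg.pinv(W) @ C.T

rel_err = np.linalg.norm(K - K_nystrom) / np.linalg.norm(K)
print("relative Frobenius error of the Nystrom approximation:", rel_err)
```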
While standard regularisation approaches, such as regularised least squares, can trivially be integrated into quantum algorithms, certain methods appear not to carry over to the quantum algorithms toolbox.
For example, preconditioning, as a tool to reduce the condition number and make the computation more efficient, appears not to have an efficient quantum counterpart yet [12]. Therefore, more research is required to give a full picture of the power and limitations of algorithms with respect to all parameters. Here we offer a brief discussion of the possible effects of inverting badly conditioned matrices and how typical cases could affect the computational complexity, i.e., the asymptotic runtime of the algorithm.
3.2 Review of supervised quantum machine learning algorithms
The majority of proposed supervised quantum machine learning algorithms are based on fast linear algebra operations. Indeed, most quantum machine learning algorithms that claim an exponential improvement over classical counterparts are based on a fast quantum algorithm for solving linear systems of equations [38, 39, 14, 40, 41, 37, 42]. This widely used subroutine is the HHL algorithm, a quantum linear system solver (QLSA) [36], which is named after its inventors Harrow, Hassidim, and Lloyd. The HHL algorithm takes as input the normalised state |b⟩ ∈ R^n and an s(A)-sparse matrix A ∈ R^{n×n}, with spectral norm ‖A‖ ≤ 1 and condition number κ = κ(A), and returns as output a quantum state |w̃⟩ which encodes an approximation of the normalised solution |w⟩ = |A^{-1}b⟩ ∈ R^n of the linear system Aw = b, such that

‖|w̃⟩ − |w⟩‖ ≤ γ,   (3.14)

for an error parameter γ. Note that above we assumed that the matrix is invertible; however, the algorithm can in practice apply the Moore-Penrose inverse (also known as the pseudoinverse), which is defined for arbitrary A ∈ R^{n×m} by A^+ := (A^H A)^{-1} A^H. Using the singular value decomposition of A = UΣV^H, we hence have
$$A^+ = (V\Sigma^2 V^H)^{-1} V\Sigma U^H = V\Sigma^{-1} U^H, \qquad (3.15)$$
such that $A^+ A = VV^H = I_m$.
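As a quick numerical check of Eq. 3.15, the following minimal sketch builds the pseudoinverse from the SVD and compares it against a library routine; the random test matrix is a hypothetical example.

# A minimal sketch verifying Eq. 3.15 numerically: the Moore-Penrose
# pseudoinverse equals V Sigma^-1 U^H for a full-column-rank matrix.
import numpy as np

rng = np.random.default_rng(3)
n, m = 8, 5
A = rng.normal(size=(n, m))             # full column rank with probability 1

U, s, Vh = np.linalg.svd(A, full_matrices=False)
A_pinv = Vh.conj().T @ np.diag(1.0 / s) @ U.conj().T   # V Sigma^-1 U^H

print(np.allclose(A_pinv, np.linalg.pinv(A)))          # matches numpy's pinv
print(np.allclose(A_pinv @ A, np.eye(m)))              # A^+ A = I_m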
The currently best known quantum linear systems algorithm, in terms of computational complexity, runs in
$$O\left(\|A\|_F \, \kappa \, \mathrm{polylog}(\kappa, n, 1/\gamma)\right), \qquad (3.16)$$
time [37], where $\|A\|_F \leq \sqrt{n}\,\|A\|_2 \leq \sqrt{n}$ is the Frobenius norm of $A$ and $\kappa$ its condition number. As we will discuss in more detail in Chapter 4 on randomised numerical linear algebra, such computations can also be done exponentially faster than with previously known classical algorithms by using classical randomised methods in combination with a quantum-inspired memory structure, taking advantage of the aforementioned Monte-Carlo sampling methods [56, 17, 18, 19, 21], provided the data matrix $A$ is of low rank. We note, however, that for full-rank matrices a quantum advantage is still possible.
Before analysing the supervised machine learning algorithms with the above discussed knowledge from statistical learning theory, we will first recapitulate the quantum least squares algorithm (linear regression) and the quantum support vector machine (SVM). Throughout this chapter, we use the least squares problem as a prototypical case to study the behaviour of QML algorithms, but the results extend trivially to the quantum SVM and many other algorithms. In Section 3.4 we also summarise the computational complexities taking into account the statistical guarantees for all other algorithms and hence give explicit bounds.
3.2.1 Recap: Quantum Linear Regression and Least Squares
The least squares algorithm minimises the empirical risk with respect to the quadratic loss
$$\ell^{LS}(f(x), y) = (f(x) - y)^2, \qquad (3.17)$$
for the hypothesis class of linear functions
$$\mathcal{H} := \{ f : X \mapsto Y \,|\, \exists\, w \in \mathbb{R}^d : f(x) = w^H x \}, \qquad (3.18)$$
with input space $\mathbb{R}^d$ and output space $\mathbb{R}$. We are given input-output samples $\{x_i, y_i\}_{i=1}^n$, where $x_i \in \mathbb{R}^d$ and $y_i \in \mathbb{R}$. The empirical risk is therefore given by
$$\mathcal{E}_n(f) := \frac{1}{n} \sum_{i=1}^{n} \left( w^H x_i - y_i \right)^2. \qquad (3.19)$$
The least squares problem seeks to find a vector $w$ such that
$$w = \mathop{\mathrm{argmin}}_{w \in \mathbb{R}^d} \|y - Xw\|_2^2, \qquad (3.20)$$
where $y \in \mathbb{R}^n$ and $X \in \mathbb{R}^{n \times d}$. The closed form solution of this problem is given by $w = (X^H X)^{-1} X^H y$, and we can hence reformulate this again as a linear systems problem of the form $Aw = b$, where $A = X^H X$ and $b = X^H y$. We then obtain the solution by solving the linear system, $w = A^{-1} b$.
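The following minimal sketch illustrates this reformulation on hypothetical data: the closed-form least squares solution is obtained by solving the linear system $Aw = b$ with $A = X^H X$ and $b = X^H y$.

# A minimal sketch of the closed-form least squares solution via the normal
# equations A w = b with A = X^H X and b = X^H y, on hypothetical data.
import numpy as np

rng = np.random.default_rng(4)
n, d = 300, 8
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.05 * rng.normal(size=n)

A = X.T @ X                 # X^H X
b = X.T @ y                 # X^H y
w = np.linalg.solve(A, b)   # solve the linear system instead of forming A^-1

print(np.allclose(w, np.linalg.lstsq(X, y, rcond=None)[0]))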
Since several quantum algorithms for linear regression, and in particular least squares, have been proposed [38, 39, 40, 41, 37, 42], which all result in a similar scaling (taking into account the most recent subroutines), we will in the following analysis use the best known result for the quantum machine learning algorithm. All approaches have in common that they make use of the quantum linear system algorithm to convert the state $|\xi\rangle = |X^H y\rangle$ into the solution $|\tilde{w}\rangle = |(X^H X)^{-1} \xi\rangle$. The fastest known algorithm [37] indeed solves the (regularised) least squares or linear regression problem in time
$$O\left(\|A\|_F \, \kappa \, \mathrm{polylog}(n, \kappa, 1/\gamma)\right),$$
where $\kappa^2$ is the condition number of $X^H X$ or $XX^H$ respectively, and $\gamma > 0$ is an error parameter for the approximation accuracy. Notably, this runtime does not include a physical measurement of the resulting vector $|\tilde{w}\rangle$, since reading out the solution would immediately imply a complexity of
$$O\left(\|A\|_F \, \kappa/\gamma \, \mathrm{polylog}(n, \kappa, 1/\gamma)\right).$$
In that sense, the algorithm solves the classical least squares problem only to a certain extent, as the solution is not accessible in that time. Indeed, it prepares a quantum state $|\tilde{w}\rangle$ which is $\gamma$-close to $|w\rangle$, i.e.,
$$\| |\tilde{w}\rangle - |w\rangle \| \leq \gamma,$$
and in order to recover it classically, we would need to take up to $O(n \log(n))$ samples. In the current form, we can however immediately observe that the Frobenius norm dependency implies that the algorithm is efficient if $X$ is low-rank (but not necessarily sparse). As for all supervised quantum machine learning algorithms for classical input data, the quantum least squares solver requires a quantum-accessible data structure, such as a qRAM.
Notably, it is assumed that $\frac{1}{\kappa^2} \leq \|X^H X\| \leq 1$, i.e., the eigenvalues of $X^H X$ lie in the interval $[1/\kappa^2, 1]$. The output of the algorithm is then a quantum state $|\tilde{w}\rangle$, such that $\| |\tilde{w}\rangle - |w\rangle \| \leq \gamma$, where $|w\rangle$ is the true solution.
We note that other linear regression algorithms based on sample-based Hamiltonian simulation are possible [13, 57], which result in different requirements. Indeed, for these algorithms we need to repeatedly prepare a density matrix which is a normalised version of the input data matrix. While these algorithms have generally worse dependencies on the error $\gamma$, they are independent of the Frobenius norm [39]. The computational complexity in this case is
$$O\left(\kappa^2 \gamma^{-3} \, \mathrm{polylog}(n)\right),$$
which can likely be improved to $O\left(\kappa \gamma^{-3} \, \mathrm{polylog}(n, \kappa, \gamma^{-1})\right)$.
However, since our analysis will show that a higher polynomial error dependency incurs a worse runtime once we take statistical guarantees into account, we will in the following use the algorithm with the lowest such dependency. Notably, the error dependency is in either case polynomial if we require a classical solution to be output.
As a further remark, other linear regression or least squares quantum algorithms exist [40, 42], but we will not include these here, as our results can easily be extended to them as well.
Next, we will also recapitulate the quantum support vector machine.
3.2.2 Recap: Quantum Support Vector Machine
The second prototypical quantum machine learning algorithm which we want to recapitulate is the quantum least-squares support vector machine (qSVM) [14]. As we will see, the procedure for the qSVM is similar to the quantum least squares approach, and therefore results in very similar runtimes. The qSVM algorithm calculates the optimal separating hyperplane by again solving a linear system of equations. For $n$ points $S_n = \{(x_i, y_i)\}_{i=1}^n$ with $x_i \in \mathbb{R}^d$, $y_i \in \{\pm 1\}$, and again assuming that we can efficiently prepare states corresponding to the data vectors, the least-squares formulation of the solution is given by the linear system of the form
$$\begin{pmatrix} 0 & \vec{1}^H \\ \vec{1} & K + \delta^{-1} I \end{pmatrix} \begin{pmatrix} w_0 \\ w \end{pmatrix} = \begin{pmatrix} 0 \\ y \end{pmatrix}, \qquad (3.21)$$
where $K_{ij} = x_i^H x_j$ (or $K_{ij} = \phi(x_i)^H \phi(x_j)$ for non-linear features) is the kernel matrix, $y = (y_1, \ldots, y_n)^H$, $\vec{1}$ is the all-ones vector, and $\delta$ is a user-specified parameter. We note that certain authors argue that a least-squares support vector machine is not truly a support vector machine, and that its practical use is highly restricted. The additional row and column in the matrix on the left hand side arise because of a non-zero offset. Notably, $w^H x + w_0 > 1$ or $w^H x + w_0 < -1$, with $w^H x = \sum_{j} w_j x_j$, determines the hyperplanes. The solution is hence obtained by solving the linear system using the HHL algorithm based on the previously mentioned density matrix exponentiation method [13]. The only adaptation which is necessary is to use the normalised kernel $\hat{K} = K/\mathrm{tr}(K)$. However, since the smallest eigenvalues of $\hat{K}$ will be of $O(1/n)$ due to the normalisation, the quantum SVM algorithm truncates the eigenvalues which are below a certain threshold $\delta_K$, s.t. $\delta_K \leq |\lambda_i| \leq 1$, which results in an effective condition number $\kappa_{eff} = 1/\delta_K$, thereby effectively implementing a form of spectral filtering. The runtime of the quantum support vector machine is given by
$$O\left(\kappa_{eff}^3 \, \gamma^{-3} \, \mathrm{polylog}(nd, \kappa, 1/\gamma)\right),$$
and it outputs a state $|\tilde{w}_n\rangle$ that approximates the solution $w_n := [w_0, w]^H$, such that $\| |\tilde{w}_n\rangle - |w_n\rangle \| \leq \gamma$. Similarly to the least squares algorithm, we cannot retrieve the parameters without an overhead, and the quantum SVM therefore needs to perform the classification immediately.
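To make the structure of Eq. 3.21 concrete, the following minimal sketch assembles and solves the least-squares SVM system classically on a hypothetical toy dataset; the linear kernel and the parameter delta are illustrative choices, and the solve here is a plain dense one rather than the quantum routine.

# A minimal sketch of the least-squares SVM linear system (Eq. 3.21); the toy
# data and the parameter `delta` are hypothetical choices.
import numpy as np

rng = np.random.default_rng(5)
n, d = 60, 2
labels = np.sign(rng.normal(size=n))
X = rng.normal(size=(n, d)) + 1.5 * labels[:, None]    # two noisy clusters
y = labels

K = X @ X.T                                   # linear kernel K_ij = x_i^T x_j
delta = 1.0
F = np.zeros((n + 1, n + 1))                  # block matrix of Eq. 3.21
F[0, 1:] = 1.0
F[1:, 0] = 1.0
F[1:, 1:] = K + (1.0 / delta) * np.eye(n)
rhs = np.concatenate(([0.0], y))
sol = np.linalg.solve(F, rhs)
w0, alpha = sol[0], sol[1:]                   # offset and dual coefficients

predict = lambda x: np.sign(alpha @ (X @ x) + w0)   # classify a new point
print("training accuracy:", np.mean([predict(x) == t for x, t in zip(X, y)]))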
3.3 Analysis of quantum machine learning algorithms
The quantum algorithms we analyse throughout this chapter rely on a range of parameters, which include the input dimension $n$, which corresponds to the number of data points in a sample (the dimension of the individual data points is typically small, so we focus on this part), the error of the algorithm with respect to the final prediction $\gamma$, and the condition number $\kappa$ of the input data matrix. Our main objective is to understand the performance of these algorithms if we want to achieve an overall generalisation error of $\Theta(n^{-1/2})$.
To start, we therefore first need to return to our previous assessment of the risk, and use in the following a standard error decomposition. Let $f$ be a hypothesis, and let $\mathcal{F}$ be the space of all measurable functions $f : X \mapsto Y$. We denote by $\mathcal{E}^* := \inf_{f \in \mathcal{F}} \mathcal{E}(f)$ the Bayes risk, and want to bound the distance $\mathcal{E}(f) - \mathcal{E}^*$.
Let now $\mathcal{E}_{\mathcal{H}} := \inf_{f \in \mathcal{H}} \mathcal{E}(f)$ be the best risk attainable by any function in the hypothesis space $\mathcal{H}$, where we assume in the following for simplicity that $\mathcal{E}_{\mathcal{H}}$ always admits a minimiser $f_{\mathcal{H}} \in \mathcal{H}$. Note that it is possible to remove this assumption by leveraging regularisation. We then decompose the error as:
$$\mathcal{E}(f) - \mathcal{E}^* = \underbrace{\mathcal{E}(f) - \mathcal{E}(\hat{f})}_{\text{Optimisation error}} + \underbrace{\mathcal{E}(\hat{f}) - \mathcal{E}_{\mathcal{H}}}_{\text{Estimation error}} + \underbrace{\mathcal{E}_{\mathcal{H}} - \mathcal{E}^*}_{\text{Irreducible error}} \qquad (3.22)$$
$$= \xi + \Theta(1/\sqrt{n}) + \mu. \qquad (3.23)$$
The first term in Eq. 3.22 is the so-called optimisation error, which indicates how good the optimisation procedure generating $f$ is with respect to the actual minimum (infimum) of the empirical risk. This error stems from the approximations an algorithm typically makes, and relates to the $\gamma$ in previous sections. The optimisation error can result from a variety of approximations, such as a finite number of steps in an iterative optimisation process, or a sample error introduced through a non-deterministic process. This error is discussed in detail in Section 3.3.1. The second term is the estimation error, which is due to taking the empirical risk as a proxy for the true risk by using samples from the distribution $\rho$. This can be bounded by the generalisation bound we discussed in Eq. 3.13. The last term is the irreducible error, which measures how well the hypothesis space describes the problem. If the true solution is not in our hypothesis space, there will always be an irreducible error, which we indicate with the letter $\mu$. If $\mu = 0$, i.e., the irreducible error is zero, then we call $\mathcal{H}$ universal. For simplicity, we assume here that $\mu = 0$, as it also does not significantly impact the results of this work.
From the error decomposition in Eq. 3.22 we see that in order to achieve the best possible generalisation error overall, we need to make sure that the different error contributions are of the same order. We therefore in particular need to ensure that the optimisation error matches the scaling of the estimation error. For most known classical algorithms, with the exception of e.g. Monte-Carlo algorithms, the computational cost scales only as $O(\log(1/\varepsilon))$ in the optimisation error $\varepsilon$, so matching the bounds is usually trivial. However, many quantum algorithms, including some of the quantum linear regression and least squares algorithms we discussed in the previous section (e.g. [14, 39]), have a polynomial dependency on the optimisation error. In the next section we discuss the implications of matching the bounds, and how they affect the algorithms' computational complexity.
Notably, other quantum algorithms have only a polylogarithmic error dependency, such as [37], and for these the error matching does not impose any critical slowdown. In these cases, however, we will see that the quantum algorithms still cannot achieve a polylogarithmic runtime in the dimension of the training set, due to the error resulting from the finite sampling process that is required to extract a classical output from a given quantum state.
Finally, to take into account all dependencies of the quantum algorithms, we also analyse the condition number. Here, we show that with high probability the condition number also has a polynomial dependency on the number of samples in the training set, which therefore indicates a certain scaling of the computational complexity.
We carry out the analysis in the following exemplarily for the least squares case, and then summarise the resulting computational complexities of a range of supervised quantum machine learning algorithms next to the classical ones in Fig. 3.1.
3.3.1 Bound on the optimisation error
As previously mentioned, we will use the quantum least squares algorithms [38, 39, 40] as an example case to demonstrate how the matching of the errors affects the algorithm. The results we obtain can easily be generalised to other algorithms and instances, and in particular hold for all kernel methods. As we aim to remain general, we carry out the analysis for a generic algorithm with computational complexity
$$\Omega\left(n^{\alpha} \, \gamma^{-\beta} \, \kappa^{c} \log(n)\right). \qquad (3.24)$$
We show that in order to have a total error that scales as $n^{-1/2}$, the quantum algorithm will pick up a polynomial $n$-dependency.
The known quantum least squares algorithms have a $\gamma$ error guarantee for the final output state $|\tilde{w}\rangle$, i.e., $\| |w\rangle - |\tilde{w}\rangle \|_2 \leq \gamma$, where $|w\rangle$ is the true solution. The computational complexity (ignoring all but the error dependency) is for all algorithms of the form $O(\gamma^{-\beta})$ for some $\beta$, for example [39] with $\beta = 3$, or [58] with $\beta = 4$.
Since the quantum algorithms require the input data matrix to be either Hermitian or encoded in a larger Hermitian matrix, the dimensionality of the overall matrix is $n + d$ for $n$ data points in $\mathbb{R}^d$. For simplicity, we here assume that the input matrix is an $n \times n$ Hermitian matrix, and neglect this step. In order to achieve the best possible generalisation error, as discussed previously, we want to match the error of the incomplete optimisation to the statistical one. In the least squares setting, $\tilde{w} = w_{\gamma,n}$ is the output of the algorithm corresponding to the optimal parameters fitted to the $S_n$ data points, which exhibits at most a $\gamma$-error. Therefore, $\tilde{w}$ corresponds to the estimator $f_{\gamma,n}$ in the notation that we saw previously in Eq. 3.22 (and Eq. 3.12).
Concretely, we can see that the total error of an estimator $f_{\gamma,n}$ on $n$ data points with precision $\gamma$ is given by
$$\mathcal{E}(f_{\gamma,n}) - \mathcal{E}(f_n) = \underbrace{\mathcal{E}(f_{\gamma,n}) - \mathcal{E}_n(f_{\gamma,n})}_{\text{generalisation error}} + \underbrace{\mathcal{E}_n(f_{\gamma,n}) - \mathcal{E}_n(f_n)}_{\text{Optimisation error}} + \underbrace{\mathcal{E}_n(f_n) - \mathcal{E}(f_n)}_{\text{generalisation error}}$$
$$= \Theta(n^{-1/2}) + \underbrace{\mathcal{E}_n(f_{\gamma,n}) - \mathcal{E}_n(f_n)}_{\text{Optimisation error}}, \qquad (3.25)$$
where the first contribution is a result of Eq. 3.13, i.e., the generalisation performance, and the second comes from the error of the quantum algorithm, which we bound next. In order to achieve the best statistical performance, which means achieving the lowest generalisation error, the algorithmic error must scale at worst as the worst statistical error. We will next show that the optimisation error of a quantum algorithm in terms of the prediction results in a $\gamma$ error, which is inherited from the weights $|\tilde{w}\rangle$ which the quantum algorithm produces. Recall that in least squares, classification is performed via the inner product, i.e.,
$$y_{pred} := \tilde{w}^H x, \qquad (3.26)$$
for model $\tilde{w}$ and data point $x$, which corresponds to $f_{\gamma,n}(x)$ in the general notation.
The empirical risk of the estimator $f_{\gamma,n}$ is then given by
$$\mathcal{E}_n(f_{\gamma,n}) = \frac{1}{n} \sum_{i=1}^{n} \left( \tilde{w}^H x_i - y_i \right)^2. \qquad (3.27)$$
Therefore, assuming the output of the quantum algorithm is a state $\tilde{w}$, while the exact minimiser of the empirical risk is $w$, s.t. $\| |\tilde{w}\rangle - |w\rangle \|_2 \leq \gamma$, and assuming that $|X|$ and $|Y|$ are bounded, we find that
$$|\mathcal{E}_n(f_{n,\gamma}) - \mathcal{E}_n(f_n)| \leq \frac{1}{n} \sum_{i=1}^{n} \left| \left( \tilde{w}^H x_i - y_i \right)^2 - \left( w^H x_i - y_i \right)^2 \right| \leq \frac{1}{n} \sum_{i=1}^{n} L \left| (\tilde{w} - w)^H x_i \right| \leq \frac{1}{n} \sum_{i=1}^{n} L \, \|\tilde{w} - w\|_2 \, \|x_i\|_2 \leq k \cdot \gamma = O(\gamma), \qquad (3.28)$$
where $k > 0$ is a constant, and we used the Cauchy-Schwarz inequality and the fact that for the least squares loss it holds that
$$|\ell^{LS}(f(x_i), y_i) - \ell^{LS}(f(x_j), y_j)| \leq L \, |(f(x_i) - y_i) - (f(x_j) - y_j)|, \qquad (3.29)$$
since $|X|$ and $|Y|$ are bounded.
A few remarks are in order. In the learning setting the number of samples is fixed and cannot be altered, i.e., the statistical error (generalisation error) is fixed to $\Theta(1/\sqrt{n})$, and the larger $n$ is, the better the guarantees we are able to obtain for future predictions. Therefore, it is important to understand how we can reduce the other error contributions in Eq. 3.25 in order to guarantee that we have the lowest possible overall error.
To do so, we match the error bounds of the two contributions, so that the overall performance of the algorithm is maximised, which means that the optimisation error should not surpass the statistical error. We hence set $\gamma = n^{-1/2}$, and see that the overall scaling of the algorithm will need to be of the order $O(n^{\beta/2})$, ignoring again all other contributions. To take a concrete case, for the algorithm in [39] the overall runtime is then at least $O(n^{3/2})$. The overall complexity of the algorithm then takes the form
$$\Omega\left(n^{\alpha} \, n^{\beta/2} \, \log(n) \, \kappa^{c}\right) \qquad (3.30)$$
for some constants $c, \beta, \alpha$.
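The following minimal sketch numerically illustrates the matching argument on hypothetical data: a weight perturbation of size $\gamma = n^{-1/2}$ changes the empirical risk by $O(\gamma)$, in line with Eq. 3.28.

# A minimal sketch checking Eq. 3.28 numerically: a gamma-perturbation of the
# weight vector changes the empirical risk by O(gamma); data are hypothetical.
import numpy as np

rng = np.random.default_rng(6)
n, d = 1000, 5
X = rng.normal(size=(n, d)) / np.sqrt(d)     # bounded-norm inputs
w = rng.normal(size=d)
y = X @ w + 0.1 * rng.normal(size=n)

def emp_risk(v):
    return np.mean((X @ v - y) ** 2)

gamma = n ** -0.5                             # matched to the statistical error
w_tilde = w + gamma * rng.normal(size=d) / np.sqrt(d)
print("gamma                  :", gamma)
print("|E_n(f_gamma) - E_n(f)|:", abs(emp_risk(w_tilde) - emp_risk(w)))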
This straightforward argument can easily be generalised to arbitrary kernels by replacing the input data $x$ with feature vectors $\phi(x)$, where $\phi(\cdot)$ is a chosen feature map.
So far we have only spoken about algorithms which naturally have a polynomial $1/\gamma$-dependency. However, as we previously mentioned, not all quantum algorithms have such an error dependency. For algorithms which only depend polylogarithmically on $1/\gamma$, the quantum mechanical nature of the output will nevertheless incur another polynomial $n$-dependency, as we will see next.
3.3.2 Bounds on the sampling error
So far we have ignored any error introduced by the measurement process. However, we will always need to compute a classical estimate of the output of the quantum algorithm, which is based on repeated sampling of the output state. As this is an inherent process which we need to perform for any quantum algorithm, the following analysis applies to any QML algorithm with classical output. Since we estimate the result by repeatedly measuring the final state of our quantum computation in a chosen basis, our resulting estimate of the desired output is a random variable. It is well known from the central limit theorem that the sampling error for such a random variable scales as $O(1/\sqrt{m})$, where $m$ is the number of independent measurements. This is known as the standard quantum limit or the so-called shot-noise limit. Using so-called quantum metrology it is sometimes possible to overcome this limit and obtain an error that scales as $1/m$. The latter, however, poses the ultimate limit to measurement precision, which is a direct consequence of the Heisenberg uncertainty principle [6, 7].
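The following minimal sketch illustrates the shot-noise scaling with an idealised Bernoulli "measurement"; the true expectation value p and the shot counts are hypothetical choices.

# A minimal sketch of the shot-noise (standard quantum limit) scaling: the
# error of an estimate built from m independent measurement outcomes decays
# as O(1/sqrt(m)); the "measurement" here is an idealised Bernoulli sample.
import numpy as np

rng = np.random.default_rng(7)
p = 0.3                                      # true expectation value to estimate
for m in [10**2, 10**4, 10**6]:
    shots = rng.binomial(1, p, size=m)       # m single-shot measurement outcomes
    est = shots.mean()
    print(f"m = {m:>7d}   |estimate - p| = {abs(est - p):.5f}   1/sqrt(m) = {m**-0.5:.5f}")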
Therefore, any output of the quantum algorithm will carry a measurement error $\tau$. Let us turn back to our least squares quantum algorithm. It produces a state $|\tilde{w}\rangle$ which is an approximation of the true solution $|w\rangle$. Using techniques such as quantum state tomography, we can produce a classical estimate $\hat{w}$ of the vector $\tilde{w}$ with accuracy
$$\|\tilde{w} - \hat{w}\|_2 \leq \tau = \Omega(1/m), \qquad (3.31)$$
where $m$ is the number of measurements performed. If $y = w^H x$ is the error-free (ideal) prediction, then we can hence only produce an approximation $\hat{y} = \hat{w}^H x$, such that
$$|y - \hat{y}| = |w^H x - \hat{w}^H x| \qquad (3.32)$$
$$\leq \left( \|w - \tilde{w}\| + \tau \right) \|x\| \qquad (3.33)$$
$$\leq (\gamma + \tau) \, \|x\|, \qquad (3.34)$$
using again the Cauchy-Schwarz inequality and the triangle inequality. Similar to the previous approach, we need to make sure that the contribution coming from the measurement error scales at most as the worst possible generalisation error, and hence set $\tau = n^{-1/2}$. From this, we immediately see that any quantum machine learning algorithm which is to reach optimal generalisation performance will require $m = \Omega(n^{1/2})$ repetitions, which is hence a lower bound for all supervised quantum machine learning algorithms. For algorithms which do not take advantage of forms of advanced quantum metrology, this might even be $\Omega(n)$. Putting things together, we therefore obtain a scaling of any QML algorithm of
$$\Omega\left(n^{\alpha + (1+\beta)/2} \, \log(n) \, \kappa^{c}\right), \qquad (3.35)$$
which for the state-of-the-art quantum algorithm for least squares [37] results in
$$\Omega\left(n^{2} \, \kappa \, \log(n)\right). \qquad (3.36)$$
In order to determine the overall complexity, we hence only have one parameter left: $\kappa$. However, we already observe that the computational complexity is similar to, or even worse than, that of the best classical machine learning algorithms.
3.3.3 Bounds on the condition number
In the following we analyse the last remaining dependency of the quantum algorithms: the condition number. Let the condition number dependency of the QML algorithm again be given by $\kappa^c$ for some constant $c \in \mathbb{R}^+$. Note that the best known result has a $c = 1$ dependency, ignoring logarithmic factors. We can think of the following three scenarios for the condition number.
1. Best case scenario: In the best case setting, the condition number is one or sufficiently close to one. This is the lower bound and can only happen if the data matrix is full rank and all the eigenvalues are of very similar size, i.e., $\lambda_i \approx \lambda_j$ for all $i, j$. However, for such cases it would be questionable whether a machine learning algorithm would be useful, since this would imply that the data lacks any strong signal. In these cases the quantum machine learning algorithms could be very fast, and might give a quantum advantage if the $n$-scaling due to the error dependency is not too high.
2. Worst case scenario: On the other extreme, the condition number could be infinite, as could be the case for very badly conditioned matrices with smallest eigenvalues approaching $0$. This can happen if we have one or a few strong signals (i.e., eigenvalues which are closer to $1$), and a small amount of additional noise which results in the smallest eigenvalues being close to $0$. Such ill-conditioned systems do indeed occur in practice, but can generally be dealt with by using spectral filtering or preconditioning methods, as for example discussed in [59]. Indeed, the quantum SVM [14] or the HHL algorithm [36] do or can readily make use of such methods. Concretely, they only invert eigenvalues which are above a certain threshold. This hence gives a new, effective condition number $\kappa_{eff} = \sigma_{max}/\sigma_{threshold} \leq 1/\sigma_{threshold}$, which is typically much smaller than the actual $\kappa$, and makes the algorithms practically useful (see the sketch following this list). However, it should be noted that quantum algorithms which perform such steps need to be compared against corresponding classical methods. Note that such truncations (filters) typically introduce an error, which then needs to be taken into account separately. Having covered these two extreme scenarios, we can now focus on a typical case.
3. A plausible case: While the second case will appear in practice, these bounds give little insight into the actual performance of the quantum machine learning algorithms, since we cannot infer any scaling of $\kappa$ from them. However, for kernel based methods, we can derive a plausible case which gives us some intuition of how bad the $\kappa$-scaling can typically be. We will in the following show that with high probability, the condition number of a kernel method has a certain $n$-dependency. Even though this result gives only a bound in probability, it is a plausible case with a concrete $n$-dependency (the dimension of the input matrix), rather than the absolute worst case of $\kappa = \infty$, which gives an impractical upper bound. As a consequence, a quantum kernel method which scales as $O(\kappa^3)$ could pick up a factor of $O(n\sqrt{n})$ in the worst case, which has the same complexity as the classical state of the art.
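The following minimal sketch illustrates the eigenvalue truncation mentioned in the second scenario on a hypothetical spectrum with a few strong signals; the threshold is an illustrative choice.

# A minimal sketch of eigenvalue truncation (spectral filtering) and the
# resulting effective condition number kappa_eff; the spectrum is a
# hypothetical one with a few strong signals and many small noise eigenvalues.
import numpy as np

rng = np.random.default_rng(8)
n = 200
# Spectrum: a few O(1) signal eigenvalues plus many small noise eigenvalues.
eigs = np.concatenate(([1.0, 0.8, 0.5], np.abs(rng.normal(scale=1e-4, size=n - 3))))
kappa = eigs.max() / eigs.min()

threshold = 1e-2                              # user-chosen truncation threshold
kept = eigs[eigs >= threshold]
kappa_eff = kept.max() / threshold            # <= 1/threshold

print("kappa    :", kappa)
print("kappa_eff:", kappa_eff)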
In the following, we prove a lower bound on the condition number of a covariance or kernel matrix, assuming that we have at least one strong signal in the data. The high level idea of this proof is that the sample covariance should be close to the true covariance as the number of samples increases, which we can show using concentration of measure. Next we use the fact that the eigenvalues of the true covariance are known to constitute a convergent series. This means we know that the $k$-th eigenvalue of the true covariance must have an upper bound on its size which is related to $k$. Since we also know that the eigenvalues of the two matrices will be close, and assuming that we have a few strong signals (i.e., $O(1)$ large eigenvalues), we can then bound the condition number as the ratio of the largest over the smallest eigenvalue.
For the following analysis, we will first need to recapitulate some well known results about Mercer kernels, which can be found e.g. in [47]. If $f \in L^2_{\nu}(X)$ is a function in the Hilbert space of square integrable functions on $X$ with Borel measure $\nu$, and $\{\phi_1, \phi_2, \ldots\}$ is a Hilbert basis of $L^2_{\nu}(X)$, then $f$ can be uniquely written as $f = \sum_{k=1}^{\infty} a_k \phi_k$, and the partial sums $\sum_{k=1}^{N} a_k \phi_k$ converge to $f$ in $L^2_{\nu}(X)$. If this convergence holds in $C(X)$, the space of continuous functions on $X$, we say that the series converges uniformly to $f$. If furthermore $\sum_k |a_k|$ converges, then we say that the series $\sum_k a_k$ converges absolutely. Let now $K : X \times X \to \mathbb{R}$ be a continuous function. Then the linear map
$$L_K : L^2_{\nu}(X) \to C(X)$$
given by the following integral transform
$$(L_K f)(x) = \int K(x, x') \, f(x') \, d\nu(x')$$
is well defined. It is well known that the integral operator and the kernel have the following relationship:
Theorem 2 ([47, Theorem 1, p. 34]; first proven in [60]). Let $X$ be a compact domain or a manifold, $\nu$ be a Borel measure on $X$, and $K : X \times X \to \mathbb{R}$ a Mercer kernel. Let $\lambda_k$ be the $k$th eigenvalue of $L_K$ and $\{\phi_k\}_{k \geq 1}$ the corresponding eigenfunctions. Then, for all $x, x' \in X$ we have
$$K(x, x') = \sum_{k=1}^{\infty} \lambda_k \phi_k(x) \phi_k(x'),$$
where the convergence is absolute (for each $x, x' \in X \times X$) and uniform (on $X \times X$).
Note that the kernel here takes the form $K = \Phi_{\infty} \Phi_{\infty}^H$, which for the linear kernel has the form $K = XX^H$. Furthermore, the kernel must have a similar spectrum to the empirical or sample kernel matrix $K_n = X_n X_n^H$, and indeed $\lim_{n \to \infty} K_n \to K$, where we use here the definition $X_n \in \mathbb{R}^{d \times n}$, i.e., we have $n$ vectors of dimension $d$ and therefore $X = \mathbb{R}^d$. The function $K$ is said to be the kernel of $L_K$, and several properties of $L_K$ follow from the properties of $K$. Since we want to understand the condition number of $K_n$, the sample covariance matrix, we need to study the behaviour of its eigenvalues. For this, we start by studying the eigenvalues of $K$. First, the next corollary follows from Theorem 2.
Corollary 3 ([47], Corollary 3). The sum $\sum_k \lambda_k$ is convergent, and
$$\sum_{k=1}^{\infty} \lambda_k = \int_X K(x, x) \, d\nu(x) \leq \nu(X) \, C_K, \qquad (3.37)$$
where $C_K = \sup_{x, x' \in X} |K(x, x')|$ is an upper bound on the kernel. Therefore, for all $k \geq 1$, $\lambda_k \leq \frac{\nu(X) C_K}{k}$.
As we see from Corollary 3, the eigenvalue $\lambda_k$ (or singular value, since $K$ is SPSD) of $L_K$ cannot decrease more slowly than $O(1/k)$, since for a convergent series of real non-negative numbers $\sum_k \alpha_k$, the terms $\alpha_k$ must go to zero faster than $1/k$. Recalling that $K$ is the infinite-dimensional version of the kernel matrix $K_n = X_n X_n^H$ (or generally $K = \Phi \Phi^H$ for arbitrary kernels), we now need to relate the finite-sized
kernel $K_n$ to the kernel $K$. Leveraging concentration inequalities for random matrices, we will now show how $K_n \in \mathbb{R}^{d \times d}$ for $X_n \in \mathbb{R}^{d \times n}$ converges to $K$ as $n$ grows, and therefore that the spectra (i.e., the eigenvalues) must match, which implies the decay of the eigenvalues of $K_n$. Indeed, we will see that the smallest eigenvalue
$\lambda_n = O(1/n)$ with high probability. From this we immediately obtain lower bounds on the condition number with high probability. This is summarised below.
Theorem 4. The condition number of a Mercer kernel $K$ for a finite number of samples $n$ is with high probability lower bounded by $\Omega(\sqrt{n})$.
Proof. We will in the following need some auxiliary results.
Theorem 5 (Matrix Bernstein [61]). Consider a finite sequence $\{X_k\}$ of independent, centred, random, Hermitian $d$-dimensional matrices, and assume that $\mathbb{E} X_k = 0$ and $\|X_k\| \leq R$ for all $k \in [n]$. Let $X := \sum_k X_k$ and $\mathbb{E} X = \sum_k \mathbb{E} X_k$, and let
$$\sigma(X) = \max\left\{ \left\| \mathbb{E}[X X^H] \right\|, \left\| \mathbb{E}[X^H X] \right\| \right\} = \max\left\{ \left\| \sum_{k=1}^{n} \mathbb{E}[X_k X_k^H] \right\|, \left\| \sum_{k=1}^{n} \mathbb{E}[X_k^H X_k] \right\| \right\}. \qquad (3.38)$$
Then,
$$\Pr[\|X\| \geq \varepsilon] \leq 2d \, \exp\left( \frac{-\varepsilon^2/2}{\sigma(X) + R\varepsilon/3} \right), \quad \forall \varepsilon \geq 0. \qquad (3.39)$$
We can make use of this result to straightforwardly bound the largest eigenvalue of the sample covariance matrix, which is a well known result in the random matrix literature. As outlined above, the sample covariance matrix is given by
$$K_n = \frac{1}{n} \sum_{k=1}^{n} x_k x_k^H, \qquad (3.40)$$
for $n$ centred (zero-mean) samples in $\mathbb{R}^d$, and $K := \mathbb{E}[x x^H] \in \mathbb{R}^{d \times d}$. We look in the following at the matrix $A := K_n - K$, and assume
$$\|x_i\|_2^2 \leq r, \quad \forall i \in [n], \qquad (3.41)$$
i.e., the squared sample norm is bounded by some constant $r$. Typically data is sparse, and the norm is hence independent of the dimension, although both scenarios are possible. Here we assume a dependency on $d$. Under this assumption, we let
$$A_k := \frac{1}{n} \left( x_k x_k^H - K \right), \qquad (3.42)$$
for each $k$, and hence $A = \sum_{k=1}^{n} A_k$. With the assumption in Eq. 3.41 we then obtain
$$\|K\| = \left\| \mathbb{E}[x x^H] \right\| \leq \mathbb{E}\left\| x x^H \right\| = \mathbb{E}\left[ \|x\|^2 \right] \leq r, \qquad (3.43)$$
using Jensen's inequality. The matrix variance statistic $\sigma(A)$ is therefore bounded by
$$0 \leq \sigma(A) \leq \frac{r}{n} \, \|K\| \leq \frac{r^2}{n}, \qquad (3.44)$$
which follows from a straightforward calculation. Taking into account that $\|A_k\| \leq \frac{r}{n}$, invoking Thm. 5 with the above bounds, and assuming $r = C \cdot d$ for some constant $C$, i.e., that the norm of the vectors depends on the data dimension $d$ (which might not be the case for sparse data), we hence obtain
$$\mathbb{E}[\|A\|] = \mathbb{E}[\|K_n - K\|] \leq O\left( \frac{d}{\sqrt{n}} + \frac{d}{n} \right) = O\left( \frac{d}{\sqrt{n}} \right). \qquad (3.45)$$
Note that this is essentially $O(\sqrt{1/n})$ in the number of samples $n$, assuming $n \gg d$; similarly, for sparse data with sparsity $s = O(1)$ the norm will not be proportional to $d$.
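The following minimal sketch numerically illustrates the concentration behaviour of Eq. 3.45 for a hypothetical true covariance; note that Gaussian samples do not strictly satisfy the bounded-norm assumption of Eq. 3.41, so this only illustrates the $1/\sqrt{n}$ trend.

# A minimal sketch checking the concentration bound of Eq. 3.45 numerically:
# the sample covariance K_n approaches the true covariance K at a rate of
# roughly 1/sqrt(n); the true covariance here is a hypothetical diagonal one.
import numpy as np

rng = np.random.default_rng(9)
d = 20
K = np.diag(np.linspace(1.0, 0.05, d))        # true covariance with decaying spectrum

for n in [10**2, 10**3, 10**4, 10**5]:
    X = rng.multivariate_normal(np.zeros(d), K, size=n)   # n centred samples
    K_n = X.T @ X / n                                      # sample covariance
    err = np.linalg.norm(K_n - K, ord=2)
    print(f"n = {n:>6d}   ||K_n - K|| = {err:.4f}   1/sqrt(n) = {n**-0.5:.4f}")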
We next need to relate this to the eigenvalues $\lambda_{min}$ and $\lambda_{max}$, for which we will need the following lemma.
Lemma 3. For any two bounded functions $f, g$, it holds that
$$\left| \inf_{x \in X} f(x) - \inf_{x \in X} g(x) \right| \leq \sup_{x \in X} |f(x) - g(x)|. \qquad (3.46)$$
Proof. First we show that $|\sup_X f - \sup_X g| \leq \sup_X |f - g|$. For this, note that
$$\sup_X f = \sup_X (f - g + g) \leq \sup_X (f - g) + \sup_X g \leq \sup_X |f - g| + \sup_X g,$$
and
$$\sup_X g = \sup_X (g - f + f) \leq \sup_X (g - f) + \sup_X f \leq \sup_X |f - g| + \sup_X f,$$
and the result follows. Next, we obtain the Lemma by replacing $f$ with $-f$ and $g$ with $-g$ and using that $\inf_X f = -\sup_X(-f)$; the claim follows.
Using that with high probability $\|K_n - K\| \leq O(d/\sqrt{n} + d/n)$, and therefore