
Quantum Machine Learning For Classical Data

Leonard P. Wossnig

A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy of University College London.

Department of Computer Science
University College London

arXiv:2105.03684v2 [quant-ph] 12 May 2021

May 13, 2021

I, Leonard P. Wossnig, confirm that the work presented in this thesis is my own. Where information has been derived from other sources, I confirm that this has been indicated in the work.

Abstract

In this dissertation, we study the intersection of quantum computing and supervised machine learning algorithms, which means that we investigate quantum algorithms for supervised machine learning that operate on classical data. This area of research falls under the umbrella of quantum machine learning, a research area of computer science which has recently received wide attention. In particular, we investigate to what extent quantum computers can be used to accelerate supervised machine learning algorithms. The aim of this is to develop a clear understanding of the promises and limitations of the current state-of-the-art of quantum algorithms for supervised machine learning, but also to define directions for future research in this exciting field. We start by looking at supervised quantum machine learning (QML) algorithms through the lens of statistical learning theory. In this framework, we derive novel bounds on the computational complexities of a large set of supervised QML algorithms under the requirement of optimal learning rates. Next, we give a new bound for Hamiltonian simulation of dense Hamiltonians, a major subroutine of most known supervised QML algorithms, and then derive a classical algorithm with nearly the same complexity. We then draw the parallels to recent ‘quantum-inspired’ results, and explain the implications of these results for quantum machine learning applications. Looking for areas which might bear larger advantages for QML algorithms, we finally propose a novel algorithm for Quantum Boltzmann machines, and argue that quantum algorithms for quantum data are one of the most promising applications for QML with potentially exponential advantage over classical approaches.

Acknowledgements

I want to thank foremost my supervisor and friend Simone Severini, who has always given me the freedom to pursue any direction I found interesting and promising, and has served me as a guide through most of my PhD. Next, I want to thank my secondary advisor, who has always been readily available to answer my questions and discuss a variety of research topics with me. I also want to thank Carlo Ciliberto, Nathan Wiebe, and Patrick Rebentrost, who have worked closely with me and have also taught me most of the mathematical tricks and methods upon which my thesis is built. I furthermore want to thank all my collaborators throughout the years. These are in particular Chunhao Wang, Andrea Rocchetto, Marcello Benedetti, Alessandro Rudi, Raban Iten, Mark Herbster, Massimiliano Pontil, Maria Schuld, Zhikuan Zhao, Anupam Prakash, Shuxiang Cao, Hongxiang Chen, Shashanka Ubaru, Haim Avron, and Ivan Rungger. To an almost equal extent, I also want to thank Fernando Brandão, Youssef Mroueh, Guang Hao Low, Robin Kothari, Yuan Su, Tongyang Li, Kanav Setia, Matthias Troyer, and Damian Steiger for many helpful discussions, feedback, and enlightening explanations. I am particularly grateful to Edward Grant, Miriam Cha, and Ian Horobin, who made it possible for me to write this thesis. I want to acknowledge UCL for giving me the opportunity to pursue this PhD thesis, and acknowledge the kind support of a Royal Society Research grant and the Google PhD Fellowship, which gave me the freedom to work on these interesting topics.

Portions of the work that are included in this thesis were completed while I was visiting the Institut Henri Poincaré of the Sorbonne University in Paris. I particularly want to thank Riam Kim-McLeod for the support and help with the editing of the thesis. I finally want to thank my family for the continued love and support.

Impact Statement

Quantum machine learning holds promise for many areas, ranging from healthcare to the financial industry. In today’s world, where data is available in abundance, only novel algorithms and approaches are enabling us to make reliable predictions that can enhance our life, productivity, or wealth. While Moore’s law is coming to an end, novel computational paradigms are sought after to enable a further growth of processing power. Quantum computing has become one of the prominent candidates, and is maturing rapidly. The PhD thesis presented here develops and studies this novel computational paradigm in light of existing classical solutions and thereby develops a path towards quantum algorithms that can outperform classical approaches.

Contents

1 Introduction and Overview
  1.1 Synopsis of the thesis
  1.2 Summary of our contributions
  1.3 Statement of authorship

2 Notation And Mathematical Preliminaries
  2.1 Notation
  2.2 Matrix functional analysis

3 Statistical Learning Theory
  3.1 Review of key results in Learning Theory
    3.1.1 Supervised Learning
    3.1.2 Empirical risk minimization and learning rates
    3.1.3 Regularisation and modern approaches
  3.2 Review of supervised quantum machine learning algorithms
    3.2.1 Recap: Quantum Linear Regression and Least Squares
    3.2.2 Recap: Quantum Support Vector Machine
  3.3 Analysis of quantum machine learning algorithms
    3.3.1 Bound on the optimisation error
    3.3.2 Bounds on the sampling error
    3.3.3 Bounds on the condition number
  3.4 Analysis of supervised QML algorithms
  3.5 Conclusion

4 Randomised Numerical Linear Algebra
  4.1 Introduction
  4.2 Memory models and memory access
    4.2.1 The pass efficient model
    4.2.2 Quantum random access memory
    4.2.3 Quantum inspired memory structures
  4.3 Basic matrix multiplication
  4.4 Hamiltonian Simulation
    4.4.1 Introduction
    4.4.2 Related work
    4.4.3 Applications
    4.4.4 Hamiltonian Simulation for dense matrices
    4.4.5 Hamiltonian Simulation with the Nyström method
    4.4.6 Beyond the Nyström method
    4.4.7 Conclusion

5 Promising avenues for QML
  5.1 Generative quantum machine learning
    5.1.1 Related work
  5.2 Boltzmann and quantum Boltzmann machines
  5.3 Training quantum Boltzmann machines
    5.3.1 Variational training for restricted Hamiltonians
    5.3.2 Gradient based training for general Hamiltonians
  5.4 Conclusion

6 Conclusions

Appendices

A Appendix 1: Quantum Subroutines
  A.1 Amplitude estimation
  A.2 The Hadamard test

B Appendix 2: Deferred proofs
  B.1 Derivation of the variational bound
  B.2 Gradient estimation
    B.2.1 Operationalizing the gradient based training
  B.3 Approach 2: Divided Differences
    B.3.1 Operationalising

Bibliography

List of Figures

1.1 Different fields of study in Quantum Machine Learning. The different areas are related to the choice of algorithm, i.e., whether it is executed on a quantum or classical computer, and the choice of the target problem, i.e., whether it operates on quantum or classical data.

3.1 Summary of time complexities for training and testing of different classical and quantum algorithms when statistical guarantees are taken into account. We omit polylog(n,d) dependencies for the quantum algorithms. We assume $\varepsilon = \Theta(1/\sqrt{n})$ and count the effects of measurement errors. The acronyms in the table refer to: least square support vector machines (LS-SVM), kernel ridge regression (KRR), quantum kernel least squares (QKLS), quantum kernel linear regression (QKLR), and quantum support vector machines (QSVM). Note that for quantum algorithms the state obtained after training cannot be maintained or copied and the algorithm must be retrained after each test round. This brings a factor proportional to the train time in the test time of quantum algorithms. Because the condition number may also depend on n and for quantum algorithms this dependency may be worse, the overall scaling of the quantum algorithms may be slower than the classical.

4.1 An example of the data structure that allows for efficient state preparation using a logarithmic number of conditional rotations.

4.2 An example of the classical (dynamic) data structure that enables efficient sample and query access for the example of $H \in \mathbb{C}^{2\times 4}$.

4.3 Comparing our result $O(t\sqrt{d}\,\|H\|\,\mathrm{polylog}(t,d,\|H\|,1/\varepsilon))$ with other quantum and classical algorithms for different models. Since the qRAM model is stronger than the sparse-access model and the classical sampling and query access model, we consider the advantage of our algorithm against others when they are directly applied to the qRAM model.

5.1 Comparison of previous training algorithms for quantum Boltzmann machines. The models have a varying cost function (objective), contain (are able to be trained with) hidden units, and have different input data (classical or quantum).

Chapter 1

Introduction and Overview

In the last twenty years, due to increased computational power and the availability of vast amounts of data, machine learning (ML) has seen immense success, with applications ranging from computer vision [1] to playing complex games such as the Atari series [2] or the traditional game of Go [3]. However, over the past few years, challenges have surfaced that threaten to end this revolution. The two main challenges are the increasingly overwhelming size of the available data sets, and the end of Moore’s law [4]. While novel developments in hardware architectures, such as graphics processing units (GPUs) or tensor processing units (TPUs), enable orders of magnitude improved performance compared to central processing units (CPUs), they cannot significantly improve the performance any longer, as they also reach their physical limitations. They therefore do not offer a structural solution to the challenges posed, and new solutions are required.

On the other hand, a new technology is slowly reaching maturity. Quantum computing, a form of computation that makes use of quantum-mechanical phenomena such as superposition and entanglement, has been predicted to overcome these limitations on classical hardware. Quantum algorithms, i.e., algorithms that can be executed on a quantum computer, have been investigated since the 1980s, and have recently received increasing interest all around the world.

One area that has received particular attention is quantum machine learning (see, e.g., [5]), the combination of quantum computing and machine learning. Quantum machine learning can generally be divided into four distinct areas of research:

[Figure 1.1: a 2×2 grid with the sectors CC, CQ, QC, and QQ; axes labelled "Algorithm/computer" and "Problem/Data".]

Figure 1.1: Different fields of study in Quantum Machine Learning. The different areas are related to the choice of algorithm, i.e., whether it is executed on a quantum or classical computer, and the choice of the target problem, i.e., whether it operates on quantum or classical data.

• Machine learning algorithms that are executed on classical computers (CPUs, GPUs, TPUs) and applied to classical data, the sector CC in Fig. 1.1,

• Machine learning algorithms that are executed on quantum computers (QPUs) and applied to classical data, the sector QC in Fig. 1.1,

• Machine learning algorithms that are executed on classical computers and applied to quantum data, the sector CQ in Fig. 1.1, and

• Machine learning algorithms that are executed on quantum computers (QPUs) and applied to quantum data, the sector QQ in Fig. 1.1.

The most attention has been paid to quantum algorithms that perform supervised machine learning on classical data. The main objective in designing such quantum algorithms is to achieve a computational advantage over any classical algorithm for the same task. This area is the main focus of this thesis, and for brevity we will refer to this area throughout the thesis when we speak about QML. Although it is becoming an increasingly popular area of research, supervised QML has in recent years faced two major challenges.

Firstly, initial research was aimed at designing supervised ML algorithms that have a superior performance, where performance was measured solely in terms of the computational complexity (i.e., theoretical speed) of the algorithm with respect to the input size (the problem dimension) and the desired error (or accuracy). When we speak about the error in this context, we generally refer to the distance (for example in norm) of the solution that we obtain through the algorithm to the true solution. For example, for most algorithms we obtain guarantees that if we run them in time $\varepsilon^{-\alpha}$, then the solution will be $\varepsilon$-close. Assuming that the solution is a vector $x$, and our algorithm produces the vector $\tilde{x}$, this would then imply that $\|\tilde{x} - x\| \le \varepsilon$. Note that different algorithms measure this error differently, but in general we use the spectral or Frobenius norm as the measure for the error throughout this thesis. However, a challenge to this view is that it is today common wisdom in the classical ML community that faster computation is not necessarily the solution to most practical problems in machine learning. Indeed, more data and the ability to extrapolate to unseen data are the most essential components in order to achieve good performance. Mathematically, the ability of algorithms to extrapolate to unseen data and other statistical properties have been formalised in the field of statistical learning theory. The technical term for this ability is the so-called generalisation error of an algorithm, and it is possible to obtain bounds on this error for many common machine learning algorithms. The first research question of this thesis is therefore the following:

Research Question 1 (QML under the lens of SLT). Under the common assump- tions of statistical learning theory, what is the performance of supervised quantum machine learning algorithms?

Most quantum algorithms are described in the so-called query model. Here, the overall computational complexity, i.e., the number of steps an algorithm requires to completion, is given in terms of the number of calls (uses) of a different algorithm, which is called the oracle. Indeed, most known quantum machine learning algorithms heavily rely on such oracles in order to achieve a ‘quantum advantage’. A second challenge is therefore posed by the question of how much of the quantum advantage stems from the oracles and how much from the algorithms themselves. Concretely, most supervised QML methods assume the existence of a fast data preparation oracle or procedure (with complexity log(n) for input dimension n), a quantum equivalent of a random access memory, the qRAM [6, 7]. A closer analysis of these algorithms indeed implies that much of their power stems from this oracle, which indicates that classical algorithms can potentially achieve similar computational complexities if they are given access to such a device. The second research question of this thesis is therefore the following:

Research Question 2 (QML under the lens of RandNLA). Under the assumption of efficient sampling processes for the data for both classical and quantum algorithms, what is the comparative advantage of the latter?
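To make the notion of an efficient classical sampling process concrete, the following Python sketch stores the squared entries of a vector in a binary tree of partial sums, so that an index i can be drawn with probability |x_i|^2/‖x‖^2 in a number of steps logarithmic in the vector length. This is only an illustration in the spirit of classical sample and query access structures; the class name `SampleTree` and its methods are hypothetical and are not taken from the thesis.

```python
import random


class SampleTree:
    """Binary tree of partial sums over the squared entries of a vector.

    Hypothetical helper for illustration: leaves hold |x_i|^2 and every
    internal node holds the sum of its two children.
    """

    def __init__(self, values):
        self.n = len(values)
        self.size = 1
        while self.size < self.n:          # pad the leaf level to a power of two
            self.size *= 2
        self.tree = [0.0] * (2 * self.size)
        for i, v in enumerate(values):
            self.tree[self.size + i] = abs(v) ** 2
        for node in range(self.size - 1, 0, -1):
            self.tree[node] = self.tree[2 * node] + self.tree[2 * node + 1]

    def update(self, i, v):
        """Overwrite entry i and repair the partial sums on the path to the root."""
        node = self.size + i
        self.tree[node] = abs(v) ** 2
        node //= 2
        while node >= 1:
            self.tree[node] = self.tree[2 * node] + self.tree[2 * node + 1]
            node //= 2

    def norm_squared(self):
        """The root stores the squared 2-norm of the vector."""
        return self.tree[1]

    def sample(self, u=None):
        """Return index i with probability |x_i|^2 / ||x||^2.

        u is a number in [0, 1); it is drawn at random if not supplied.
        """
        if u is None:
            u = random.random()
        target = u * self.tree[1]
        node = 1
        while node < self.size:            # walk from the root down to a leaf
            left = self.tree[2 * node]
            if target < left:
                node = 2 * node
            else:
                target -= left
                node = 2 * node + 1
        return node - self.size


tree = SampleTree([1, 2, 0, 2])            # squared entries: 1, 4, 0, 4
draws = [tree.sample() for _ in range(5)]  # random indices, never index 2
```

Both `sample` and `update` touch one node per tree level, which is the classical analogue of the logarithmic access cost assumed for the quantum data preparation oracle. Floating-point rounding at the boundary u close to 1 is not handled in this sketch.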

The goal of this thesis is therefore to investigate the above two challenges to supervised QML, and the resulting research questions. We thereby aim to assess the advantages and disadvantages of such algorithms from the perspective of statistical learning theory and randomised numerical linear algebra.

1.1 Synopsis of the thesis

The thesis is structured as follows. We begin in Chapter 2 with notation and mathematical preliminaries of the thesis. In Chapter 3, we discuss the first challenge and associated Research Question 1, namely the ability of quantum algorithms to generalise to data that is not present in the training set. Limitations arise through the fundamental requirement of the quantum measurement, and the error-dependency of most supervised quantum algorithms. We additionally discuss the condition number, an additional dependency in many supervised QML algorithms.

In Chapter 4 we discuss the second challenge, i.e., Research Question 2 regarding input models. For this we discuss first how to access data with a quantum computer and how these access models compare to classical approaches, in particular to randomised classical algorithms. This also includes a brief discussion about the feasibility of such memory models as well as potential advantages and disadvantages. We then propose a new algorithm for Hamiltonian simulation [8], a common subroutine of many supervised QML algorithms. Next, we propose a randomised classical algorithm for the same purpose, and then use this to discuss and compare the requirements of both algorithms, and how they relate to each other. Based on these insights, we next link our results to subsequent ‘dequantization’ results. This allows us to finally also make claims regarding the limitations of QML algorithms that are based on such memory models (or more generally such oracles).

In Chapter 5, we argue that generative quantum machine learning algorithms that are able to generate quantum data (QQ in Fig. 1.1) are one of the most promising approaches for future QML research, and we propose a novel algorithm for the training of Quantum Boltzmann Machines (QBMs) [9] with visible and hidden nodes. We note that quantum data is here defined as data that is generated by an arbitrarily-sized quantum circuit. We derive error bounds and computational complexities for the training of QBMs based on two different training algorithms, which vary based on the model assumptions. Finally, in Chapter 6, we summarise our insights and discuss the main results of the thesis, ending with a proposal for further research.

1.2 Summary of our contributions

Our responses to the above-posed research questions are summarised below. For Research Question 1, we obtain the following results:

• Taking into account optimal learning rates and the generalisation ability, we show that supervised quantum machine learning algorithms fail to achieve exponential speedups.

• Our analysis is based on the polynomial error-dependency and the repeated measurement that is required to obtain a classical output. We require that the (optimisation) error of the algorithm matches the scaling of the statistical (estimation) error that is inherent to all data-based algorithms and is of $O(1/\sqrt{n})$, for $n$ being the number of samples in the data set. We observe that the performance of the algorithms in question can in practice be worse than the fastest classical solution.

• Such concerns are important for the design of quantum algorithms but are not relevant for classical ones. The computational complexity of the latter typically scales poly-logarithmically in the error $\varepsilon$ and therefore only introduces additional $O(\log(n))$ terms. We acknowledge, however, that this is not generally true and that it is possible to trade off speed against accuracy, which is used, for example, in early stopping. Here, one chooses to obtain a lower accuracy in order to obtain an algorithm that converges faster to the solution with the best statistical error.

• One additional challenge for many quantum algorithms is the problem of preconditioning, which most state-of-the-art classical algorithms such as FALKON [10] do take into account.

• While previous results [11] claimed that preconditioning is possible in the quantum case, this turns out not to be true in general, as shown, e.g., by Harrow and La Placa [12]. Our bounds on the condition number indicate that ill-conditioned QML algorithms are even more prone to be outperformed by classical counterparts.
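The effect of matching the optimisation error to the statistical error can be illustrated with a small arithmetic sketch. It assumes two toy cost models that are not the bounds derived in the thesis: a runtime that grows as $\varepsilon^{-\alpha}$ (polynomial error-dependency, with a hypothetical exponent `alpha`) and a runtime that grows as $\log(1/\varepsilon)$. Setting $\varepsilon = 1/\sqrt{n}$ turns the first into $n^{\alpha/2}$ and the second into $\tfrac{1}{2}\log n$.

```python
import math


def optimisation_error_target(n):
    """Statistical error scale 1/sqrt(n) that the optimisation error has to match."""
    return 1.0 / math.sqrt(n)


def cost_polynomial_in_error(n, alpha):
    """Toy cost eps^(-alpha); alpha is a hypothetical exponent."""
    eps = optimisation_error_target(n)
    return eps ** (-alpha)


def cost_polylog_in_error(n):
    """Toy cost log(1/eps), standing in for a poly-logarithmic error-dependency."""
    eps = optimisation_error_target(n)
    return math.log(1.0 / eps)


# Print both toy costs for growing data set sizes.
for n in (10**2, 10**4, 10**6):
    print(n, cost_polynomial_in_error(n, alpha=2), cost_polylog_in_error(n))
```

In this toy model, the polynomial error-dependency contributes a factor that is itself polynomial in the number of samples, whereas the logarithmic dependency contributes only a logarithmic factor.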

For Research Question 2, we obtain the following results:

• We first show that data access oracles, such as quantum random access memory [6], can be used to construct faster quantum algorithms for dense Hamiltonian simulation, and obtain an algorithm with performance (runtime) depending on the square root of the dimension $n$ of the input Hamiltonian, and linearly on the spectral norm, i.e., $O(\sqrt{n}\,\|H\|)$ for a Hamiltonian $H \in \mathbb{C}^{n\times n}$. For an $s$-sparse Hamiltonian, this reduces to time $O(\sqrt{s}\,\|H\|)$, where $s$ is the maximum number of non-zero elements in the rows of $H$.

• We next show how we can derive a classical algorithm for the same task, which is based on a classical Monte-Carlo sampling method called the Nyström method. We show that for a sparse Hamiltonian $H \in \mathbb{C}^{n\times n}$, there exists an algorithm that, with probability $1-\delta$, approximates any chosen amplitude of the state $e^{iHt}\psi$, for an efficiently describable state $\psi$, in time

\[ O\left( sq + \frac{t^{9}\,\|H\|_F^{4}\,\|H\|^{7}}{\varepsilon^{4}} \left( \log(n) + \log\frac{1}{\delta} \right)^{2} \right), \]

where $\varepsilon$ determines the quality of the approximation, $\|\cdot\|_F$ is the Frobenius norm, and $q$ is the number of non-zero elements in $\psi$.

• While our algorithm is not generally efficient due to the dependency on the Frobenius norm, it still removes the explicit dependency on $n$, i.e., the system dimension, up to logarithmic factors. We therefore obtain an algorithm that only depends on the rank, sparsity, and the spectral norm of the problem, since $\|H\|_F \le \sqrt{r}\,\|H\|$, for $r$ being the rank of $H$. For low-rank matrices, we therefore obtain a potentially much faster algorithm, and for the case $q = s = r = O(\mathrm{polylog}(n))$ our algorithm becomes efficiently executable. Note that we here generally assume that $n$ grows exponentially in the number of qubits in the system, i.e., we only obtain an efficient algorithm if we do not have an explicit $n$ dependency.

• This result is interesting in two different ways: Firstly, it is the first known result that applies so-called sub-sampling methods for simulating (general) Hamiltonians on a classical computer. Secondly, and more importantly in the context of this thesis, our result indicates that the advantage of so-called quantum machine learning algorithms may not be as big as promised. QML algorithms such as quantum principal component analysis [13], or quantum support vector machines [14] were claimed to be efficient for sparse or low rank input data, and our classical algorithm for Hamiltonian simulation is efficient under similar conditions. As a corollary, we indeed show that we can efficiently simulate $\exp(i\rho t)$ for a density matrix $\rho$, if we can efficiently sample from the rows and columns of $\rho$ according to a certain probability distribution.

• While we did not manage to extend our results to quantum machine learning algorithms, shortly after we posted our result to the arXiv, Ewin Tang used similar methods to ‘dequantize’ [15] the quantum recommendation systems algorithm by Kerenidis and Prakash [16]. While our algorithm requires the input matrix to be efficiently row-computable, Tang designed a classical memory model that allows us to sample the rows or columns of the input matrix according to their norms. Under closer inspection, these requirements are fundamentally equivalent. Furthermore, the ability to sample efficiently from this distribution is indeed similar to the ability of a quantum random access memory to prepare quantum states of the rows and columns of an input matrix. Dequantization then refers to a classical algorithm that can perform the same task as a chosen quantum algorithm with an at most polynomial slowdown. In the subsequent few months, many other algorithms were published that achieved similar results for many other quantum machine learning algorithms including quantum PCA [17], Quantum Linear Systems Algorithm [18, 19], and Quantum Semi-Definite Programming [20]. Most of these algorithms were unified recently in the framework of quantum singular value transformations [21].
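The sub-sampling idea behind the Nyström method mentioned above can be sketched in a few lines of NumPy. The sketch below is the textbook Nyström approximation of a positive semidefinite matrix from a uniform sample of its columns, and it is not the Hamiltonian simulation algorithm of the thesis; the function name `nystrom_approximation`, the uniform sampling scheme, and the cut-off `rcond=1e-10` are assumptions made for illustration.

```python
import numpy as np


def nystrom_approximation(K, m, rng):
    """Textbook Nystrom approximation K ~ C W^+ C^H from m sampled columns.

    K is assumed to be positive semidefinite; columns are sampled uniformly
    without replacement (an assumption of this sketch).
    """
    n = K.shape[0]
    idx = rng.choice(n, size=m, replace=False)
    C = K[:, idx]                      # n x m block of sampled columns
    W = K[np.ix_(idx, idx)]            # m x m block at the sampled rows and columns
    return C @ np.linalg.pinv(W, rcond=1e-10) @ C.conj().T


rng = np.random.default_rng(0)
n, r = 200, 5
G = rng.standard_normal((n, r))
K = G @ G.T                            # PSD matrix of rank r

K_hat = nystrom_approximation(K, m=10, rng=rng)
rel_error = np.linalg.norm(K - K_hat, "fro") / np.linalg.norm(K, "fro")
print(rel_error)
```

For a matrix of exact rank r, the approximation is exact up to rounding whenever the sampled rows of the factor G span its full r-dimensional row space, which is why only m = 10 of the 200 columns are needed in this example. The cost of forming the approximation depends on m and n but not on n squared, which is the kind of saving that low-rank structure makes possible.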

A conclusion from the above results and the answers to the research questions is that the hope for exponential advantages of quantum algorithms for machine learning over their classical counterparts might be misplaced. While polynomial advantages may still be feasible, and most results currently indicate a gap, understanding the limitations of classical and quantum algorithms is still an open research question. As many of the algorithms we investigate in this thesis are of a theoretical nature, we want to mention that the ultimate performance test for any algorithm is a benchmark, and only a benchmark will give the final answers to questions of real advantage. However, this will only be possible once sufficiently large and accurate quantum computers are available, and will therefore not be possible in the near future. As our and subsequent results indicate, quantum machine learning algorithms for classical data to date appear to allow for at most polynomial speedups, if any.

We therefore turn to an area where classical machine learning algorithms might generally be inefficient: modelling of quantum distributions, and therefore quantum algorithms for quantum data. We propose a method for fully quantum generative training of quantum Boltzmann machines that, in contrast to prior art, have both visible and hidden units. We base our training on the quantum relative entropy objective function and find efficient algorithms for training based on gradient estimations under the assumption that Gibbs state preparation for the model Hamiltonian is efficient. One interesting feature of these results is that such generative models can in principle also be used for the state preparation, and might therefore be a useful tool to overcome qRAM-related issues.

1.3 Statement of authorship

This thesis is based on the following research articles. Notably, for many articles the order of authors is alphabetical, which is standard in theoretical computer science.

1. Danial Dervovic, Mark Herbster, Peter Mountney, Simone Severini, Naïri Usher, and Leonard Wossnig. Quantum linear systems algorithms: a primer. arXiv preprint arXiv:1802.08227, 2018

2. Carlo Ciliberto, Mark Herbster, Alessandro Davide Ialongo, Massimiliano Pontil, Andrea Rocchetto, Simone Severini, and Leonard Wossnig. Quantum machine learning: a classical perspective. Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences, 474(2209):20170551, 2018

3. Chunhao Wang and Leonard Wossnig. A quantum algorithm for simulating non-sparse Hamiltonians. Quantum Information & Computation, 20(7&8):597–615, 2020

4. Alessandro Rudi, Leonard Wossnig, Carlo Ciliberto, Andrea Rocchetto, Massimiliano Pontil, and Simone Severini. Approximating Hamiltonian dynamics with the Nyström method. Quantum, 4:234, 2020

5. Carlo Ciliberto, Andrea Rocchetto, Alessandro Rudi, and Leonard Wossnig. Fast quantum learning with statistical guarantees. arXiv preprint arXiv:2001.10477, 2020

6. Nathan Wiebe and Leonard Wossnig. Generative training of quantum Boltzmann machines with hidden units. arXiv preprint arXiv:1905.09902, 2019

I have additionally co-authored the following articles that are not included in this thesis:

1. Hongxiang Chen, Leonard Wossnig, Simone Severini, Hartmut Neven, and Masoud Mohseni. Universal discriminative quantum neural networks. arXiv preprint arXiv:1805.08654, 2018

2. Marcello Benedetti, Edward Grant, Leonard Wossnig, and Simone Severini. Adversarial quantum circuit learning for pure state approximation. New Journal of Physics, 21(4):043023, 2019

3. Edward Grant, Leonard Wossnig, Mateusz Ostaszewski, and Marcello Benedetti. An initialization strategy for addressing barren plateaus in parametrized quantum circuits. Quantum, 3:214, 2019

4. Shuxiang Cao, Leonard Wossnig, Brian Vlastakis, Peter Leek, and Edward Grant. Cost-function embedding and dataset encoding for machine learning with parametrized quantum circuits. Physical Review A, 101(5):052309, 2020

5. I Rungger, N Fitzpatrick, H Chen, CH Alderete, H Apel, A Cowtan, A Patterson, D Munoz Ramo, Y Zhu, NH Nguyen, et al. Dynamical mean field theory algorithm and experiment on quantum computers. arXiv preprint arXiv:1910.04735, 2019

6. Andrew Patterson, Hongxiang Chen, Leonard Wossnig, Simone Severini, Dan Browne, and Ivan Rungger. Quantum state discrimination using noisy quantum neural networks. arXiv preprint arXiv:1911.00352, 2019

7. Jules Tilly, Glenn Jones, Hongxiang Chen, Leonard Wossnig, and Edward Grant. Computation of molecular excited states on IBMQ using a discriminative variational quantum eigensolver. arXiv preprint arXiv:2001.04941, 2020

Chapter 2

Notation And Mathematical Preliminaries

2.1 Notation

We denote vectors with lower-case letters. For a vector $x \in \mathbb{C}^n$, let $x_i$ denote the $i$-th element of $x$. A vector is sparse if most of its entries are $0$. For an integer $k$, let $[k]$ denote the set $\{1,\dots,k\}$.

For a matrix $A \in \mathbb{C}^{m\times n}$, let $A^j := A_{:,j}$, $j \in [n]$, denote the $j$-th column vector of $A$, $A_i := A_{i,:}$, $i \in [m]$, the $i$-th row vector of $A$, and $A_{ij} := A(i,j)$ the $(i,j)$-th element. We denote by $A_{i:j}$ the sub-matrix of $A$ that contains the rows from $i$ to $j$.

The supremum is denoted as $\sup$ and the infimum as $\inf$. For a measure space $(X,\Sigma,\mu)$ and a measurable function $f$, the set of essential upper bounds of $f$ is defined as $U_f^{\mathrm{ess}} := \{l \in \mathbb{R} : \mu(f^{-1}(l,\infty)) = 0\}$, i.e., $l$ is an essential upper bound if the measurable set $f^{-1}(l,\infty)$ is a set of measure zero, which means that $f(x) \le l$ for almost all $x \in X$. Then the essential supremum is defined as $\operatorname{ess\,sup} f := \inf U_f^{\mathrm{ess}}$. We let the span of a set $S = \{v_i\}_1^k \subseteq \mathbb{C}^n$ be defined by $\mathrm{span}\{S\} := \{x \in \mathbb{C}^n \mid \exists \{\alpha_i\}_1^k \subseteq \mathbb{C} \text{ with } x = \sum_{i=1}^k \alpha_i v_i\}$. The set is linearly independent if $\sum_i \alpha_i v_i = 0$ holds if and only if $\alpha_i = 0$ for all $i$. The range of $A \in \mathbb{C}^{m\times n}$ is defined by $\mathrm{range}(A) = \{y \in \mathbb{C}^m : y = Ax \text{ for some } x \in \mathbb{C}^n\} = \mathrm{span}(A^1,\dots,A^n)$. Equivalently, the range of $A$ is the set of all linear combinations of the columns of $A$. The null space $\mathrm{null}(A)$ (or kernel $\ker(A)$) is the set of vectors $v$ such that $Av = 0$, i.e., $\mathrm{null}(A) = \{x \in \mathbb{C}^n : Ax = 0\}$.

The rank of a matrix $A \in \mathbb{C}^{m\times n}$, $\mathrm{rank}(A)$, is the dimension of $\mathrm{range}(A)$ and is equal to the number of linearly independent columns of $A$. Since this is equal to $\mathrm{rank}(A^H)$, $A^H$ being the complex conjugate transpose of $A$, it also equals the number of linearly independent rows of $A$, and satisfies $\mathrm{rank}(A) \le \min\{m,n\}$. The trace of a matrix is the sum of its diagonal elements, $\mathrm{Tr}(A) = \sum_i A_{ii}$. The support of a vector, $\mathrm{supp}(v)$, is the set of indices $i$ such that $v_i \neq 0$, and we call its size the sparsity of the vector. For a matrix, the sparsity is the number of non-zero entries, while row or column sparsity refers to the number of non-zero entries per row or column. A symmetric matrix $A$ is positive semidefinite (PSD) if all its eigenvalues are non-negative. For a PSD matrix $A$ we write $A \succeq 0$. Similarly, $A \succeq B$ is the partial ordering which is equivalent to $A - B \succeq 0$.

We use the following standard norms: the Frobenius norm $\|A\|_F = \sqrt{\sum_{i=1}^m \sum_{j=1}^n A_{ij} A_{ij}^*}$, and the spectral norm $\|A\| = \sup_{x \in \mathbb{C}^n,\, x \neq 0} \frac{|Ax|}{|x|}$. Note that $\|A\|_F^2 = \mathrm{Tr}(A^H A) = \mathrm{Tr}(A A^H)$. Both norms are submultiplicative and unitarily invariant and they are related to each other as $\|A\| \le \|A\|_F \le \sqrt{n}\,\|A\|$.

The singular value decomposition of $A$ is $A = U\Sigma V^H$, where $U, V$ are unitary matrices and $U^H$ denotes the complex conjugate transpose, also called Hermitian conjugate, of $U$. We use throughout the thesis $x^H$ or $A^H$ for the Hermitian conjugate as well as for the transpose for real matrices and vectors. We denote the pseudo-inverse of a matrix $A$ with singular value decomposition $U\Sigma V^H$ as $A^+ := V\Sigma^+ U^H$.
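As a small numerical illustration of these definitions, the following NumPy sketch evaluates both norms for a random low-rank matrix, checks the chain of inequalities between them (including the tighter bound with the rank in place of $n$), and builds the pseudo-inverse from the singular value decomposition. The matrix sizes, the random seed, and the cut-off `1e-10` for treating singular values as zero are arbitrary choices of this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r = 6, 5, 2
# Random real m x n matrix of rank r, built as a product of two thin factors.
A = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))

spectral = np.linalg.norm(A, 2)          # largest singular value
frobenius = np.linalg.norm(A, "fro")     # sqrt of the sum of squared entries
rank = np.linalg.matrix_rank(A)

# Frobenius norm via the trace identity ||A||_F^2 = Tr(A^H A).
frobenius_from_trace = np.sqrt(np.trace(A.conj().T @ A).real)

# Pseudo-inverse A^+ = V Sigma^+ U^H, inverting only the non-zero singular values.
U, s, Vh = np.linalg.svd(A, full_matrices=False)
keep = s > 1e-10 * s[0]
A_pinv = (Vh[keep].conj().T / s[keep]) @ U[:, keep].conj().T

print(spectral, frobenius, np.sqrt(rank) * spectral, np.sqrt(n) * spectral)
```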

2.2 Matrix functional analysis

While computing the gradient of the average log-likelihood is a straightforward task when training ordinary Boltzmann machines, finding the gradient of the quantum relative entropy is much harder. The reason for this is that in general

$[\partial_\theta H(\theta), H(\theta)] \neq 0$. This means that the ordinary rules that are commonly used in calculus for finding the derivative no longer hold. One important example that we will use repeatedly is Duhamel's formula:

$$\partial_\theta e^{H(\theta)} = \int_0^1 ds\, e^{H(\theta)s}\, \partial_\theta H(\theta)\, e^{H(\theta)(1-s)}. \tag{2.1}$$
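Duhamel's formula (2.1) can be checked numerically on a small example. The sketch below is illustrative only: it assumes the hypothetical family $H(\theta) = \theta X + Z$ built from the Pauli matrices $X$ and $Z$ (not a Hamiltonian taken from the text), approximates the left-hand side by a central finite difference and the right-hand side by a midpoint rule, and also evaluates the naive commuting guess $\partial_\theta H \, e^{H}$ for comparison.

```python
# Numerical check of Duhamel's formula (2.1) for H(theta) = theta * X + Z.
# X and Z are Pauli matrices; this choice of H is an illustrative assumption.

def matmul(a, b):
    """Product of two square matrices given as nested lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def scale(c, a):
    return [[c * x for x in row] for row in a]

def expm(a, terms=40):
    """Matrix exponential via a truncated Taylor series (adequate for small norms)."""
    n = len(a)
    identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    result, term = identity, identity
    for k in range(1, terms):
        term = scale(1.0 / k, matmul(term, a))
        result = add(result, term)
    return result

def max_abs_diff(a, b):
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))

X = [[0.0, 1.0], [1.0, 0.0]]
Z = [[1.0, 0.0], [0.0, -1.0]]

def H(theta):
    return add(scale(theta, X), Z)

theta, h, steps = 0.7, 1e-5, 400

# Left-hand side: central finite difference of exp(H(theta)).
lhs = scale(1.0 / (2 * h),
            add(expm(H(theta + h)), scale(-1.0, expm(H(theta - h)))))

# Right-hand side: midpoint rule for the integral over s in [0, 1]; dH/dtheta = X.
rhs = [[0.0, 0.0], [0.0, 0.0]]
for i in range(steps):
    s = (i + 0.5) / steps
    integrand = matmul(matmul(expm(scale(s, H(theta))), X),
                       expm(scale(1.0 - s, H(theta))))
    rhs = add(rhs, scale(1.0 / steps, integrand))

# Naive guess that would only be valid if X commuted with H(theta).
naive = matmul(X, expm(H(theta)))

duhamel_gap = max_abs_diff(lhs, rhs)
naive_gap = max_abs_diff(lhs, naive)
print(f"Duhamel gap: {duhamel_gap:.2e}, naive gap: {naive_gap:.2e}")
```

In this example the two sides of (2.1) agree up to discretisation error, while the naive expression differs by an amount of order one, because it is not even a symmetric matrix.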

This formula can be easily proven by expanding the exponential in a Trotter-Suzuki expansion with $r$ time-slices, differentiating the result and then taking the limit as $r \to \infty$. However, the relative complexity of this expression compared to what would be expected from the product rule serves as an important reminder that computing the gradient is not a trivial exercise. A similar formula also exists for the logarithm, as shown further below.

Similarly, because we are working with functions of matrices here, we also need to work with a notion of monotonicity. We will see that for some of our approximations to hold we will also need to define a notion of concavity (in order to use Jensen's inequality). These notions are defined below.

Definition 1 (Operator monotonicity). A function $f$ is operator monotone with respect to the semidefinite order if $0 \preceq A \preceq B$, for two symmetric positive definite operators, implies $f(A) \preceq f(B)$. A function is operator concave w.r.t. the semidefinite order if $c f(A) + (1-c) f(B) \preceq f(cA + (1-c)B)$ for all positive definite $A, B$ and $c \in [0,1]$.

We now derive or review some preliminary equations that we will need in order to obtain a useful bound on the gradients in the main work.

Claim 1. Let A(θ) be a linear operator which depends on the parameters θ. Then

$$\frac{\partial}{\partial\theta} A(\theta)^{-1} = -A^{-1} \frac{\partial A}{\partial\theta} A^{-1}. \tag{2.2}$$

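Proof. The proof follows directly by differentiating the identity $I = A A^{-1}$:

$$\frac{\partial I}{\partial\theta} = 0 = \frac{\partial}{\partial\theta}\left(A A^{-1}\right) = \frac{\partial A}{\partial\theta} A^{-1} + A \frac{\partial A^{-1}}{\partial\theta}.$$

Reordering the terms completes the proof. This can equally be proven using the Gâteaux derivative.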

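In the following we will furthermore rely on the following well-known inequality:

Lemma 1 (Von Neumann Trace Inequality). Let $A \in \mathbb{C}^{n \times n}$ and $B \in \mathbb{C}^{n \times n}$ with singular values $\{\sigma_i(A)\}_{i=1}^n$ and $\{\sigma_i(B)\}_{i=1}^n$ respectively, such that $\sigma_i(\cdot) \le \sigma_j(\cdot)$ if $i \le j$. It then holds that

$$|\mathrm{Tr}(AB)| \le \sum_{i=1}^n \sigma_i(A)\,\sigma_i(B). \tag{2.3}$$

Note that from this we immediately obtain

$$|\mathrm{Tr}(AB)| \le \sum_{i=1}^n \sigma_i(A)\,\sigma_i(B) \le \sigma_{\max}(B) \sum_i \sigma_i(A) = \|B\| \sum_i \sigma_i(A). \tag{2.4}$$

This is particularly useful if $A$ is Hermitian and PSD: in this case the singular values of $A$ coincide with its eigenvalues, so that $\sum_i \sigma_i(A) = \mathrm{Tr}(A)$ and hence $|\mathrm{Tr}(AB)| \le \|B\|\,\mathrm{Tr}(A)$.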
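The chain of inequalities (2.3) and (2.4) can be followed on a concrete pair of matrices. The matrices below are chosen for illustration only.

```latex
A = \begin{pmatrix} 2 & 0 \\ 0 & 1 \end{pmatrix}, \qquad
B = \begin{pmatrix} 1 & 2 \\ 2 & 1 \end{pmatrix}, \qquad
AB = \begin{pmatrix} 2 & 4 \\ 2 & 1 \end{pmatrix}, \qquad
|\mathrm{Tr}(AB)| = 3 .

\text{Singular values (ascending): } \sigma(A) = (1, 2), \quad \sigma(B) = (1, 3)
\ \text{(the eigenvalues of } B \text{ are } 3 \text{ and } -1\text{)} .

|\mathrm{Tr}(AB)| = 3 \;\le\; \sum_i \sigma_i(A)\sigma_i(B) = 1 \cdot 1 + 2 \cdot 3 = 7
\;\le\; \|B\| \sum_i \sigma_i(A) = 3 \cdot 3 = 9 = \|B\|\,\mathrm{Tr}(A) .
```

Since $A$ is Hermitian and PSD here, the last expression is exactly the bound $\|B\|\,\mathrm{Tr}(A)$.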

Since we are dealing with operators, the common chain rule of differentiation does not hold generally. Indeed, the chain rule holds only in the special case where the derivative of the operator commutes with the operator itself. Since we are encountering a term of the form $\log \sigma(\theta)$, we cannot assume that $[\sigma, \sigma'] = 0$, where $\sigma' := \sigma^{(1)}$ is the derivative w.r.t. $\theta$. For this case we need the following identity, similar to Duhamel's formula in the derivation of the gradient for the purely-visible-units case.

Lemma 2 (Derivative of matrix logarithm [34]).

$$\frac{d}{dt} \log A(t) = \int_0^1 ds\, [sA + (1-s)I]^{-1} \frac{dA}{dt} [sA + (1-s)I]^{-1}. \tag{2.5}$$

For completeness we here include a proof of the above identity.

Proof. We use the integral definition of the logarithm [35] for a complex, invertible, $n \times n$ matrix $A = A(t)$ with no real negative eigenvalues:

$$\log A = (A - I) \int_0^1 ds\, [s(A - I) + I]^{-1}. \tag{2.6}$$

From this we obtain the derivative

$$\frac{d}{dt} \log A = \frac{dA}{dt} \int_0^1 ds\, [s(A - I) + I]^{-1} + (A - I) \int_0^1 ds\, \frac{d}{dt} [s(A - I) + I]^{-1}.$$

Applying (2.2) to the second term on the right hand side yields

$$\frac{d}{dt} \log A = \frac{dA}{dt} \int_0^1 ds\, [s(A - I) + I]^{-1} - (A - I) \int_0^1 ds\, [s(A - I) + I]^{-1}\, s\, \frac{dA}{dt}\, [s(A - I) + I]^{-1},$$

which can be rewritten as

$$\frac{d}{dt} \log A = \int_0^1 ds\, [s(A - I) + I]\,[s(A - I) + I]^{-1} \frac{dA}{dt} [s(A - I) + I]^{-1} \tag{2.7}$$
$$\qquad\qquad - (A - I) \int_0^1 ds\, [s(A - I) + I]^{-1}\, s\, \frac{dA}{dt}\, [s(A - I) + I]^{-1}, \tag{2.8}$$

by adding the identity $I = [s(A - I) + I][s(A - I) + I]^{-1}$ in the first integral and reordering commuting terms (i.e., $s$). Notice that we can hence just subtract the leading factors $[s(A - I) + I]$ and $s(A - I)$ of the two integrands, which leaves the identity. Since $s(A - I) + I = sA + (1 - s)I$, this yields (2.5) as desired.

Chapter 3

Statistical Learning Theory

This chapter applies insights from statistical learning theory to answer the following question:

Research Question 1 (QML under the lens of SLT). Under the common assumptions of statistical learning theory, what is the performance of supervised quantum machine learning algorithms?

The main idea in this chapter is to leverage the framework of statistical learning theory to understand how the minimum number of samples required by a learner to reach a target generalisation accuracy influences the overall performance of quantum algorithms. By taking into account well known bounds on this accuracy, we can show that quantum machine learning algorithms for supervised machine learning are unable to achieve polylogarithmic runtimes in the input dimension. Notably, the results presented here hold only for supervised quantum machine learning algorithms for which statistical guarantees are available. Our results show that without further assumptions on the problem, known quantum machine learning algorithms for supervised learning achieve only moderate polynomial speedups over efficient classical algorithms, if any. We note that the quantum machine learning algorithms that we analyse here are all based on fast quantum linear algebra subroutines [36, 37]. These in particular include quantum support vector machines [14], quantum linear regression, and quantum least squares [38, 39, 40, 41, 37, 42].

Notably, the origin of the ‘slow down’ of quantum algorithms under the above consideration is twofold.

First, most of the known quantum machine learning algorithms have at least an inversely linear scaling in the optimisation or approximation error of the algorithm, i.e., they require a computational time $O(\varepsilon^{-\alpha})$ (for some real positive $\alpha$) to return a solution $\tilde{x}$ which is $\varepsilon$-close in some norm to the true solution $x$ of the problem. For example, in the case of the linear systems algorithm, we obtain a quantum state $|\tilde{x}\rangle$ such that $\| |\tilde{x}\rangle - |x\rangle \| \le \varepsilon$, where $|x\rangle = |A^{-1}b\rangle$ is the state that encodes the exact solution to the linear system. This is in stark contrast to classical algorithms, which typically have a logarithmic dependency on the error, i.e., by running the algorithm for time $O(\log(1/\varepsilon))$, we obtain a solution with error $\varepsilon$.

Second, a crucial bottleneck in many quantum algorithms is the requirement to sample at the end of the computation. This generally introduces another error, since we need to repeatedly measure the resulting quantum state in order to obtain the underlying classical result.
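The two effects can be made tangible with a few numbers. The sketch below is an illustration under explicit assumptions: all constants and polylogarithmic factors are dropped, $\alpha = 1$ is assumed for the polynomial error dependence, and the number of measurement repetitions is taken from the standard Hoeffding bound for estimating a single outcome probability, which is a swapped-in textbook estimate rather than a statement from the text. All function names are hypothetical.

```python
import math

def polynomial_error_cost(eps, alpha=1.0):
    """Cost proportional to eps**(-alpha); constants are dropped (assumption)."""
    return eps ** (-alpha)

def logarithmic_error_cost(eps):
    """Cost proportional to log(1/eps); constants are dropped (assumption)."""
    return math.log(1.0 / eps)

def shots_for_precision(eps, delta=0.05):
    """Hoeffding bound: number of repeated measurements needed to estimate one
    outcome probability to additive error eps with probability >= 1 - delta."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * eps ** 2))

for eps in (1e-1, 1e-2, 1e-3):
    print(f"eps={eps:g}: eps^-1 = {polynomial_error_cost(eps):.0f}, "
          f"log(1/eps) = {logarithmic_error_cost(eps):.2f}, "
          f"shots = {shots_for_precision(eps)}")
```

Tightening the target error by a factor of ten multiplies the polynomial cost by ten and the number of measurement shots by roughly one hundred, while the logarithmic cost only grows by an additive constant.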

We note that we mainly leverage these two sources of error in the following analysis, but the extension of this to further include noise in the computation is straightforward. Indeed, noise in the computation (a critical issue in the current generation of quantum computers) could immediately be taken into account by simply adding a linear factor in terms of the error decomposition that we will encounter in Eq. 3.22. This will be a further limiting factor for near term devices, as such errors need to be suppressed sufficiently in order to obtain good general bounds.

While previous research has already identified a number of caveats such as the data access [43] or restrictive structural properties of the input data, which limit the practicality of these algorithms, our insights are entirely based on statistical analysis.

As we will discuss in more detail in Chapter 4, under the assumption that we can efficiently sample rows and columns of the input matrix (i.e., the input data) according to a certain distribution, classical algorithms can be shown to be nearly as efficient as these quantum machine learning algorithms [17]. We note that the scaling of the quantum machine learning algorithms typically still achieves a high polynomial advantage compared to the classical ones.

This chapter is organised as follows. First, in Section 3.1, we will review existing results from statistical learning theory in order to allow the reader to follow the subsequent argument. Section 3.3.1 takes into account the error that is introduced by the algorithm itself. In Section 3.3.2, we then use this insight to bound the error that is induced through the sampling process. An additional dependency is typically given by the condition number of the problem, and we hence derive additional bounds for it in Section 3.3.3. Finally, in Section 3.4, we accumulate these insights and use them to analyse a range of existing supervised quantum machine learning algorithms.

3.1 Review of key results in Learning Theory

Statistical Learning Theory has the aim to statistically quantify the resources that are required to solve a learning problem [44]. Although multiple types of learning settings exist, depending on the access to data and the associated error of a prediction, here we primarily focus on supervised learning. In supervised learning, the goal is to find a function that fits a set of input-output training examples and, more importantly, also guarantees a good fit on data points that were not used during the training process. The ability to extrapolate to previously unobserved data points is also known as the generalisation ability of the model. This is indeed the major difference between machine learning and standard optimisation processes. Although this problem can be cast into the framework of optimisation (by optimising a certain problem instance), setting up this instance to achieve a maximum generalisation performance is indeed one of the main objectives of machine learning.

After the review of the classical part, i.e., the important points of consideration for any learning algorithm and the assumptions taken, we also analyse existing quantum algorithms and their computational speedups within the scope of statistical learning theory.

3.1.1 Supervised Learning

We now set the stage for the analysis by defining the framework of supervised learning. Let $X$ and $Y$ be probability spaces with distribution $\rho$ on $X \times Y$, from which we sample data points $x \in X$ and the corresponding labels $y \in Y$. We refer to $X$ and $Y$ as input set and output set respectively. Let $\ell : Y \times Y \to \mathbb{R}$ be the loss function measuring the discrepancy between any two points in the output space, which is a point-wise error measure. There exists a wide range of suitable loss functions, and choosing the appropriate one in practice is of great importance. Typical error functions are the least-squares error, $\ell_{sq}(f(x), y) := (f(x) - y)^2$ over $Y = \mathbb{R}$ for regression (generally for dense $Y$), or the 0-1 loss, $\ell_{0-1}(f(x), y) := 1 - \delta_{f(x),y}$ over $Y = \{-1, 1\}$ for classification (generally for discrete $Y$), which is zero for a correct prediction and one otherwise. The least squares loss will also be used frequently throughout this chapter. For any hypothesis space $\mathcal{H}$ of measurable functions $f : X \to Y$, $f \in \mathcal{H}$, the goal of supervised learning is then to minimise the expected risk or expected error $\mathcal{E}(f) := \mathbb{E}_\rho[\ell(f(x), y)]$, i.e.,

$$\inf_{f \in \mathcal{H}} \mathcal{E}(f), \qquad \mathcal{E}(f) = \int_{X \times Y} \ell(f(x), y)\, d\rho(x, y). \tag{3.1}$$

We hence want to minimise the expected prediction error for a hypothesis $f : X \to Y$, which is the average error with respect to the probability distribution $\rho$. The space of all functions for which the expected risk is well defined is called the target space, and is typically denoted by $\mathcal{F}$. If the loss function is measurable, then the target space is the space of all measurable functions. For many loss functions the infimum cannot be achieved in practice; however, it is still possible to derive a minimizer. In order to be able to efficiently find a solution to Eq. 3.1, rather than searching over the entirety of $\mathcal{F}$, we restrict the search to a hypothesis space $\mathcal{H}$, which can still be infinite. In summary, a learning problem is defined by the following three components:

1. A probability space X ×Y with a Borel probability measure ρ.

2. A measurable loss function $\ell : Y \times Y \to [0, \infty)$.

3. A hypothesis space $\mathcal{H}$ from which we choose our hypothesis $f \in \mathcal{H}$.

The data or input spaces $X$ can be vector spaces (linear spaces) such as $X = \mathbb{R}^d$, $d \in \mathbb{N}$, or structured spaces, and the output space $Y$ can also take a variety of forms, as we mentioned above. In practice, the underlying distribution $\rho$ is unknown and we can only access it through a finite number of observations. This finite set of samples, $S_n = \{(x_i, y_i)\}_{i=1}^n$, $x_i \in X$, $y_i \in Y$, is called the training set, and we generally assume that these are sampled identically and independently according to $\rho$. This holds in many practical cases, but it should be noted that it is not generally true. Indeed, for time series, for example, subsequent samples are typically highly correlated, and the following analysis hence does not immediately hold. The assumption of independence can, however, be relaxed to the assumption that the data points depend only slightly on each other via so-called mixing conditions. Under such assumptions, most of the following results for the independent case still hold with only slight adaptations. The results throughout this chapter rely on the following assumptions.

Assumption 1. The probability distribution on the data space $X \times Y$ can be factorized into a marginal distribution $\rho_X$ on $X$ and a conditional distribution $\rho(\cdot|x)$ on $Y$.

We add as a remark the observation that the probability distribution ρ can take into account a wide range of uncertainties in the data. The results therefore hold for a range of noise types or partial information.

Assumption 2. The probability distribution $\rho$ is known only through a finite set of samples $S_n = \{(x_i, y_i)\}_{i=1}^n$, $x_i \in X$, $y_i \in Y$, which are sampled i.i.d. according to the Borel probability measure $\rho$ on the data space $X \times Y$.

3.1.2 Empirical risk minimization and learning rates

Under the above assumptions, the goal of a learning algorithm is then to choose, based on the data set $S_n$, a suitable hypothesis $f_n : X \to Y$, $f_n \in \mathcal{H}$, that approximates the minimizer of the expected risk. Empirical Risk Minimization (ERM) approaches this problem by choosing a hypothesis that minimises the empirical risk,

$$\inf_{f \in \mathcal{H}} \mathcal{E}_n(f), \qquad \mathcal{E}_n(f) := \frac{1}{n} \sum_{(x_i, y_i) \in S_n} \ell(f(x_i), y_i), \quad \text{(Empirical risk)} \tag{3.2}$$

given the i.i.d. drawn data points $\{(x_i, y_i)\}_{i=1}^n \sim \rho^n$. Note that

$$\mathbb{E}_{\{(x_i, y_i)\}_{i=1}^n \sim \rho^n}\left[\mathcal{E}_n(f)\right] = \frac{1}{n} \sum_{i=1}^n \mathbb{E}_{(x_i, y_i) \sim \rho}\left[\ell(f(x_i), y_i)\right] = \mathcal{E}(f),$$

i.e., the expectation of the empirical risk is the expected risk, which implies that we can indeed use the empirical risk as a proxy for the expected risk in expectation.
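The statement that the empirical risk is an unbiased proxy for the expected risk of a fixed hypothesis can be illustrated by simulation. The following is a minimal sketch under assumptions that are not part of the text: a toy distribution with $x$ uniform on $[0, 1]$ and $y = 2x$ plus Gaussian noise, the fixed hypothesis $f(x) = 2x$, and the squared loss. All names are hypothetical.

```python
import random

def squared_loss(prediction, label):
    return (prediction - label) ** 2

def empirical_risk(f, samples, loss=squared_loss):
    """Average loss of hypothesis f over a list of (x, y) pairs, as in Eq. (3.2)."""
    return sum(loss(f(x), y) for x, y in samples) / len(samples)

def draw_samples(n, rng, noise=0.5):
    """Toy distribution (assumption): x ~ U[0, 1], y = 2x + Gaussian noise."""
    samples = []
    for _ in range(n):
        x = rng.random()
        samples.append((x, 2.0 * x + rng.gauss(0.0, noise)))
    return samples

def f(x):
    """Fixed hypothesis; its expected risk equals the noise variance 0.5**2 = 0.25."""
    return 2.0 * x

rng = random.Random(0)
# Average the empirical risk over many independent training sets of size 50.
risks = [empirical_risk(f, draw_samples(50, rng)) for _ in range(2000)]
mean_risk = sum(risks) / len(risks)
print(f"mean empirical risk: {mean_risk:.3f} (expected risk: 0.250)")
```

Individual training sets give empirical risks that scatter around 0.25, while their average over many training sets is close to the expected risk, in line with the identity above.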

While we would like to use the empirical risk as a proxy for the true risk, we also need to ensure that by minimising the empirical risk we actually find a valid solution to the underlying problem, i.e., that the $f_n$ that we find approaches the true solution $f_*$ which minimises the expected risk. This requirement is termed consistency. To define this mathematically, let

$$f_n := \arg\min_{f \in \mathcal{H}} \mathcal{E}_n(f) \tag{3.3}$$

be the minimizer of the empirical risk, which exists under weak assumptions on $\mathcal{H}$. Then, the overall goal of a learning algorithm is to minimise the excess risk,

$$\mathcal{E}(f_n) - \inf_{f \in \mathcal{H}} \mathcal{E}(f), \quad \text{(Excess risk)} \tag{3.4}$$

while ensuring that $f_n$ is consistent for a particular distribution $\rho$, i.e., that

$$\lim_{n \to \infty} \left( \mathcal{E}(f_n) - \inf_{f \in \mathcal{H}} \mathcal{E}(f) \right) = 0. \quad \text{(Consistency)} \tag{3.5}$$

Since $S_n$ is a randomly sampled subset, we can analyse this behaviour in expectation, i.e.,

$$\lim_{n \to \infty} \mathbb{E}\left[ \mathcal{E}(f_n) - \inf_{f \in \mathcal{H}} \mathcal{E}(f) \right] = 0, \tag{3.6}$$

or in probability, i.e.,

$$\lim_{n \to \infty} \Pr_{\rho^n}\left[ \mathcal{E}(f_n) - \inf_{f \in \mathcal{H}} \mathcal{E}(f) > \varepsilon \right] = 0 \tag{3.7}$$

for all $\varepsilon > 0$. If the above requirement holds for all distributions $\rho$ on the data space, then we say that the algorithm is universally consistent. However, consistency is not enough in practice, since the convergence of the risk of the empirical risk minimizer to the minimal risk could be impracticably slow. One of the most important questions in a learning setting is therefore how fast this convergence happens, which is defined through the so-called learning rate, i.e., the rate of decay of the excess risk. We assume that the excess risk decays polynomially in $n$, for example as

$$\mathbb{E}\left[ \mathcal{E}(f_n) - \inf_{f \in \mathcal{F}} \mathcal{E}(f) \right] = O(n^{-\alpha}). \tag{3.8}$$

This speed of course has practical relevance and hence allows us to compare different algorithms. The sample complexity $n$ must depend on the error we want to achieve, and in practice we can therefore define it as follows: for a distribution $\rho$ and for all $\delta, \varepsilon > 0$, there exists an $n(\delta, \varepsilon)$ such that

$$\Pr_{\rho^n}\left[ \mathcal{E}(f_{n(\varepsilon,\delta)}) - \inf_{f \in \mathcal{F}} \mathcal{E}(f) \le \varepsilon \right] \ge 1 - \delta. \tag{3.9}$$

This $n(\delta, \varepsilon)$ is called the sample complexity. One challenge of studying our algorithm's performance through the excess risk is that we can generally not assess it. We therefore need to find a way to estimate or bound the excess risk solely based on the hypothesis $f_n$ and the empirical risk $\mathcal{E}_n$. Let us for this assume the existence of a minimizer

$$f_* := \arg\min_{f \in \mathcal{H}} \mathcal{E}(f) \tag{3.10}$$

over a suitable hypothesis space $\mathcal{H}$.

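We can then decompose the excess risk as

$$\mathcal{E}(f_n) - \mathcal{E}(f_*) = \mathcal{E}(f_n) - \mathcal{E}_n(f_n) + \mathcal{E}_n(f_n) - \mathcal{E}_n(f_*) + \mathcal{E}_n(f_*) - \mathcal{E}(f_*). \tag{3.11}$$

Since $f_n$ minimises the empirical risk over $\mathcal{H}$, we have $\mathcal{E}_n(f_n) - \mathcal{E}_n(f_*) \le 0$. The two remaining differences are each at most $\sup_{f \in \mathcal{H}} |\mathcal{E}(f) - \mathcal{E}_n(f)|$, so we immediately see that we can bound this quantity as

$$\mathcal{E}(f_n) - \mathcal{E}(f_*) = \mathcal{E}(f_n) - \inf_{f \in \mathcal{H}} \mathcal{E}(f) \le 2 \sup_{f \in \mathcal{H}} |\mathcal{E}(f) - \mathcal{E}_n(f)|, \tag{3.12}$$

which implies that we can bound the error in terms of the so-called generalisation error $\mathcal{E}(f) - \mathcal{E}_n(f)$. We therefore see that we can study the convergence of the excess risk in terms of the generalisation error. Controlling the generalisation error is one of the main objectives of statistical learning theory. A fundamental result in statistical learning theory, which is often referred to as the fundamental theorem of statistical learning, is the following: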

Theorem 1 (Fundamental Theorem of Statistical Learning Theory [45, 46, 44]). Let $\mathcal{H}$ be a suitably chosen hypothesis space of functions $f : X \to Y$, let $X \times Y$ be a probability space with a Borel probability measure $\rho$ and a measurable loss function $\ell : Y \times Y \to [0, \infty)$, and let the risk and the empirical risk be defined as in Eq. 3.1 and Eq. 3.2 respectively. Then, for every $n \in \mathbb{N}$, $\delta \in (0, 1)$, and distribution $\rho$, with probability $1 - \delta$ it holds that

$$\sup_{f \in \mathcal{H}} |\mathcal{E}_n(f) - \mathcal{E}(f)| \le \Theta\left( \sqrt{\frac{c(\mathcal{H}) + \log(1/\delta)}{n}} \right), \tag{3.13}$$

where $c(\mathcal{H})$ is a measure of the complexity of $\mathcal{H}$ (such as the VC dimension, covering numbers, or the Rademacher complexity [47, 44]).
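To get a feeling for how the bound (3.13) translates into a sample size, the following sketch inverts it for $n$. It is only indicative: the constant hidden in the $\Theta(\cdot)$ notation is set to one, and the complexity value $c(\mathcal{H}) = 10$ is an arbitrary illustrative number. Both are assumptions, as are the function names.

```python
import math

def generalisation_bound(complexity, delta, n, constant=1.0):
    """Right-hand side of Eq. (3.13), with the hidden constant set to `constant`."""
    return constant * math.sqrt((complexity + math.log(1.0 / delta)) / n)

def samples_needed(complexity, delta, eps, constant=1.0):
    """Smallest n for which the bound above is at most eps."""
    return math.ceil(constant ** 2 * (complexity + math.log(1.0 / delta)) / eps ** 2)

n = samples_needed(complexity=10.0, delta=0.05, eps=0.1)
print(n, generalisation_bound(10.0, 0.05, n))
```

Because the bound decays as $n^{-1/2}$, halving the target generalisation error requires roughly four times as many samples.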

We note that lower bounds for the convergence of the empirical to the (expected) risk are much harder to obtain in general, and generally require fixing the underlying distribution of the data. For the following proof we indeed need a lower bound on this complexity, which is not generally available. However, for the specific problems we are treating here such lower bounds do indeed exist. For simplicity, we will rely on the bound given above, as the other factors that occur in these bounds only play a minor role here.

3.1.3 Regularisation and modern approaches

The complexity of the hypothesis space, $c(\mathcal{H})$ in Eq. 3.13, relates to the phenomenon of overfitting, where a large hypothesis space results in a low training error on the empirical risk, but performs poorly on the true risk. In the literature, this problem is addressed with so-called regularisation techniques, which are able to limit the size of the hypothesis space and thereby its complexity, in order to avoid overfitting the training dataset.

A number of different regularisation strategies have been proposed in the literature [45, 48, ?], including the well-established Tikhonov regularisation, which directly imposes constraints on the hypothesis class of candidate predictors. From a computational perspective, Tikhonov regularisation and other similar approaches compute a solution for the learning problem by optimising a constrained objective (i.e., the empirical risk with an additional regularisation term). The solution is obtained by a sequence of standard linear algebra operations such as matrix multiplication and inversion. Since the standard matrix inversion time is $O(n^3)$ for an $n \times n$ square matrix, most of the solutions, such as for GP or SVM, can be found in $O(n^3)$ computational time for $n$ data points. Notably, improvements to this runtime exist based on exploiting sparsity, or trading an approximation error against a lower computational time. Additionally, the time to solution typically depends on the conditioning of the matrix, and can therefore be lowered by using preconditioning methods.

Regularisation is today widely used, and has led to many popular machine learning algorithms such as Regularised Least Squares [47], Gaussian Process (GP) Regression and Classification [49], Logistic Regression [48], and Support Vector Machines (SVM) [45]. All the above mentioned algorithms fall under the same umbrella of kernel methods [50].

To further reduce the computational cost, modern methods leverage the fact that regularisation can be applied implicitly through incomplete optimisation or other forms of approximation. These ideas have been widely applied in the so-called early stopping approaches, which are today standard in practice. In early stopping, one only performs a limited number of steps of an iterative optimisation algorithm, typically in gradient based optimisation. It can be shown for convex functions that this process avoids overfitting the training set (i.e., maintains an optimal learning rate), while the computational time is drastically reduced. These implicit regularisation approaches hence achieve a lower number of required operations, while maintaining similar or the same generalisation performance as approaches such as Tikhonov regularisation [?], in some cases provably.
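Explicit (Tikhonov) and implicit (early stopping) regularisation can be compared on a tiny least squares problem. The sketch below is illustrative and rests on assumptions not made in the text: a hypothetical three-point data set in $\mathbb{R}^2$, the regularised objective $\frac{1}{n}\|Xw - y\|^2 + \lambda \|w\|^2$, and plain gradient descent started at $w = 0$ with a fixed step size.

```python
import math

# Toy data (assumption): three points in R^2 with noiseless labels y = 2*x1 - x2.
X = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
y = [2.0, -1.0, 1.0]
n = len(X)

def ridge(lam):
    """Tikhonov-regularised least squares: solve (X^T X + n*lam*I) w = X^T y for d = 2."""
    a = sum(x[0] * x[0] for x in X) + n * lam
    b = sum(x[0] * x[1] for x in X)
    d = sum(x[1] * x[1] for x in X) + n * lam
    r0 = sum(x[0] * t for x, t in zip(X, y))
    r1 = sum(x[1] * t for x, t in zip(X, y))
    det = a * d - b * b
    return ((d * r0 - b * r1) / det, (a * r1 - b * r0) / det)

def gradient_descent(steps, lr=0.1):
    """Gradient descent on the empirical squared loss, started at w = 0."""
    w = [0.0, 0.0]
    for _ in range(steps):
        residuals = [w[0] * x[0] + w[1] * x[1] - t for x, t in zip(X, y)]
        grad = [2.0 / n * sum(r * x[j] for r, x in zip(residuals, X))
                for j in range(2)]
        w = [w[j] - lr * grad[j] for j in range(2)]
    return tuple(w)

def norm(w):
    return math.hypot(w[0], w[1])

w_exact = ridge(0.0)              # unregularised solution
w_ridge = ridge(0.1)              # explicit (Tikhonov) regularisation
w_early = gradient_descent(5)     # implicit regularisation by stopping early
w_late = gradient_descent(2000)   # run essentially to convergence
print(w_exact, w_ridge, w_early, w_late)
```

Both the Tikhonov solution and the early-stopped iterate have a smaller norm than the unregularised solution, which is one way to see that stopping early restricts the effective hypothesis space in a manner similar to an explicit penalty.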

Other approaches include the divide and conquer [51] approach or Monte-Carlo sampling (also called sub-sampling) approaches. While the former is based on the idea of distributing partitions of the initial training data, training different predictors on the smaller problem instances, and then combining individual predictors into a joint one, the latter achieves a form of dimensionality reduction through sampling a subset of the data in a specific manner. The most well-known sub-sampling methods are random features [52] and so-called Nyström approaches [53, 54].

In both cases, computation benefits from parallelisation and the reduced dimension of the datasets while similarly maintaining statistical guarantees (e.g.,[55]).

For all the above mentioned training methods, the computational times can typically be reduced from the $O(n^3)$ of standard approaches to $\tilde{O}(n^2)$ or $\tilde{O}(\mathrm{nnz})$, where $\mathrm{nnz}$ is the number of non-zero entries in the input data (matrix), while maintaining optimal statistical performance.

While standard regularisation approaches, such as regularised least squares, can trivially be integrated into quantum algorithms, certain methods appear not to work in the quantum algorithms toolbox.

For example, preconditioning, as a tool to reduce the condition number and make the computation more efficient, appears not to have an efficient solution yet in the quantum setting [12]. Therefore, more research is required to give a full picture of the power and limitations of algorithms with respect to all parameters. Here we offer a brief discussion of the possible effects of inverting badly conditioned matrices and how typical cases could affect the computational complexity, i.e., the asymptotic runtime of the algorithm.

3.2 Review of supervised quantum machine learning algorithms

The majority of proposed supervised quantum machine learning algorithms are based on fast linear algebra operations. Indeed, most quantum machine learning algorithms that claim an exponential improvement over classical counterparts are based on a fast quantum algorithm for solving linear systems of equations [38, 39, 14, 40, 41, 37, 42]. This widely used subroutine is the HHL algorithm, a quantum linear system solver [36] (QLSA), which is named after its inventors Harrow, Hassidim, and Lloyd. The HHL algorithm takes as input the normalised state $|b\rangle \in \mathbb{R}^n$ and an $s(A)$-sparse matrix $A \in \mathbb{R}^{n \times n}$, with spectral norm $\|A\| \le 1$ and condition number $\kappa = \kappa(A)$, and returns as output a quantum state $|\tilde{w}\rangle$ which encodes an approximation of the normalised solution $|w\rangle = |A^{-1}b\rangle \in \mathbb{R}^n$ of the linear system $Aw = b$ such that

$$\| |\tilde{w}\rangle - |w\rangle \| \le \gamma, \tag{3.14}$$

for error parameter $\gamma$. Note that above we assumed that the matrix is invertible. However, the algorithm can in practice apply the Moore-Penrose inverse (also known as pseudoinverse), which for $A \in \mathbb{R}^{n \times m}$ with full column rank is given by $A^+ := (A^H A)^{-1} A^H$. Using the singular value decomposition $A = U \Sigma V^H$, we hence have

$$A^+ = (V \Sigma^2 V^H)^{-1} V \Sigma U^H = V \Sigma^{-1} U^H, \tag{3.15}$$

such that $A^+ A = V V^H = I_m$.

The currently best known quantum linear systems algorithm, in terms of computational complexity, runs in

$$O(\|A\|_F\, \kappa\, \mathrm{polylog}(\kappa, n, 1/\gamma)) \tag{3.16}$$

time [37], where $\|A\|_F \le \sqrt{n}\,\|A\|_2 \le \sqrt{n}$ is the Frobenius norm of $A$ and $\kappa$ its condition number. As we will discuss in more detail in Chapter 4 on randomised numerical linear algebra, such computations can also be done exponentially faster compared to known classical algorithms using classical randomised methods in combination with a quantum-inspired memory structure, by taking advantage of the aforementioned Monte-Carlo sampling methods [56, 17, 18, 19, 21], if the data matrix $A$ is of low rank. We note that for full-rank matrices an advantage is, however, still possible.

Before analysing the supervised machine learning algorithms with the above discussed knowledge from statistical learning theory, we will first recapitulate the quantum least squares algorithm (linear regression) and the quantum support vector machine (SVM). Throughout this chapter, we use the least squares problem as a prototypical case to study the behaviour of QML algorithms, but the results extend trivially to the quantum SVM and many other algorithms. In Section 3.4 we also summarise the computational complexities taking into account the statistical guarantees for all other algorithms and hence give explicit bounds.

3.2.1 Recap: Quantum Linear Regression and Least Squares

The least squares algorithm minimises the empirical risk with respect to the quadratic loss

$$\ell_{LS}(f(x), y) = (f(x) - y)^2, \tag{3.17}$$

for the hypothesis class of linear functions

$$\mathcal{H} := \{ f : X \to Y \mid \exists w \in \mathbb{R}^d : f(x) = w^H x \}, \tag{3.18}$$

with input space $\mathbb{R}^d$ and output space $\mathbb{R}$. Given input-output samples $\{(x_i, y_i)\}_{i=1}^n$, where $x_i \in \mathbb{R}^d$ and $y_i \in \mathbb{R}$, the empirical risk is therefore given by

$$\mathcal{E}_n(f) := \frac{1}{n} \sum_{i=1}^n \left( w^H x_i - y_i \right)^2. \tag{3.19}$$

The least squares problem seeks to find a vector $w$ such that

$$w = \arg\min_{w \in \mathbb{R}^d} \|y - Xw\|_2^2, \tag{3.20}$$

where $y \in \mathbb{R}^n$ and $X \in \mathbb{R}^{n \times d}$. The closed form solution of this problem is then given by $w = (X^H X)^{-1} X^H y$, and we can hence reformulate this again into a linear systems problem of the form $Aw = b$, where $A = X^H X$ and $b = X^H y$. We then obtain the solution by solving the linear system, $w = A^{-1} b$.
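The reduction to a linear system can be followed on a small worked example. The data set below is hypothetical and chosen so that all quantities can be computed by hand.

```latex
X = \begin{pmatrix} 1 & 0 \\ 0 & 1 \\ 1 & 1 \end{pmatrix}, \qquad
y = \begin{pmatrix} 2 \\ -1 \\ 1 \end{pmatrix}, \qquad
A = X^H X = \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix}, \qquad
b = X^H y = \begin{pmatrix} 3 \\ 0 \end{pmatrix} .

w = A^{-1} b
  = \frac{1}{3} \begin{pmatrix} 2 & -1 \\ -1 & 2 \end{pmatrix} \begin{pmatrix} 3 \\ 0 \end{pmatrix}
  = \begin{pmatrix} 2 \\ -1 \end{pmatrix}, \qquad
Xw = y .

\text{The eigenvalues of } A \text{ are } 3 \text{ and } 1, \text{ so the condition number of } X^H X \text{ is } 3,
\text{ and that of } X \text{ is } \sqrt{3} .
```

The last line illustrates why the conditioning of the linear system is governed by the square of the condition number of the data matrix.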

Several quantum algorithms for linear regression, and in particular least squares, have been proposed [38, 39, 40, 41, 37, 42], which all result in a similar scaling (taking into account the most recent subroutines). In the following analysis we therefore use the best known result for the quantum machine learning algorithm. All approaches have in common that they make use of the quantum linear system algorithm to convert the state $|\xi\rangle = |X^H y\rangle$ into the solution $|\tilde{w}\rangle = |(X^H X)^{-1} \xi\rangle$. The fastest known algorithm [37] solves the (regularised) least squares or linear regression problem in time

$$O(\|A\|_F\, \kappa\, \mathrm{polylog}(n, \kappa, 1/\gamma)),$$

where $\kappa^2$ is the condition number of $X^H X$ or $X X^H$ respectively, and $\gamma > 0$ is an error parameter for the approximation accuracy. Notably, this runtime excludes a physical measurement of the resulting vector $|\tilde{w}\rangle$, since this would immediately imply a complexity of

$$O(\|A\|_F\, \kappa / \gamma\; \mathrm{polylog}(n, \kappa, 1/\gamma)).$$

In that sense, the algorithm solves the classical least squares problem only to a certain extent, as the solution is not accessible in that time. Indeed, it prepares a quantum state $|\tilde{w}\rangle$ which is $\gamma$-close to $|w\rangle$, i.e.,

$$\| |\tilde{w}\rangle - |w\rangle \| \le \gamma,$$

and in order to recover it, we would need to take up to $O(n \log(n))$ samples. In the current form, we can however immediately observe that the Frobenius norm dependency implies that the algorithm is efficient if $X$ is low-rank (but not necessarily sparse). As for all of the supervised quantum machine learning algorithms for classical input data, the quantum least squares solver requires a quantum-accessible data structure, such as a qRAM.

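Notably, it is assumed that $\frac{1}{\kappa^2} \le X^H X \le 1$, i.e., that the eigenvalues of $X^H X$ lie in $[1/\kappa^2, 1]$. The output of the algorithm is then a quantum state $|\tilde{w}\rangle$ such that $\| |\tilde{w}\rangle - |w\rangle \| \le \gamma$, where $|w\rangle$ is the true solution.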

We note that other linear regression algorithms based on sample-based Hamiltonian simulation are possible [13, 57], which result in different requirements. Indeed, for these algorithms we need to repeatedly prepare a density matrix, which is a normalised version of the input data matrix. While this algorithm has generally worse dependencies on the error $\gamma$, it is independent of the Frobenius norm [39]. The computational complexity in this case is

$$O(\kappa^2 \gamma^{-3}\, \mathrm{polylog}(n)),$$

which can likely be improved to $O(\kappa \gamma^{-3}\, \mathrm{polylog}(n, \kappa, \gamma^{-1}))$.

However, since our analysis will show that a higher polynomial dependency incurs a worse runtime once we take statistical guarantees into account, in the following we will use the algorithm which has the lowest dependency. Notably, the error dependency is in either case polynomial if we require a classical solution to be output.

As a further remark, other linear regression or least squares quantum algorithms exist [40, 42], but we will not include these here as our results can easily be extended to these as well.

Next, we will also recapitulate the quantum support vector machine.

3.2.2 Recap: Quantum Support Vector Machine

The second prototypical quantum machine learning algorithm which we want to recapitulate is the quantum least-squares support vector machine [14] (qSVM). As we will see, the procedure for the qSVM is similar to the quantum least squares approach, and therefore results in very similar runtimes. The qSVM algorithm is calculating the optimal separating hyperplane by solving again a linear system of n d equations. For n points Sn = {(xi,yi)}i=1 with xi ∈ R ,yi = {±1}, and again assum- ing that we can efficiently prepare states corresponding to the data vectors, then the least-squares formulation of the solution is given by the linear system of the form

     H 0 ~1 w0 0    =  , (3.21) ~1 K + δ −1I w y

H H where Ki j = xi x j (or Ki j = φ(xi) φ(x j) respectively for a non-linear features) is the H kernel matrix, y = (y1,...,yn) , ~1 is the all-ones vector, and δ is a user specified parameter. We note that certain authors argue that a least square support vector ma- chine is not truly a support vector machine, and their practical use highly restricted. The additional row and column in the matrix on the left hand side arise because of a H H H n non-zero offset. Notably, w x + w0 > 1 or w x + w0 < −1, with w x = ∑ j=1 w jx j determines the hyperplanes. The solution is hence obtained by solving the linear systems using the HHL algorithm based on the density matrix exponentiation [13] method previously mentioned. The only adaptation which is necessary is to use the normalised Kernel Kˆ = K/tr(K). However, since the smallest eigenvalues of Kˆ will be of O(1/n) due to the normalisation, the quantum SVM algorithm truncates the eigenvalues which are below a certain threshold δK, s.t., δK ≤ |λi| ≤ 1, which results in an effective condition number κe f f = 1/δK, thereby effectively implementing a form of spectral filtering. The runtime of the quantum support vector machine is given by

\[
O\left(\kappa_{eff}^3 \, \gamma^{-3} \, \mathrm{polylog}(nd, \kappa, 1/\gamma)\right),
\]

and the algorithm outputs a state $|\tilde{w}_n\rangle$ that approximates the solution $w_n := [w_0, w]^H$, such that $\| |\tilde{w}_n\rangle - |w_n\rangle \| \leq \gamma$. As for the least squares algorithm, we cannot retrieve the parameters without an overhead, and the quantum SVM therefore needs to perform immediate classification.
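As a purely classical point of reference, the linear system in Eq. 3.21 can be assembled and solved directly. The following is a minimal sketch, not the quantum algorithm: the function names `solve_ls_svm` and `predict`, the toy data, and the value of `delta` are illustrative assumptions, and the prediction rule assumes the standard LS-SVM convention in which the vector $w$ of Eq. 3.21 weights the kernel evaluations against the training points.

```python
import numpy as np

def solve_ls_svm(X, y, delta=10.0):
    """Solve the (n+1) x (n+1) system of Eq. 3.21 for a linear kernel."""
    n = X.shape[0]
    K = X @ X.T                          # linear kernel K_ij = x_i^T x_j
    F = np.zeros((n + 1, n + 1))
    F[0, 1:] = 1.0                       # first row: (0, 1^T)
    F[1:, 0] = 1.0                       # first column: (0, 1)^T
    F[1:, 1:] = K + np.eye(n) / delta    # block K + delta^{-1} I
    rhs = np.concatenate(([0.0], y))     # right-hand side (0, y)
    solution = np.linalg.solve(F, rhs)
    return solution[0], solution[1:]     # offset w_0 and weight vector w

def predict(X_train, w0, w, x_new):
    """Classify x_new by the sign of sum_j w_j <x_j, x_new> + w_0 (assumed convention)."""
    return np.sign(w @ (X_train @ x_new) + w0)

# Hypothetical toy data: two classes on opposite sides of the origin.
X_train = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -3.0]])
y_train = np.array([1.0, 1.0, -1.0, -1.0])
w0, w = solve_ls_svm(X_train, y_train, delta=10.0)
label_pos = predict(X_train, w0, w, np.array([1.0, 1.0]))
label_neg = predict(X_train, w0, w, np.array([-1.0, -1.0]))
```

The first row of the system enforces that the entries of $w$ sum to zero, which is why the offset $w_0$ requires the additional row and column.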

3.3 Analysis of quantum machine learning algorithms

The quantum algorithms we analyse throughout this chapter rely on a range of parameters, which include the input dimension $n$, which corresponds to the number of data points in a sample (the dimension of the individual data points is typically small, so we focus on $n$), the error $\gamma$ of the algorithm with respect to the final prediction, and the condition number $\kappa$ of the input data matrix. Our main objective is to understand the performance of these algorithms if we want to achieve an overall generalisation error of $\Theta(n^{-1/2})$.

To start, we first need to return to our previous assessment of the risk, and use in the following a standard error decomposition. Let $f$ be a hypothesis, and let $\mathcal{F}$ be the space of all measurable functions $f : X \to Y$. We denote by $\mathcal{E}^* := \inf_{f \in \mathcal{F}} \mathcal{E}(f)$ the Bayes risk, and want to limit the distance $\mathcal{E}(f) - \mathcal{E}^*$.

Let now $\mathcal{E}_{\mathcal{H}} := \inf_{f \in \mathcal{H}} \mathcal{E}(f)$, i.e., the best risk attainable by any function in the hypothesis space $\mathcal{H}$, where we assume in the following for simplicity that $\mathcal{E}_{\mathcal{H}}$ always admits a minimizer $f_{\mathcal{H}} \in \mathcal{H}$. Note that it is possible to remove this assumption by leveraging regularisation. We then decompose the error as:

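\begin{align}
\mathcal{E}(f) - \mathcal{E}^* &= \underbrace{\mathcal{E}(f) - \mathcal{E}(\hat{f})}_{\text{Optimisation error}} + \underbrace{\mathcal{E}(\hat{f}) - \mathcal{E}_{\mathcal{H}}}_{\text{Estimation error}} + \underbrace{\mathcal{E}_{\mathcal{H}} - \mathcal{E}^*}_{\text{Irreducible error}} \tag{3.22} \\
&= \xi + \Theta(1/\sqrt{n}) + \mu. \tag{3.23}
\end{align}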

The first term in Eq. 3.22 is the so-called optimisation error, which indicates how good the optimisation procedure that generates $f$ is with respect to the actual minimum (infimum) of the empirical risk. This error stems from the approximations an algorithm typically makes, and relates to the $\gamma$ in previous sections. The optimisation error can result from a variety of approximations, such as a finite number of steps in an iterative optimisation process or a sample error introduced through a non-deterministic process. This error is discussed in detail in Section 3.3.1. The second term is the estimation error, which is due to taking the empirical risk as a proxy for the true risk by using samples from the distribution $\rho$. This can be bounded by the generalisation bound we discussed in Eq. 3.13. The last term is the irreducible error, which measures how well the hypothesis space describes the problem. If the true solution is not in our hypothesis space, there will always be an irreducible error, which we denote by $\mu$. If $\mu = 0$, i.e., the irreducible error is zero, then we call $\mathcal{H}$ universal. For simplicity, we assume here that $\mu = 0$, as this does not significantly affect the results of this chapter.

From the error decomposition in Eq. 3.22 we see that in order to achieve the best possible generalisation error overall, we need to make sure that the different error contributions are of the same order. In particular, we need to ensure that the optimisation error matches the scaling of the estimation error. For most known classical algorithms, with the exception of, e.g., Monte Carlo algorithms, the runtime depends on the optimisation error $\varepsilon$ only as $O(\log(1/\varepsilon))$, so matching the bounds is usually trivial. However, many quantum algorithms, including some of the quantum linear regression and least squares algorithms we discussed in the previous section (e.g. [14, 39]), have a polynomial dependency on the optimisation error. In the next section we discuss the implications of matching the bounds, and how they affect the algorithms' computational complexity.

Notably, other quantum algorithms, such as [37], have only a polylogarithmic error dependency, and therefore the error matching does not impose any critical slowdown. In these cases, however, we will see that the quantum algorithms still cannot achieve a polylogarithmic runtime in the dimension of the training set, due to the error resulting from the finite sampling process that is required to extract a classical output from a given quantum state.

Finally, to take into account all dependencies of the quantum algorithms, we also analyse the condition number. Here, we show that with high probability the condition number has a polynomial dependency on the number of samples in the training set as well, which therefore indicates a certain scaling of the computational complexity.

In the following, we carry out the analysis for the least squares case as an example, and then summarise the resulting computational complexities of a range of supervised quantum machine learning algorithms next to the classical ones in Fig. 3.1.

3.3.1 Bound on the optimisation error

As previously mentioned, we will use the quantum least squares algorithms [38, 39, 40] as an example case to demonstrate how the matching of the error affects the algorithm. The results we obtain can easily be generalised to other algorithms and instances, and in particular hold for all kernel methods. To remain general, we carry out the analysis for a generic algorithm with computational complexity
\[
\Omega\left(n^{\alpha} \gamma^{-\beta} \kappa^{c} \log(n)\right). \tag{3.24}
\]

We show that in order to have a total error that scales as $n^{-1/2}$, the quantum algorithm will pick up a polynomial $n$-dependency.

The known quantum least squares algorithms have a $\gamma$ error guarantee for the final output state $|\tilde{w}\rangle$, i.e., $\| |w\rangle - |\tilde{w}\rangle \|_2 \leq \gamma$, where $|w\rangle$ is the true solution. The computational complexity (ignoring all but the error dependency) is, for all algorithms, of the form $O(\gamma^{-\beta})$ for some $\beta$, for example [39] with $\beta = 3$, or [58] with $\beta = 4$.

Since the quantum algorithms require the input data matrix to be either Hermitian or encoded in a larger Hermitian matrix, the dimensionality of the overall matrix is $n + d$ for $n$ data points in $\mathbb{R}^d$. For simplicity, we here assume that the input matrix is an $n \times n$ Hermitian matrix, and neglect this step. In order to achieve the best possible generalisation error, as discussed previously, we want to match the errors of the incomplete optimisation to the statistical ones. In the least squares setting, $\tilde{w} = w_{\gamma,n}$ is the output of the algorithm corresponding to the optimal parameters fitted to the $S_n$ data points, which exhibits at most a $\gamma$-error. Therefore, $\tilde{w}$ corresponds to the estimator $f_{\gamma,n}$ in the previous notation that we saw in Eq. 3.22 (and Eq. 3.12).

Concretely, we can see that the total error of an estimator $f_{\gamma,n}$ on $n$ data points with precision $\gamma$ is given by

\begin{align}
\mathcal{E}(f_{\gamma,n}) - \mathcal{E}(f_n)
&= \underbrace{\mathcal{E}(f_{\gamma,n}) - \mathcal{E}_n(f_{\gamma,n})}_{\text{generalisation error}} + \underbrace{\mathcal{E}_n(f_{\gamma,n}) - \mathcal{E}_n(f_n)}_{\text{optimisation error}} + \underbrace{\mathcal{E}_n(f_n) - \mathcal{E}(f_n)}_{\text{generalisation error}} \nonumber \\
&= \Theta(n^{-1/2}) + \underbrace{\mathcal{E}_n(f_{\gamma,n}) - \mathcal{E}_n(f_n)}_{\text{optimisation error}}, \tag{3.25}
\end{align}

where the first contribution is a result of Eq. 3.13, i.e., the generalisation performance, and the second comes from the error of the quantum algorithm, which we will show next. In order to achieve the best statistical performance, i.e., the lowest generalisation error, the algorithmic error must scale at worst as the worst statistical error. We will next show that the optimisation error of a quantum algorithm in terms of the prediction results in a $\gamma$ error, which is inherited from the weights $|\tilde{w}\rangle$ that the quantum algorithm produces. Recall that in least squares, classification is performed via the inner product, i.e.,

\[
y_{pred} := \tilde{w}^H x, \tag{3.26}
\]

for model $\tilde{w}$ and data point $x$, which corresponds to $f_{\gamma,n}(x)$ in the general notation.

The empirical risk of the estimator $f_{\gamma,n}$ is then

\[
\mathcal{E}_n(f_{\gamma,n}) = \frac{1}{n} \sum_{i=1}^{n} \left(\tilde{w}^H x_i - y_i\right)^2. \tag{3.27}
\]

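Therefore, assuming the output of the quantum algorithm is a state $|\tilde{w}\rangle$, while the exact minimizer of the empirical risk is $w$, such that $\| |\tilde{w}\rangle - |w\rangle \|_2 \leq \gamma$, and assuming that $|X|$ and $|Y|$ are bounded, we find that

\begin{align}
|\mathcal{E}_n(f_{\gamma,n}) - \mathcal{E}_n(f_n)|
&\leq \frac{1}{n} \sum_{i=1}^{n} \left| \left(\tilde{w}^H x_i - y_i\right)^2 - \left(w^H x_i - y_i\right)^2 \right| \nonumber \\
&\leq \frac{1}{n} \sum_{i=1}^{n} L \left| (\tilde{w} - w)^H x_i \right| \nonumber \\
&\leq \frac{1}{n} \sum_{i=1}^{n} L \, \|\tilde{w} - w\|_2 \, \|x_i\|_2 \leq k \cdot \gamma = O(\gamma), \tag{3.28}
\end{align}

where $k > 0$ is a constant, and we used the Cauchy-Schwarz inequality and the fact that for the least-squares loss it holds that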

\[
|\ell^{LS}(f(x_i), y_i) - \ell^{LS}(f(x_j), y_j)| \leq L \, |(f(x_i) - y_i) - (f(x_j) - y_j)|, \tag{3.29}
\]

since $|X|$ and $|Y|$ are bounded.
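The chain of inequalities in Eq. 3.28 can be checked numerically on synthetic data. The sketch below is an illustration under stated assumptions: the data are drawn uniformly from a bounded domain, the exact minimizer is obtained with an ordinary least-squares solver, and the Lipschitz factor is taken pointwise as $|a_i + b_i|$ from the identity $a^2 - b^2 = (a+b)(a-b)$. The function name `risk_gap_and_bound` and all numerical values are hypothetical.

```python
import numpy as np

def risk_gap_and_bound(w, w_tilde, X, y):
    """Return |E_n(w_tilde) - E_n(w)| and the Cauchy-Schwarz bound of Eq. 3.28."""
    a = X @ w_tilde - y                  # residuals of the perturbed weights
    b = X @ w - y                        # residuals of the exact minimizer
    gap = abs(np.mean(a ** 2) - np.mean(b ** 2))
    # |a^2 - b^2| = |a + b| |a - b| and |a - b| <= ||w_tilde - w|| ||x_i||
    lipschitz = np.abs(a + b)
    row_norms = np.linalg.norm(X, axis=1)
    bound = np.mean(lipschitz * np.linalg.norm(w_tilde - w) * row_norms)
    return gap, bound

rng = np.random.default_rng(0)
n, d, gamma = 200, 5, 1e-2
X = rng.uniform(-1.0, 1.0, size=(n, d))          # bounded inputs
y = rng.choice([-1.0, 1.0], size=n)              # bounded labels
w = np.linalg.lstsq(X, y, rcond=None)[0]         # exact empirical risk minimizer
direction = rng.normal(size=d)
w_tilde = w + gamma * direction / np.linalg.norm(direction)   # ||w_tilde - w|| = gamma
gap, bound = risk_gap_and_bound(w, w_tilde, X, y)
```

In this sketch `bound` is linear in `gamma` by construction, so the risk gap is at most $O(\gamma)$, as stated in Eq. 3.28.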

A few remarks. In the learning setting the number of samples is fixed and hence cannot be altered, i.e., the statistical error (generalisation error) is fixed to $\Theta(1/\sqrt{n})$, and the larger $n$ is, the better the guarantees we are able to obtain for future tasks. Therefore, it is important to understand how we can reduce the other error contributions in Eq. 3.25 in order to guarantee the lowest possible overall error, i.e., the highest accuracy.

To do so, we match the error bounds of the two contributions, so that the overall performance of the algorithm is maximised, which means that the optimisation error should not surpass the statistical error. We hence set $\gamma = n^{-1/2}$, and see that the overall scaling of the algorithm will need to be of the order $O(n^{\beta/2})$, ignoring again all other contributions. To take a concrete case, for the algorithm in [39] the overall runtime is then at least of order $n^{3/2}$. The overall complexity of the algorithm then has the form
\[
\Omega\left(n^{\alpha} n^{\beta/2} \log(n) \kappa^{c}\right) \tag{3.30}
\]
for some constants $c, \beta, \alpha$.

This straightforward argument can easily be generalised to arbitrary kernels by replacing the input data $x$ with feature vectors $\phi(x)$, where $\phi(\cdot)$ is a chosen feature map.

We have so far only spoken about algorithms which naturally have a polynomial $1/\gamma$-dependency. However, as we previously mentioned, not all quantum algorithms have such an error dependency. For algorithms which only depend polylogarithmically on $1/\gamma$, the quantum mechanical nature will nevertheless incur another polynomial $n$-dependency, as we will see next.

3.3.2 Bounds on the sampling error

So far we have ignored any error introduced by the measurement process. However, we will always need to compute a classical estimate of the output of the quantum algorithm, which is based on a repeated sampling of the output state. As this is an inherent process which we will need to perform for any quantum algorithm, the following analysis applies to any QML algorithm with classical output. Since we estimate the result by repeatedly measuring the final state of our quantum computation in a chosen basis, our resulting estimate for the desired output is a random variable. It is well known from the central limit theorem that the sampling error for such a random variable scales as $O(1/\sqrt{m})$, where $m$ is the number of independent measurements. This is known as the standard quantum limit or the so-called shot-noise limit. Using so-called quantum metrology it is sometimes possible to overcome this limit and obtain an error that scales with $1/m$. This, however, poses the ultimate limit to measurement precision, which is a direct consequence of the Heisenberg uncertainty principle [6, 7].
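The $O(1/\sqrt{m})$ shot-noise scaling can be illustrated with a classical simulation of repeated measurements. This is a minimal sketch, assuming each measurement is an independent Bernoulli trial with a fixed outcome probability `p`; the function name `rms_error`, the value of `p`, and the trial counts are illustrative choices.

```python
import numpy as np

def rms_error(p, m, trials, rng):
    """Root-mean-square error of estimating p from m independent measurements."""
    estimates = rng.binomial(m, p, size=trials) / m   # one estimate per trial
    return np.sqrt(np.mean((estimates - p) ** 2))

rng = np.random.default_rng(1)
p = 0.3
err_small = rms_error(p, 100, 20000, rng)      # m = 100 measurements
err_large = rms_error(p, 10000, 20000, rng)    # m = 10000 measurements
ratio = err_small / err_large                  # close to sqrt(10000 / 100) = 10
```

Increasing the number of measurements by a factor of 100 reduces the error only by a factor of about 10, which is the behaviour that forces $m$ to grow polynomially with $n$ in the argument that follows.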

Therefore, any output of the quantum algorithm will have a measurement error $\tau$. Let us turn back to our least squares quantum algorithm. It produces a state $|\tilde{w}\rangle$ which is an approximation to the true solution $|w\rangle$. Using techniques such as quantum state tomography we can produce a classical estimate $\hat{w}$ of the vector $\tilde{w}$ with accuracy

\[
\|\tilde{w} - \hat{w}\|_2 \leq \tau = \Omega(1/m), \tag{3.31}
\]

where $m$ is the number of measurements performed. If $y = w^H x$ is the error-free (ideal) prediction, then we can hence only produce an approximation $\hat{y} = \hat{w}^H x$, such that

\begin{align}
|y - \hat{y}| &= |w^H x - \hat{w}^H x| \tag{3.32} \\
&\leq \|(w - \tilde{w}) + (\tilde{w} - \hat{w})\| \, \|x\| \tag{3.33} \\
&\leq (\gamma + \tau) \, \|x\|, \tag{3.34}
\end{align}

using again the Cauchy-Schwarz inequality. Similar to the previous approach, we need to make sure that the contribution coming from the measurement error scales at most as the worst possible generalisation error, and hence set $\tau = n^{-1/2}$. From this, we immediately see that any quantum machine learning algorithm which is to reach optimal generalisation performance will require $m = \Omega(n^{1/2})$ repetitions, which is hence a lower bound for all supervised quantum machine learning algorithms. For algorithms which do not take advantage of forms of advanced quantum metrology, this might even be $\Omega(n)$. Putting things together, we therefore have a scaling of any QML algorithm of

\[
\Omega\left(n^{\alpha + (1+\beta)/2} \log(n) \kappa^{c}\right), \tag{3.35}
\]

which for the state-of-the-art quantum algorithm for quantum least squares [37] results in

\[
\Omega\left(n^{2} \kappa \log(n)\right). \tag{3.36}
\]
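The bookkeeping behind Eqs. 3.30 and 3.35 amounts to adding exponents of $n$. The helper below is a small sketch of that arithmetic only: the function name `n_exponent` is hypothetical, and the example values of `alpha` and `beta` are assumptions chosen to match the case $\beta = 3$ discussed above, not properties asserted for any specific algorithm.

```python
from fractions import Fraction

def n_exponent(alpha, beta, sampling=True):
    """Exponent of n after setting gamma = tau = n^{-1/2}.

    The optimisation error contributes beta/2 (Eq. 3.30); measuring with
    m = n^{1/2} repetitions contributes a further 1/2 (Eq. 3.35).
    Logarithmic factors and the condition number are ignored.
    """
    exponent = Fraction(alpha) + Fraction(beta) / 2
    if sampling:
        exponent += Fraction(1, 2)
    return exponent

# Assumed example: alpha = 0 and a cubic error dependency beta = 3.
optimisation_only = n_exponent(0, 3, sampling=False)   # Fraction(3, 2)
with_sampling = n_exponent(0, 3, sampling=True)        # Fraction(2, 1)
```

With these assumed values the optimisation error alone gives the $n^{3/2}$ scaling quoted for $\beta = 3$, and the measurement overhead raises the exponent by a further one half.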

In order to determine the overall complexity, we hence only have one parameter left: κ. However, already now we observe that the computational complexity is similar or even worse compared to the best classical machine learning algorithms.

3.3.3 Bounds on the condition number

In the following we analyse the last remaining dependency of the quantum algorithms: the condition number. Let the condition number dependency of the QML algorithm again be given by $\kappa^c$ for some constant $c \in \mathbb{R}^+$. Note that the best known result has a $c = 1$ dependency, ignoring logarithmic dependencies. We can think of the following three scenarios for the condition number.

1. Best case scenario: In the best case setting, the condition number is one or sufficiently close to one. This is the lower bound and can only ever happen if the data is full rank and all the eigenvalues are of very similar size, i.e., $\lambda_i \approx \lambda_j$ for all $i, j$. However, for such cases, it would be questionable whether a machine learning algorithm would be useful, since this would imply that the data lacks any strong signal. In these cases the quantum machine learning algorithms could be very fast and might give a quantum advantage if the $n$-scaling due to the error dependency is not too high.

2. Worst case scenario: On the other extreme, the condition number could be infinite, as could be the case for very badly conditioned matrices with smallest eigenvalues approaching 0. This can be the case if we have one or a few strong signals (i.e., eigenvalues which are closer to 1), and a small additional noise which results in the smallest eigenvalues being close to 0. Such ill-conditioned systems do indeed occur in practice, but can generally be dealt with by using spectral filtering or preconditioning methods, as for example discussed in [59]. Indeed, the quantum SVM [14] or the HHL algorithm [36] do or can readily make use of such methods. Concretely, they only invert eigenvalues which are above a certain threshold. This hence gives a new, effective condition number $\kappa_{eff} = \sigma_{max}/\sigma_{threshold} \leq 1/\sigma_{threshold}$, which is typically much smaller than the actual $\kappa$, and makes the algorithms practically useful. However, it should be noted that quantum algorithms which perform such steps need to be compared against corresponding classical methods. Note that such truncations (filters) typically introduce an error, which then needs to be taken into account separately. Having covered these two extreme scenarios, we can now focus on a typical case.

3. A plausible case: While the second case will appear in practice, these bounds give little insight into the actual performance of the quantum machine learning algorithms, since we cannot infer any scaling of $\kappa$ from them. However, for kernel-based methods, we can derive a plausible case which can give us some intuition of how bad the $\kappa$-scaling typically can be. We will in the following show that with high probability, the condition number for a kernel method can have a certain $n$-dependency. Even though this result gives only a bound in probability, it is a plausible case with concrete $n$-dependency (the dimension of the input matrix) rather than the absolute worst case of $\kappa = \infty$, which gives an impractical upper bound. As a consequence, a quantum kernel method which scales as $O(\kappa^3)$ could pick up a factor of $O(n\sqrt{n})$ in the worst case, which has the same complexity as the classical state of the art.
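The effect of the eigenvalue threshold described in the second scenario can be sketched classically. The example below is an illustration under assumptions: the matrix is constructed with a hand-picked spectrum containing one eigenvalue close to zero, the threshold value is arbitrary, and `condition_numbers` is a hypothetical helper, not part of any quantum algorithm's specification.

```python
import numpy as np

def condition_numbers(K, threshold):
    """Return (kappa, kappa_eff) for the trace-normalised kernel K / tr(K)."""
    K_hat = K / np.trace(K)                      # normalised kernel
    eigvals = np.linalg.eigvalsh(K_hat)          # ascending order
    kept = eigvals[eigvals >= threshold]         # spectral filter: drop small eigenvalues
    kappa = eigvals[-1] / eigvals[0]
    kappa_eff = kept[-1] / kept[0]
    return kappa, kappa_eff

# Hypothetical positive definite matrix with one eigenvalue close to zero.
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.normal(size=(4, 4)))     # random orthogonal basis
spectrum = np.array([1.0, 0.5, 0.25, 1e-8])
K = (Q * spectrum) @ Q.T                         # K = Q diag(spectrum) Q^T
threshold = 0.1
kappa, kappa_eff = condition_numbers(K, threshold)
```

In this example the unfiltered condition number is of order $10^8$, whereas the filtered one equals the ratio of the largest to the smallest retained eigenvalue and is bounded by `1 / threshold`.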

In the following, we prove a lower bound for the condition number of a covariance or kernel matrix, assuming that we have at least one strong signal in the data. The high-level idea of this proof is that the sample covariance should be close to the true covariance with increasing number of samples, which we can show using concentration of measure. Next we use that the eigenvalues of the true covariance are known to decay, as their sum constitutes a converging series. This means that the $k$-th eigenvalue of the true covariance must have an upper bound on its size which is related to $k$. Since we also know that the eigenvalues of the two matrices will be close, and assuming that we have a few strong signals (i.e., $O(1)$ large eigenvalues), we can then bound the condition number as the ratio of the largest over the smallest eigenvalue.

For the following analysis, we will first need to recapitulate some well-known results about Mercer kernels, which can be found, e.g., in [47]. If $f \in L^2_\nu(X)$ is a function in the Hilbert space of square integrable functions on $X$ with Borel measure $\nu$, and $\{\phi_1, \phi_2, \ldots\}$ is a Hilbert basis of $L^2_\nu(X)$, then $f$ can be uniquely written as $f = \sum_{k=1}^{\infty} a_k \phi_k$, and the partial sums $\sum_{k=1}^{N} a_k \phi_k$ converge to $f$ in $L^2_\nu(X)$. If this convergence holds in $C(X)$, the space of continuous functions on $X$, we say that the series converges uniformly to $f$. If furthermore $\sum_k |a_k|$ converges, then we say that the series $\sum_k a_k$ converges absolutely. Let now $K : X \times X \to \mathbb{R}$ be a continuous function. Then the linear map

\[
L_K : L^2_\nu(X) \to C(X)
\]

given by the following integral transform

\[
(L_K f)(x) = \int K(x, x') f(x') \, d\nu(x')
\]

is well defined. It is well known that the integral operator and the kernel have the following relationship:

Theorem 2 ([47, Theorem 1, p. 34]; first proven in [60]). Let $X$ be a compact domain or a manifold, $\nu$ be a Borel measure on $X$, and $K : X \times X \to \mathbb{R}$ a Mercer kernel. Let $\lambda_k$ be the $k$-th eigenvalue of $L_K$ and $\{\phi_k\}_{k \geq 1}$ be the corresponding eigenvectors. Then, we have for all $x, x' \in X$

\[
K(x, x') = \sum_{k=1}^{\infty} \lambda_k \phi_k(x) \phi_k(x'),
\]

where the convergence is absolute (for each $(x, x') \in X \times X$) and uniform (on $X \times X$).

Note that the kernel here takes the form $K = \Phi_\infty \Phi_\infty^H$, which for the linear kernel has the form $K = XX^H$. Furthermore, the kernel matrix must have a similar spectrum to the empirical or sample kernel matrix $K_n = X_n X_n^H$, and indeed $K_n \to K$ as $n \to \infty$, where we use here the definition $X_n \in \mathbb{R}^{d \times n}$, i.e., we have $n$ vectors of dimension $d$ and therefore $X = \mathbb{R}^d$. The function $K$ is said to be the kernel of $L_K$, and several properties of $L_K$ follow from the properties of $K$. Since we want to understand the condition number of $K_n$, the sample covariance matrix, we need to study the behaviour of its eigenvalues. For this, we start by studying the eigenvalues of $K$. First, from Theorem 2 the next corollary follows.

Corollary 3 ([47], Corollary 3). The sum $\sum_k \lambda_k$ is convergent, and

\[
\sum_{k=1}^{\infty} \lambda_k = \int_X K(x, x) \, d\nu(x) \leq \nu(X) C_K, \tag{3.37}
\]

where $C_K = \sup_{x, x' \in X} |K(x, x')|$ is an upper bound on the kernel. Therefore, for all $k \geq 1$, $\lambda_k \leq \frac{\nu(X) C_K}{k}$.
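A finite-dimensional analogue of Corollary 3 can be checked numerically: for a positive semi-definite matrix with eigenvalues sorted in decreasing order, $k \lambda_k \leq \sum_{j \leq k} \lambda_j \leq \mathrm{tr}(K)$, so $\lambda_k \leq \mathrm{tr}(K)/k$. The sketch below uses a Gaussian kernel on random points purely as an illustrative assumption, with the trace playing the role of $\nu(X) C_K$; the helper name `eigenvalue_decay_holds` is hypothetical.

```python
import numpy as np

def eigenvalue_decay_holds(K, tol=1e-10):
    """Check lambda_k <= tr(K) / k for the eigenvalues of a PSD matrix K."""
    eigvals = np.sort(np.linalg.eigvalsh(K))[::-1]   # descending order
    k = np.arange(1, eigvals.size + 1)
    return bool(np.all(eigvals <= np.trace(K) / k + tol))

# Hypothetical example: Gaussian kernel matrix on 50 random points in [-1, 1]^3.
rng = np.random.default_rng(2)
points = rng.uniform(-1.0, 1.0, size=(50, 3))
sq_dists = np.sum((points[:, None, :] - points[None, :, :]) ** 2, axis=-1)
K = np.exp(-sq_dists)
decay_ok = eigenvalue_decay_holds(K)
```

The check passes for any positive semi-definite matrix, which mirrors the statement that the eigenvalues of $L_K$ are bounded by a constant divided by $k$.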

As we see from Corollary 3, the eigenvalue $\lambda_k$ (or singular value, since $K$ is SPSD) of $L_K$ cannot decrease slower than $O(1/k)$, since for a convergent series of real non-negative numbers $\sum_k \alpha_k$ it must hold that $\alpha_k$ goes to zero faster than $1/k$. Recalling that $K$ is the infinite version of the kernel matrix $K_n = X_n X_n^H$ (or generally $K = \Phi \Phi^H$ for arbitrary kernels), we now need to relate the finite-sized kernel $K_n$ to the kernel $K$. Leveraging concentration inequalities for random matrices, we will now show how $K_n \in \mathbb{R}^{d \times d}$ for $X_n \in \mathbb{R}^{d \times n}$ converges to $K$ as $n$ grows, so that the spectra (i.e., the eigenvalues) must match, which implies the decay of the eigenvalues of $K_n$. Indeed, we will see that the smallest eigenvalue satisfies $\lambda_n = O(1/n)$ with high probability. From this we immediately obtain lower bounds on the condition number with high probability. This is summarised below.

Theorem 4. The condition number of a Mercer kernel $K$ for a finite number of samples $n$ is with high probability lower bounded by $\Omega(\sqrt{n})$.

Proof. We will in the following need some auxiliary results.

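Theorem 5 (Matrix Bernstein [61]). Consider a finite sequence $\{X_k\}$ of independent, centered, random, Hermitian $d$-dimensional matrices, and assume that $\mathbb{E} X_k = 0$ and $\|X_k\| \leq R$ for all $k \in [n]$. Let $X := \sum_k X_k$ and $\mathbb{E} X = \sum_k \mathbb{E} X_k$, and let

\begin{align}
\sigma(X) &= \max\left\{ \left\|\mathbb{E}[X X^H]\right\|, \left\|\mathbb{E}[X^H X]\right\| \right\} \nonumber \\
&= \max\left\{ \left\|\sum_{k=1}^{n} \mathbb{E}[X_k X_k^H]\right\|, \left\|\sum_{k=1}^{n} \mathbb{E}[X_k^H X_k]\right\| \right\}. \tag{3.38}
\end{align}

Then,

\[
\Pr[\|X\| \geq \varepsilon] \leq 2d \exp\left( \frac{-\varepsilon^2/2}{\sigma(X) + R\varepsilon/3} \right), \quad \forall \varepsilon \geq 0. \tag{3.39}
\]

We can make use of this result to straightforwardly bound the largest eigenvalue of the sample covariance matrix, which is a well-known result in the random matrix literature. As outlined above, the sample covariance matrix is given by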

\[
K_n = \frac{1}{n} \sum_{k=1}^{n} x_k x_k^H, \tag{3.40}
\]

for $n$ centred (zero mean) samples in $\mathbb{R}^d$, and $K := \mathbb{E}\left[x x^H\right] \in \mathbb{R}^{d \times d}$. In the following we look at the matrix $A := K_n - K$, and assume

\[
\|x_i\|_2^2 \leq r, \quad \forall i \in [n], \tag{3.41}
\]

i.e., the sample norm is bounded by some constant $r$. Typically data is sparse, and the norm is hence independent of the dimension, although both scenarios are possible. Here we assume a dependency on $d$. Under this assumption, we let

\[
A_k := \frac{1}{n} \left( x_k x_k^H - K \right), \tag{3.42}
\]

for each $k$, and hence $A = \sum_{k=1}^{n} A_k$. With the assumption in Eq. 3.41 we then obtain

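\[
\|K\| = \left\|\mathbb{E}\left[x x^H\right]\right\| \leq \mathbb{E}\left\|x x^H\right\| = \mathbb{E}\left[\|x\|^2\right] \leq r, \tag{3.43}
\]

using Jensen's inequality. The matrix variance statistic $\sigma(A)$ is therefore given by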

\[
0 \leq \sigma(A) \leq \frac{R}{n} \|C_\infty\| \leq \frac{r^2}{n}, \tag{3.44}
\]

which follows from a straightforward calculation. We now take into account that $\|A_k\| \leq \frac{r}{n}$, invoke Thm. 5, and use the above bounds. Assuming $r = C \cdot d$ for some constant $C$, i.e., the norm of the vector depends on the data dimension $d$ (which might not be the case for sparse data!), we hence obtain that

\[
\mathbb{E}[\|A\|] = \mathbb{E}[\|K_n - K\|] \leq O\left( \frac{d}{\sqrt{n}} + \frac{d}{n} \right) = O\left( \frac{d}{\sqrt{n}} \right). \tag{3.45}
\]

Note that this is essentially $O(\sqrt{1/n})$ for $n$ the number of samples, assuming $n \gg d$, and similarly for sparse data with sparsity $s = O(1)$ the norm will not be proportional to $d$.
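The $1/\sqrt{n}$ concentration of the sample covariance in Eq. 3.45 can be observed empirically. The following sketch assumes samples drawn uniformly from $[-1, 1]^d$, for which the true covariance is $K = I/3$ and $\|x\|_2^2 \leq d$, consistent with the assumption $r = C \cdot d$; the function name `covariance_error` and the sample sizes are illustrative.

```python
import numpy as np

def covariance_error(n, d, rng):
    """Spectral-norm error ||K_n - K|| for x uniform on [-1, 1]^d, where K = I / 3."""
    X = rng.uniform(-1.0, 1.0, size=(n, d))   # n centred, bounded samples
    K_n = X.T @ X / n                         # sample covariance (Eq. 3.40)
    K = np.eye(d) / 3.0                       # true covariance of the uniform distribution
    return np.linalg.norm(K_n - K, ord=2)

rng = np.random.default_rng(3)
d = 5
err_small = np.mean([covariance_error(100, d, rng) for _ in range(50)])
err_large = np.mean([covariance_error(10000, d, rng) for _ in range(50)])
ratio = err_small / err_large                 # roughly sqrt(10000 / 100) = 10
```

Averaged over repetitions, a hundredfold increase in the number of samples reduces the error by roughly a factor of ten.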

We next need to relate this to the eigenvalues $\lambda_{min}$ and $\lambda_{max}$, for which we will need the following lemma.

Lemma 3. For any two bounded functions $f, g$, it holds that

\[
\left| \inf_{x \in X} f(x) - \inf_{x \in X} g(x) \right| \leq \sup_{x \in X} |f(x) - g(x)|. \tag{3.46}
\]

Proof. First we show that $|\sup_X f - \sup_X g| \leq \sup_X |f - g|$. For this, take

\[
\sup_X f = \sup_X (f - g + g) \leq \sup_X (f - g) + \sup_X g \leq \sup_X |f - g| + \sup_X g,
\]

and

\[
\sup_X g = \sup_X (g - f + f) \leq \sup_X (g - f) + \sup_X f \leq \sup_X |g - f| + \sup_X f,
\]

and the result follows. The lemma then follows by replacing $f$ with $-f$ and $g$ with $-g$, and using that $\inf_X f = -\sup_X (-f)$.

Using that with high probability $\|K_n - K\| \leq O(d/\sqrt{n} + d/n)$, we therefore have

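\begin{align}
|\lambda_{max}(K_n) - \lambda_{max}(K)| &\leq O\left(d/\sqrt{n}\right), \text{ and} \tag{3.47} \\
|\lambda_{min}(K_n) - \lambda_{min}(K)| &\leq O\left(d/\sqrt{n}\right), \tag{3.48}
\end{align}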

which follows from Lemma 3 by taking $f(x) = (x^H K_n x)/(x^H x)$ and $g(x) = (x^H K x)/(x^H x)$. Therefore, ignoring the data dimension dependency, and recalling that the kernel matrices are positive semi-definite, we obtain that with high probability

\begin{align}
\kappa(K_n) = \frac{\lambda_{max}(K_n)}{\lambda_{min}(K_n)} &\geq \frac{\|K_n\|}{\lambda_{min}(K) + O(1/\sqrt{n})} \tag{3.49} \\
&= \frac{\|K_n\|}{O(1/n) + O(1/\sqrt{n})} = \Omega\left(\|K_n\| \sqrt{n}\right), \tag{3.50}
\end{align}

where we used the bounds on the smallest eigenvalue in Corollary 3. Therefore, with high probability the condition number of the problem is at least of order $\|K_n\| \sqrt{n}$, and if we assume $\|K_n\| = 1$ this is $\Omega(\sqrt{n})$ for any kernel method.

We can hence assume that an additional $\sqrt{n}$ dependency from $\kappa$ can appear as a plausible case. If we additionally take into account the other sources of error, which we discussed before, typical QML algorithms result in runtimes which are significantly worse than those of their classical counterparts. We hence learn from this analysis that if we desire practically relevant QML algorithms with provable guarantees, we either need to reduce the condition number dependency (e.g., through preconditioning or filtering), or apply the quantum algorithm only to data sets for which we can guarantee that $\kappa$ is small enough. Since some quantum algorithms allow for spectral filtering methods, and therefore limit the condition number by $\kappa_{eff}$, we will not include the condition number scaling in the following analysis. We leave it open to the reader to apply this scaling to algorithms which exhibit a high $\kappa$-dependency. Notably, this would furthermore require a more involved analysis of the error, since such a truncation immediately imposes an error: e.g., $\|A - F(A)\|_2^2$ for Hermitian $A = U \Sigma U^H$ and filter $F$ could result in an error $\|\Sigma - F(\Sigma)\|_2^2$. If the filter in its simplest form cuts off the eigenvalues below $\sigma_{eff}$, then the final error would be $\sigma_{eff}$, and it would then be a matter of the propagation of this error through the algorithm.
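The truncation error mentioned at the end of the paragraph above can be made concrete with a simple hard-threshold filter. This is a minimal sketch under assumptions: the filter sets all eigenvalues with magnitude below `sigma_eff` to zero, the test matrix has a hand-picked spectrum, and `truncate_spectrum` is a hypothetical helper name.

```python
import numpy as np

def truncate_spectrum(A, sigma_eff):
    """Apply a hard-threshold filter F: eigenvalues with |lambda| < sigma_eff are set to zero."""
    eigvals, U = np.linalg.eigh(A)                               # A = U diag(eigvals) U^H
    filtered = np.where(np.abs(eigvals) >= sigma_eff, eigvals, 0.0)
    return (U * filtered) @ U.conj().T                           # F(A) = U F(Sigma) U^H

# Hypothetical Hermitian matrix with known spectrum.
rng = np.random.default_rng(5)
Q, _ = np.linalg.qr(rng.normal(size=(4, 4)))
spectrum = np.array([1.0, 0.5, 0.05, 0.01])
A = (Q * spectrum) @ Q.T
sigma_eff = 0.1
truncation_error = np.linalg.norm(A - truncate_spectrum(A, sigma_eff), ord=2)
```

In this example the spectral-norm error equals the largest discarded eigenvalue, 0.05, and is therefore bounded by `sigma_eff`.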

3.4 Analysis of supervised QML algorithms

Our analysis is based entirely on the dependency of the statistical guarantees of the estimator on the size of the data set. We build on the impacts on the algorithmic error and the measurement-based error discussed above, as well as on the previously derived results in statistical learning theory. In particular, by using that the accuracy parameter of a supervised learning problem scales inverse polynomially with the number of samples which are used for the training of the algorithm, we showed that the errors of quantum algorithms will result in poly($n$) scalings. The runtimes for a range of quantum algorithms which we now derive based on these requirements indicate that these algorithms can therefore not achieve exponential speedups over their classical counterparts.

We note that this does not rule out exponential advantages for learning problems where no efficient classical algorithms are known, as there exist learning problems for which quantum algorithms have a superpolynomial advantage [62, 63]. One nice feature of our results is that they are independent of the model of access to the training data, which means that the results hold even if debated access models such as quantum random access memory are used. Finally, we note that our results do not assume any prior knowledge on the function to be learned, which allows us to make statements on virtually every possible learning algorithm, including neural networks. Under stronger assumptions (e.g., more knowledge of the target function) the dependency of the accuracy in terms of samples can be derived.

We summarise the results of our analysis in Fig. 3.1. We omit the $\kappa$ dependency, which would generally decrease the performance of the quantum algorithms further. Notably, while this is classically not an issue due to preconditioning, no efficient general quantum preconditioning algorithm exists. We note that [11] introduced a SPAI preconditioner, however without providing an efficient quantum implementation for its construction and without any performance analysis. We additionally note that recently [64] proposed a different mechanism for constructing efficient preconditioners, called fast inversion. The main idea is based on the fact that the inversion of 1-sparse matrices can be done efficiently on a quantum computer. The algorithm works for matrices of the form $A + B$, where $\|B\| = \|A^{-1}\| = \|(A + B)^{-1}\| = O(1)$, and $A$ can be inverted fast. It results in a condition number of the QLSA of $\kappa(M(A + B)) = \kappa(I + A^{-1}B)$ once the preconditioner $M = A^{-1}$ is applied.
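The effect of the preconditioner $M = A^{-1}$ on the condition number can be illustrated classically. The sketch below is not the fast inversion algorithm of [64]; it only evaluates $\kappa(A + B)$ and $\kappa(I + A^{-1}B)$ for an assumed diagonal (1-sparse) $A$ with $\|A^{-1}\| = 1$ and a random symmetric $B$ rescaled to $\|B\| = 0.5$. All sizes and values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 50
A = np.diag(np.linspace(1.0, 1000.0, n))      # diagonal, hence 1-sparse; ||A^{-1}|| = 1
B = rng.normal(size=(n, n))
B = (B + B.T) / 2                             # symmetric perturbation
B *= 0.5 / np.linalg.norm(B, ord=2)           # rescale so that ||B|| = 0.5

M = np.diag(1.0 / np.diag(A))                 # preconditioner M = A^{-1}
kappa_raw = np.linalg.cond(A + B)             # condition number without preconditioning
kappa_pre = np.linalg.cond(M @ (A + B))       # condition number of I + A^{-1} B
```

Since $\|A^{-1}B\| \leq 0.5$ here, the singular values of $I + A^{-1}B$ lie in $[0.5, 1.5]$, so the preconditioned condition number is at most 3, while the unpreconditioned one is in the hundreds.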

3.5 Conclusion

Quantum machine learning algorithms promise to be exponentially faster than their classical counterparts. In this chapter, we showed, by relying on standard results from statistical learning theory, that such claims are not well founded, and thereby ruled out QML algorithms with polylogarithmic time complexity in the input dimensions. As all practically used machine learning algorithms have polynomial runtimes, our results effectively rule out the possibility of exponential advantages for supervised quantum machine learning. Although this holds for polynomial-runtime classical algorithms, we note that our analysis does not rule out an exponential advantage over classical algorithms with superpolynomial runtime. Furthermore, since we do not make any assumptions on the hypothesis space $\mathcal{H}$ of the learning problem, we note that generally faster error rates are possible if more prior knowledge exists. It is hence possible to obtain faster convergence rates than $1/\sqrt{n}$, which would imply a potential quantum advantage for such problems. As future directions, it is worth mentioning that it may be possible to strengthen our results by analysing the $n$ dependency of the condition number. Previous results in this direction are discussed in [47, 60].

Algorithm                      Train time       Test time
Classical
  LS-SVM / KRR                 $n^3$            $n$
  KRR [?, 65, 66, 67, 68]      $n^2$            $n$
  Divide and conquer [51]      $n^2$            $n$
  Nyström [54, 55]             $n^2$            $\sqrt{n}$
  FALKON [10]                  $n\sqrt{n}$      $\sqrt{n}$
Quantum
  QKLS / QKLR [37]             $\sqrt{n}$       $n\sqrt{n}$
  QSVM [14]                    $n\sqrt{n}$      $n^2\sqrt{n}$

Figure 3.1: Summary of time complexities for training and testing of different classical and quantum algorithms when statistical guarantees are taken into account. We omit polylog($n$, $d$) dependencies for the quantum algorithms. We assume $\varepsilon = \Theta(1/\sqrt{n})$ and count the effects of measurement errors. The acronyms in the table refer to: least squares support vector machines (LS-SVM), kernel ridge regression (KRR), quantum kernel least squares (QKLS), quantum kernel linear regression (QKLR), and quantum support vector machines (QSVM). Note that for quantum algorithms the state obtained after training cannot be maintained or copied and the algorithm must be retrained after each test round. This brings a factor proportional to the train time in the test time of quantum algorithms. Because the condition number may also depend on $n$ and for quantum algorithms this dependency may be worse, the overall scaling of the quantum algorithms may be slower than that of the classical ones.

Chapter 4

Randomised Numerical Linear Algebra

The research question we answer in this chapter is the following:

Research Question 2 (QML under the lens of RandNLA). Under the assumption of efficient sampling processes for the data for both classical and quantum algorithms, what is the comparative advantage of the latter?

We will show how the requirement of a fast memory model can hide much of the computational power of an algorithm. In particular, by allowing a classical algorithm to sample according to a certain probability distribution, we can derive classical algorithms with computational complexities which are independent of the input dimensions, and therefore only polynomially slower compared to the best known quantum algorithm for the same task. We use such Monte Carlo algorithms to construct fast algorithms for Hamiltonian simulation, and also connect our research to the recent so-called quantum-inspired or dequantisation results [17]. We start by defining a range of memory models, which are used in quantum algorithms, randomised numerical linear algebra, and quantum-inspired algorithms. We continue by introducing the main ideas of Monte Carlo methods for numerical linear algebra, and, as an exemplary case study, the randomised matrix multiplication algorithm, to grasp the main ideas of such approximate methods. For a detailed introduction and overview, we refer the reader to the review of David Woodruff [69].

We then show that such a memory structure immediately leads to faster simulation algorithms for dense Hamiltonians in Sec. 4.4.4. In the next step, we use these ideas in Sec. 4.4.5 to construct symmetric matrices which approximate the Hamiltonian, and use these to perform fast classical Hamiltonian simulation. We do so by first finding a randomised low-rank approximation $H_k$ of the Hamiltonian $H$, and then applying a form of approximation to the series expansion of the time evolution operator $\exp(-iH_k t)$. We thereby show that the ability to sample efficiently from such matrices immediately allows for faster classical algorithms as well.

Finally, we briefly discuss how our results relate to the recent stream of dequantisation results, and show that we can indeed achieve exponentially faster algorithms if we make similar assumptions.

4.1 Introduction

In general, there exist two different approaches to performing approximate numerical linear algebra operations, and a closer inspection shows that there are many parallels between them. The first stream is based on random sampling, also called sub-sampling, while the second stream is based on so-called random projections. Random projections are themselves based on the Johnson-Lindenstrauss transformation, which allows us to embed vectors in a lower-dimensional subspace while preserving certain distance metrics. Roughly speaking, random projections correspond to uniform sampling in randomly rotated spaces. The main differences between the approaches are the resulting error bounds, where the best known ones are multiplicative relative-error bounds. Take as an example the problem of a randomised low-rank approximation based on the randomised projector $\hat{P}_k$, which projects into some rank-k subspace of a matrix, and denote by $P_k$ the projector into the subspace containing the best rank-k approximation of A (e.g., the space spanned by the left singular vectors belonging to the top k singular values of A). We then define additive-error bounds to be of the form

\[ \|A - \hat{P}_k A\|_F \le \|A - P_k A\|_F + \varepsilon \|A\|_F, \]

since we have an additional error term on top of the error of the best rank-k approximation of A. These bounds are typically weaker than the so-called multiplicative-error bounds, which take the form

\[ \|A - \hat{P}_k A\| \le f(m,n,k,\ldots)\, \|A - P_k A\|, \]

where $f(\cdot)$ denotes a function depending on the dimensions m and n of the matrix $A \in \mathbb{R}^{m\times n}$, the rank, or other parameters. The best known bounds on f are independent of m and n, and are referred to as constant-factor bounds. The strongest of these bounds are given for $f = 1 + \varepsilon$, for some error parameter $\varepsilon$, i.e.,

\[ \|A - \hat{P}_k A\| \le (1 + \varepsilon)\, \|A - P_k A\|. \]

4.2 Memory models and memory access

In computer science, many different models for data access and storage are used. Here we briefly recapitulate a number of memory models which are common in randomised numerical linear algebra, quantum algorithms, and quantum-inspired (or dequantised) algorithms. We introduce the different memory structures and define their properties. Next, we introduce the basic ideas behind randomised numerical linear algebra using the example of matrix multiplication.

4.2.1 The pass efficient model

Traditionally, in randomised numerical linear algebra, the so-called pass-efficient model was used to describe memory access. In this model, the only access an algorithm has to the data is via a so-called pass, which is a sequential read of the entire input data set. An algorithm is then called pass-efficient if it uses only a small or constant number of passes over the data, while it may additionally use RAM space and computation that are sublinear in the size of the data stream to compute the solution. The data storage can take several forms. For example, one could store only the index-data pairs as a sparse, unordered representation of the data, i.e., $((i,j), A_{ij})$ for all non-zero entries of the matrix A. In practice, such storage is implemented, for example, in the Intel MKL compressed sparse row (CSR) format, which is specified by four arrays (values, columns, pointer 1, and pointer 2).

Stronger results, however, can sometimes be obtained in different input models, yet these memory models are generally inaccessible and hence should be used with care. In order to efficiently sample from the data set using this memory model, we need to be able to select random samples in a pass-efficient manner. To do so, we can rely on the so-called SELECT algorithm, which is presented below.

Algorithm 1 The SELECT Algorithm [70]

INPUT: $\{a_1,\ldots,a_n\}$, $a_i \ge 0$, read in one sequential pass over the data.
OUTPUT: $i^*$, $a_{i^*}$.

1: $D = 0$
2: for $i = 1$ to $n$ do
3:     $D = D + a_i$
4:     With probability $a_i/D$, let $i^* = i$ and let $a_{i^*} = a_i$.
5: end for
6: return $i^*$, $a_{i^*}$.

The SELECT algorithm then has the following properties.

Lemma 4 ([71]). Suppose that $\{a_1,\ldots,a_n\}$, $a_i \ge 0$, are read in one pass. Then the SELECT algorithm returns the index $i^*$ with probability $P[i^* = i] = a_i / \sum_{j=1}^n a_j$, and requires O(1) additional storage space.

Proof. The proof is by induction. For the base case, the first element $a_1$ with $i^* = 1$ is selected with probability $a_1/a_1 = 1$. For the induction step, let $D_k = \sum_{j=1}^k a_j$, i.e., the first k elements have been read, and the algorithm reads the element $k+1$. By the induction hypothesis, the probability of having selected any prior index $i \in [k]$ is $P[i^* = i] = a_i/D_k$. The algorithm then selects the index $i^* = k+1$ with probability $a_{k+1}/D_{k+1}$, and retains the previous selection otherwise. For $i \in [k]$ we therefore have

\[ P[i^* = i] = P(\text{entry was selected prior}) \cdot P(\text{current entry was not selected}) = \frac{a_i}{D_k}\left(1 - \frac{a_{k+1}}{D_{k+1}}\right) = \frac{a_i}{D_{k+1}}, \]

where the last step uses $1 - a_{k+1}/D_{k+1} = D_k/D_{k+1}$. By induction this result holds for all k, and in particular for $k + 1 = n$. The storage space is limited to the space for keeping track of the sum, which is O(1). This concludes the proof.

It is important to note that we can therefore use the SELECT algorithm to perform importance sampling according to distributions over the rows $A_i$ or columns $A^i$ of a matrix A, for example if we want to sample according to the probabilities

\[ P[i^* = i] = \|A_i\|_2^2 / \|A\|_F^2. \]

We will use this sampling scheme to demonstrate how we can already obtain very fast algorithms in randomised linear algebra. In particular, we will use it to derive one of the most fundamental randomised algorithms in Section 4.3: the basic matrix multiplication algorithm.
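As an illustration, the SELECT algorithm and the row-norm sampling above can be sketched in a few lines of Python. This is a minimal sketch under our own assumptions: the function names `select` and `sample_row` are hypothetical, indices are 0-based rather than 1-based, and the matrix is given as a list of rows.

```python
import random
from collections import Counter


def select(stream, rng=random):
    """One pass over non-negative weights; returns (i*, a_{i*}) with P[i* = i] = a_i / sum_j a_j."""
    total = 0.0                      # running sum D, the only state besides the current choice
    chosen_index, chosen_value = None, None
    for i, a in enumerate(stream):
        if a < 0:
            raise ValueError("weights must be non-negative")
        total += a
        # Replace the current choice with probability a / D (zero weights are never picked).
        if a > 0 and rng.random() < a / total:
            chosen_index, chosen_value = i, a
    return chosen_index, chosen_value


def sample_row(matrix, rng=random):
    """Row index i drawn with probability ||A_i||_2^2 / ||A||_F^2 in one pass over the rows."""
    squared_norms = (sum(abs(x) ** 2 for x in row) for row in matrix)  # lazy: one sequential read
    return select(squared_norms, rng)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    counts = Counter(select([1.0, 0.0, 3.0], rng)[0] for _ in range(20000))
    # Empirical frequencies should be close to the weights 1/4, 0 and 3/4.
    print({i: counts[i] / 20000 for i in range(3)})
```

Only the running sum and the current choice are kept between elements, which is the O(1) additional storage claimed in Lemma 4.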

4.2.2 Quantum random access memory

A classical random-access memory (RAM) is a device that stores the content of a memory location in a memory array. Importantly, a random-access memory allows data items to be read or written in almost the same amount of time irrespective of the physical location of the data inside the memory. In practice this is achieved through a binary tree, which allows the bus to reach any of the N memory elements in log(N) computational steps by traversing the tree. Quantum random access memory (qRAM) similarly allows us to access and load data in superposition from all the memory sites. A qRAM with n-bit addresses can therefore access $2^n$ memory sites and hence requires $O(2^n)$ two-bit logic gates. Note that, with abuse of previous notation, we therefore denote the dimension of a Hermitian matrix in this chapter by N, unlike in previous chapters. We do this to comply with the standard notation in the quantum computing literature. Using, e.g., the so-called bucket brigade architecture, the number of physical interactions during each qRAM call can then in principle be reduced to O(n) [6, 7], which is hence polylogarithmic in the input dimension, assuming a data array of size $N = 2^n$.

A variety of different qRAM architectures have been proposed, and we will here mainly focus on the specific architecture introduced by Kerenidis and Prakash [16]. We will use this architecture later on in the context of Hamiltonian simulation for dense Hamiltonians on a quantum computer. In the next subsection, we will also show that such a memory model has a powerful classical counterpart which can be used to construct much faster randomised algorithms for numerical linear algebra. In the following, we introduce an adaptation of the architecture proposed in [16] which is suitable for the application that we mainly study in this chapter, namely Hamiltonian simulation. Notably, while most structures store the squares of the entries (e.g., of the matrix, or of the data in general), the one presented here stores only the absolute values.

Definition 2 (Quantum Data Structure). Let $H \in \mathbb{C}^{N\times N}$ be a Hermitian matrix (where $N = 2^n$), $\|H\|_1$ being the maximum absolute row-sum norm, and $\sigma_j := \sum_k |H_{jk}|$. Each entry $H_{jk}$ is represented with b bits of precision. Define D as an array of N binary trees $D_j$ for $j \in \{0,\ldots,N-1\}$. Each $D_j$ corresponds to the row $H_j$, and its organization is specified by the following rules.

1. The leaf node k of the tree $D_j$ stores the value $H_{jk}^*$ (see footnote 1) corresponding to the index-entry pair $(j,k,H_{jk})$.

2. For the level immediately above the bottom level, i.e., the leaves, and any node level above the leaves, the data stored is determined as follows: suppose the node has two children storing data a and b respectively (note that a and b are complex numbers). Then the entry that is stored in this node is given by $(|a| + |b|)$.

We show an example of the above data structure in Fig. 4.1. As we immediately see from Definition 2, each binary tree $D_j$ in the data structure contains a real number in each internal (non-leaf) node, while the leaf nodes store complex numbers. The root node of each tree $D_j$ then stores the absolute row sum, i.e., the value $\sum_{k=0}^{N-1} |H_{jk}^*| = \sigma_j$, and we can therefore calculate the value $\|H\|_1 - \sigma_j$ in constant time. Furthermore, $\|H\|_1$ can be obtained as the maximum among all the roots of the binary trees, which can be done during the construction of the data structure, or through another binary search through the tree structure which ends in the $D_j$'s.

1 Note that the conjugation here is necessary. See Eq. (4.26).
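To make the stored values concrete, the following Python sketch builds the trees of Definition 2 for a small Hermitian matrix classically. It is only an illustration of which numbers sit in which node, not a model of the qRAM itself. The helper names `build_tree` and `leaf_probability` are hypothetical, and reading the child/parent ratios as the quantities that parametrise the conditional rotations is our assumption.

```python
def build_tree(row):
    """Levels of the tree D_j for one row; levels[0] are the leaves, levels[-1] is [root]."""
    n = len(row)
    if n < 2 or n & (n - 1):
        raise ValueError("row length must be a power of two (N = 2^n)")
    levels = [[complex(x).conjugate() for x in row]]      # leaves store H_jk^*
    while len(levels[-1]) > 1:
        below = levels[-1]
        # each parent stores |a| + |b| for its two children a and b
        levels.append([abs(below[i]) + abs(below[i + 1]) for i in range(0, len(below), 2)])
    return levels


def leaf_probability(levels, k):
    """Product of the branching ratios child/parent on the path from the root to leaf k."""
    prob = 1.0
    for depth in range(len(levels) - 1, 0, -1):
        parent = levels[depth][k >> depth]
        child = abs(levels[depth - 1][k >> (depth - 1)])
        prob *= child / parent if parent else 0.0
    return prob


H = [[1, 3 + 4j, 0, 0],
     [3 - 4j, 2, 0, 5],
     [0, 0, 0, 1j],
     [0, 5, -1j, 0]]                   # small Hermitian example (illustrative values)
trees = [build_tree(row) for row in H]
sigma = [t[-1][0] for t in trees]      # root of D_j is sigma_j = sum_k |H_jk|
norm_1 = max(sigma)                    # ||H||_1 is the largest root
```

The product of the log(N) branching ratios along a path telescopes to $|H_{jk}|/\sigma_j$, which is why a logarithmic number of tree look-ups suffices to address any leaf with the correct weight.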

Figure 4.1: An example of the data structure that allows for efficient state preparation using a logarithmic number of conditional rotations.

With this data structure, we can efficiently perform a state preparation which we will require in the Hamiltonian simulation algorithm in Sec. 4.4.4. Notably, as we will see in the next subsection, such a fast quantum memory has a classical equivalent which allows us to sample very efficiently from the data.

There are several challenges with such a memory architecture, including a large overhead from the controlled operations and possible challenges from decoherence. For a discussion of some caveats of such qRAM architectures, we point the reader to the reviews [72] or [5]. An important notion that we want to mention here is the separation between quantum algorithms which use quantum data (i.e., data which is accessible in superposition) and ones which operate on classical data which needs to be accessed efficiently to allow polylogarithmic runtimes. In general, most quantum algorithms are discussed in the so-called query model, i.e., the computational complexity is given in terms of the number of queries to an oracle. The actual oracle could then be another quantum algorithm, or a memory model such as qRAM. The computational time is then given by the query complexity times the time for each query call, which could for example be $T_{\mathrm{query}} \times \log(N)$ if we call the qRAM $T_{\mathrm{query}}$ times. If a quantum algorithm requires qRAM or a similar solution, then we need to understand the associated limitations and assumptions and assess these with care. In particular, the superiority w.r.t. classical algorithms is in question here when comparing to classical PRAM machines [73]. Results that consider error probabilities in certain quantum RAM architectures further indicate that a fully error-corrected quantum RAM is necessary to maintain speedups, which might result in huge overheads and questionable advantages [74, 75].

4.2.3 Quantum inspired memory structures

Recent quantum-inspired algorithms rely on a data structure which is very similar to the qRAM used in many QML algorithms. While QML algorithms rely on qRAM or 'quantum access' to data in order to allow for state preparation with linear gate count but polylogarithmic depth, quantum-inspired algorithms achieve their polylogarithmic complexities through sampling and query access to the data via a dynamic data structure [21].

The sampling and query access model can be thought of as a classical analogue of the qRAM model for state preparation introduced above. The ability to prepare a state $|v\rangle$ which is proportional to some input vector $v \in \mathbb{C}^N$ (such as a column of $H \in \mathbb{C}^{N\times M}$) from memory is equivalent to the ability to make the following queries:

1. Given an index $i \in [N]$, output the corresponding entry $v_i$ of the vector v.

2. Sample the index $j \in [N]$ with probability $|v_j|^2/\|v\|_2^2$.

3. Output the Euclidean norm $\|v\|_2$ of the vector.

For a matrix $H \in \mathbb{C}^{M\times N}$ this extends to the ability to perform the following queries:

1. Sample and query access for each vector $H_i$, i.e., for all rows $i \in [N]$ we can output each entry, sample indices with probability proportional to the squared magnitude of the entry, and output the Euclidean norm of the row.

2. Sample and query access to the vector h with $h_i = \|H_i\|$, i.e., the vector of row norms.

Notably, if the input data is given in a classical form, classical algorithms can be run efficiently in the sampling and query model whenever the corresponding quantum algorithms require qRAM access to the data and all state preparations can be performed efficiently (i.e., in time logarithmic in the input dimensions).

The classical data structure which enables such sampling and query access to the data is similar to the one described in Fig. 4.1. It stores a matrix $H \in \mathbb{C}^{M\times N}$, again in the form of a set of binary trees, where each tree contains the absolute values (or, to be precise, the squares of the absolute values) of the entries of one row or column of the input matrix H. The (time) cost of a query to any entry of H is then O(1), and sampling can be performed in time O(log(MN)) for any $H \in \mathbb{C}^{M\times N}$. We summarise this data structure below in Definition 3.

Definition 3 (Classical Data Structure). Let $H \in \mathbb{C}^{N\times M}$ be a matrix (where $N = 2^n$), $\|H\|_F$ being the Frobenius norm, and $h_i = \|H_i\|_2$ being the Euclidean norm of row i of H. Each entry $H_{jk}$ is represented with b bits of precision. Define D as an array of N binary trees $D_j$ for $j \in \{0,\ldots,N-1\}$. Each $D_j$ corresponds to the row $H_j$, and its organization is specified by the following rules.

1. The leaf node k of the tree $D_j$ stores the value $H_{jk}/|H_{jk}|$ corresponding to the index-entry pair $(j,k,H_{jk})$.

2. For the level immediately above the bottom level, i.e., the leaves, the entry k of the tree $D_j$ is given by $|H_{jk}|^2$.

3. For any node level above the leaves, the data stored is determined as follows: suppose the node has two children storing data a and b respectively (note that a and b are squared absolute values of complex numbers). Then the entry that is stored in this node is given by $(a + b)$.

The root node of each binary tree $D_j$ is then given by $h_j^2 = \|H_j\|_2^2$, and the memory structure is completed by applying the same tree structure once more, where the leaves j of the top-level tree are given by the root nodes of the $D_j$.

Figure 4.2 demonstrates an example of this structure.

[Figure: two binary trees, one per row of H. From bottom to top, the levels hold $H_{jk}/|H_{jk}|$, $|H_{jk}|^2$, the pairwise sums $|H_{j1}|^2 + |H_{j2}|^2$ and $|H_{j3}|^2 + |H_{j4}|^2$, and the roots $\|H_j\|^2$, which are joined under a common top node.]

Figure 4.2: An example of the classical (dynamic) data structure that enables efficient sample and query access for the example of $H \in \mathbb{C}^{2\times 4}$.

We do not explicitly rely on such a data structure for the results of Section 4.4.5, since we use a less restrictive requirement which we call row-computability and row-searchability. However, the data structure in Def. 3 immediately allows us to perform these operations. In this thesis we will simply refer to this requirement as query and sample access, since this is the commonly used term in the dequantisation literature and suffices in practice. There are exceptions, however: row-computability and row-searchability can, for example, also be achieved through structural properties of the matrix, which would not require the memory structure.
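The sample and query operations supported by the structure of Def. 3 can be sketched as follows. This is a simplified illustration under our own assumptions: the class names `SampleQueryVector` and `SampleQueryMatrix` are hypothetical, the trees are stored in an array-backed heap layout, and the entries are kept directly instead of splitting them into a phase and a squared magnitude.

```python
import random


class SampleQueryVector:
    """Binary sum tree over the squared magnitudes |v_i|^2 (heap layout, root at index 1)."""

    def __init__(self, values):
        self.values = list(values)
        self.size = 1
        while self.size < len(self.values):
            self.size *= 2                       # pad to a power of two with zero weights
        self.tree = [0.0] * (2 * self.size)
        for i, v in enumerate(self.values):
            self.tree[self.size + i] = abs(v) ** 2
        for node in range(self.size - 1, 0, -1):
            self.tree[node] = self.tree[2 * node] + self.tree[2 * node + 1]

    def query(self, i):
        return self.values[i]                    # O(1) entry look-up

    def squared_norm(self):
        return self.tree[1]                      # the root stores ||v||_2^2

    def update(self, i, value):
        self.values[i] = value
        node = self.size + i
        self.tree[node] = abs(value) ** 2
        node //= 2
        while node >= 1:                         # refresh the O(log n) ancestors
            self.tree[node] = self.tree[2 * node] + self.tree[2 * node + 1]
            node //= 2

    def sample(self, rng=random):
        """Index i with probability |v_i|^2 / ||v||_2^2; assumes a non-zero vector."""
        node = 1
        while node < self.size:
            left = self.tree[2 * node]
            # descend left with probability left / (left + right)
            node = 2 * node if rng.random() * self.tree[node] < left else 2 * node + 1
        return node - self.size


class SampleQueryMatrix:
    """One tree per row plus a top-level tree over the row norms h_i = ||H_i||_2."""

    def __init__(self, matrix):
        self.rows = [SampleQueryVector(row) for row in matrix]
        self.row_norms = SampleQueryVector([r.squared_norm() ** 0.5 for r in self.rows])

    def query(self, i, j):
        return self.rows[i].query(j)

    def frobenius_norm_squared(self):
        return self.row_norms.squared_norm()

    def sample_entry(self, rng=random):
        """Pair (i, j) with probability |H_ij|^2 / ||H||_F^2: first a row, then a column."""
        i = self.row_norms.sample(rng)
        return i, self.rows[i].sample(rng)
```

Sampling walks one root-to-leaf path in the top-level tree and one in the selected row tree, which matches the O(log(MN)) sampling cost stated above. A matrix-level update would additionally have to refresh the corresponding leaf of the top-level tree, which is omitted here.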

4.3 Basic matrix multiplication

In this section we introduce some of the main concepts from randomised numerical linear algebra using the example of the randomised matrix multiplication algorithm. We denote the i-th column of a matrix A by $A^i$ and the j-th row of A by $A_j$. In the following, let $A \in \mathbb{R}^{m\times n}$ and $B \in \mathbb{R}^{n\times p}$. With this notation, recall that $(AB)_{ik} = \sum_{j=1}^n A_{ij} B_{jk} = A_i B^k$, and that we can write the product $AB = \sum_{i=1}^n A^i B_i$ as a sum of outer products, i.e., a sum of rank-1 matrices, where $A^i B_i \in \mathbb{R}^{m\times p}$. The representation of the product AB as a sum of outer products implies that AB can be decomposed into a sum of random variables which, if appropriately chosen, give the product AB in expectation. This suggests that we could sample such terms to approximate the product. Specifically, we could use an approximation of the form

\[ AB = \sum_{i=1}^{n} A^i B_i \approx \frac{1}{c} \sum_{t=1}^{c} \frac{A^{i_t} B_{i_t}}{p_{i_t}} = CR, \tag{4.1} \]

where $\{p_i\}_{i=1}^n$ are the sampling probabilities. We could do this via uniform random sampling, but this would lead to a very high variance. Hence we would like to optimise the sampling probabilities to obtain a good result.

In general, instead of working with probabilities, it is easier to work with matrices, and we can define a standardized matrix notation called the sampling matrix formalism according to [71]. Notably, we can treat many other approaches in the framework of matrix multiplication, and will therefore also rely on it in the following. For this we let $S \in \mathbb{R}^{n\times c}$ be a matrix such that

\[ S_{ij} := \begin{cases} 1 & \text{if the } i\text{-th column of } A \text{ is chosen in the } j\text{-th independent trial,} \\ 0 & \text{otherwise,} \end{cases} \tag{4.2} \]

and let $D \in \mathbb{R}^{c\times c}$ be a diagonal matrix such that

\[ D_{tt} = 1/\sqrt{c\, p_{i_t}}. \tag{4.3} \]

Using this, we can write the output of the sampling process described in Eq. 4.1 via the simplified notation

\[ CR = ASD(SD)^T B \approx AB. \tag{4.4} \]
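The sampling matrix formalism can be checked on a toy instance. The sketch below, with hypothetical helper names and an arbitrarily fixed list of sampled indices (0-based), builds S and D and confirms that $ASD(SD)^TB$ reproduces the rescaled sum of sampled outer products from Eq. 4.1.

```python
import math


def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y))) for j in range(len(Y[0]))]
            for i in range(len(X))]


def transpose(X):
    return [list(col) for col in zip(*X)]


def sampling_matrices(n, indices, probs):
    """S (n x c) marks which index was drawn in each trial; D (c x c) holds 1/sqrt(c p_{i_t})."""
    c = len(indices)
    S = [[1.0 if indices[t] == i else 0.0 for t in range(c)] for i in range(n)]
    D = [[1.0 / math.sqrt(c * probs[indices[t]]) if t == u else 0.0 for u in range(c)]
         for t in range(c)]
    return S, D


A = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]          # m x n
B = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]        # n x p
probs = [0.5, 0.25, 0.25]
indices = [2, 0, 2, 1]                           # assumed outcome of c = 4 trials

S, D = sampling_matrices(len(B), indices, probs)
SD = matmul(S, D)
C = matmul(A, SD)                                # columns A^{i_t} / sqrt(c p_{i_t})
R = matmul(transpose(SD), B)                     # rows B_{i_t} / sqrt(c p_{i_t})
CR = matmul(C, R)
```

In an actual implementation one would never form S and D explicitly, since C and R can be assembled directly from the sampled columns and rows; the matrices are mainly a convenient notation for the analysis.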

The matrix multiplication algorithm is given below in Algorithm 2, and we then give a proof of its correctness.

In the following analysis we will use the fact that for rank-1 matrices it holds that $\|A^i B_i\|_2 = \|A^i\|_2 \|B_i\|_2$. In order to see why this is the case, observe that

\[ \|A^i B_i\|_2 = \sqrt{(A^i B_i)^T (A^i B_i)} = \sqrt{B_i^T (A^i)^T A^i B_i}, \]

but $(A^i)^T A^i$ is simply the inner product, and hence

\[ \sqrt{\sigma_{\max}\!\left(\|B_i\|_2^2 \|A^i\|_2^2\right)} = \sqrt{\|B_i\|_2^2 \|A^i\|_2^2} = \|B_i\|_2 \|A^i\|_2. \]

Algorithm 2 The BASIC-MATRIX-MULTIPLICATION Algorithm [70]

INPUT: $A \in \mathbb{R}^{m\times n}$, $B \in \mathbb{R}^{n\times p}$, integer $c > 0$ and sampling probabilities $\{p_i\}_{i=1}^n$.
OUTPUT: C and R s.t. $CR \approx AB$.

1: Initialise C, R as all-zeros matrices.
2: for $t = 1$ to $c$ do
3:     Sample an index $i_t \in \{1,\ldots,n\}$ with probability $P[i_t = k] = p_k$, i.i.d. with replacement.
4:     Set $C^t = A^{i_t}/\sqrt{c\, p_{i_t}}$ and $R_t = B_{i_t}/\sqrt{c\, p_{i_t}}$.
5: end for
6: return C and R.
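The BASIC-MATRIX-MULTIPLICATION algorithm translates almost line by line into code. The following pure-Python sketch is illustrative only: the function names are hypothetical, matrices are lists of rows, indices are 0-based, and the sampling probabilities proportional to $\|A^k\|_2 \|B_k\|_2$ are one possible choice rather than a requirement of the algorithm.

```python
import math
import random


def norm_product_probabilities(A, B):
    """p_k proportional to ||A^k||_2 * ||B_k||_2 (column k of A, row k of B)."""
    weights = [
        math.sqrt(sum(row[k] ** 2 for row in A)) * math.sqrt(sum(x ** 2 for x in B[k]))
        for k in range(len(B))
    ]
    total = sum(weights)
    return [w / total for w in weights]


def basic_matrix_multiplication(A, B, c, probs, rng=random):
    """Return C (m x c) and R (c x p) such that C R approximates A B."""
    m, n, p = len(A), len(B), len(B[0])
    C = [[0.0] * c for _ in range(m)]
    R = [[0.0] * p for _ in range(c)]
    indices = rng.choices(range(n), weights=probs, k=c)   # i.i.d. trials with replacement
    for t, k in enumerate(indices):
        scale = 1.0 / math.sqrt(c * probs[k])
        for i in range(m):
            C[i][t] = A[i][k] * scale                     # C^t = A^{i_t} / sqrt(c p_{i_t})
        R[t] = [B[k][j] * scale for j in range(p)]        # R_t = B_{i_t} / sqrt(c p_{i_t})
    return C, R


def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y))) for j in range(len(Y[0]))]
            for i in range(len(X))]


if __name__ == "__main__":
    rng = random.Random(0)
    A = [[rng.gauss(0, 1) for _ in range(50)] for _ in range(4)]
    B = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(50)]
    C, R = basic_matrix_multiplication(A, B, 200, norm_product_probabilities(A, B), rng)
    AB, CR = matmul(A, B), matmul(C, R)
    err = math.sqrt(sum((AB[i][j] - CR[i][j]) ** 2 for i in range(4) for j in range(3)))
    print("Frobenius error of the sampled product:", err)
```

Note that the cost of forming C and R depends on c, m and p but not on the inner dimension n once the probabilities are available, which is the source of the speed-up.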

We now prove a lemma which states that CR is an unbiased estimator for AB, element-wise, and calculate the variance of that estimator. This variance depends strongly on the sampling probabilities. Based on this, we can then derive optimal sampling probabilities which minimise the variance. In practice these will not be accessible, and one typically needs to find approximations for them. We therefore use other distributions in practice, such as the row or column norms, or so-called leverage scores [71, 69].

Lemma 5 ([70]). Given two matrices $A \in \mathbb{R}^{m\times n}$ and $B \in \mathbb{R}^{n\times p}$, construct matrices C and R with the matrix multiplication algorithm from above. Then,

\[ E[(CR)_{ij}] = (AB)_{ij}, \tag{4.5} \]

\[ \mathrm{Var}[(CR)_{ij}] = \frac{1}{c} \sum_{k=1}^{n} \frac{A_{ik}^2 B_{kj}^2}{p_k} - \frac{1}{c} (AB)_{ij}^2. \tag{4.6} \]

Proof. Fix an index pair (i, j) and define the random variable $X_t = \left( \frac{A^{i_t} B_{i_t}}{c\, p_{i_t}} \right)_{ij} = \frac{A_{i i_t} B_{i_t j}}{c\, p_{i_t}}$, and observe that $(CR)_{ij} = \sum_{t=1}^{c} X_t$. Then we have that $E[X_t] = \sum_{k=1}^{n} p_k \frac{A_{ik} B_{kj}}{c\, p_k} = \frac{1}{c} (AB)_{ij}$, and $E[X_t^2] = \sum_{k=1}^{n} \frac{A_{ik}^2 B_{kj}^2}{c^2 p_k}$. Furthermore, we just need to sum up in order to obtain $E[(CR)_{ij}]$, and hence obtain

\[ E[(CR)_{ij}] = \sum_{t=1}^{c} E[X_t] = (AB)_{ij}, \]

and, since the trials are independent, $\mathrm{Var}[(CR)_{ij}] = \sum_{t=1}^{c} \mathrm{Var}[X_t]$, where we can determine the variance of $X_t$ easily from $\mathrm{Var}[X_t] = E[X_t^2] - E[X_t]^2$, and the lemma follows.

We can use this directly to establish the expected error in terms of the Frobenius norm, i.e., $E[\|AB - CR\|_F^2]$, by observing that we can treat this as a sum over each of the individual entries, i.e.,

\[ E[\|AB - CR\|_F^2] = \sum_{i=1}^{m} \sum_{j=1}^{p} E\left[ (AB - CR)_{ij}^2 \right]. \]

Doing so, we obtain the following result.

Lemma 6 (Basic Matrix Multiplication [70]). Given matrices A and B, construct matrices C and R with the matrix multiplication algorithm from above. Then it holds that

\[ E[\|AB - CR\|_F^2] = \sum_{k=1}^{n} \frac{\|A^k\|_2^2 \|B_k\|_2^2}{c\, p_k} - \frac{1}{c} \|AB\|_F^2. \tag{4.7} \]

Furthermore, if

\[ p_k = p_k^{\mathrm{optimal}} = \frac{\|A^k\|_2 \|B_k\|_2}{\sum_{k'} \|A^{k'}\|_2 \|B_{k'}\|_2} \]

are the sampling probabilities used, then

\[ E[\|AB - CR\|_F^2] = \frac{\left( \sum_{k=1}^{n} \|A^k\|_2 \|B_k\|_2 \right)^2}{c} - \frac{1}{c} \|AB\|_F^2. \tag{4.8} \]

Proof. As mentioned above, first note that

\[ E[\|AB - CR\|_F^2] = \sum_{i=1}^{m} \sum_{j=1}^{p} E\left[ (AB - CR)_{ij}^2 \right]. \]

Observe that the squared terms give $(AB - CR)_{ij}^2 = (AB)_{ij}^2 - (AB)_{ij}(CR)_{ij} - (CR)_{ij}(AB)_{ij} + (CR)_{ij}^2$. Using the result for $E[(CR)_{ij}]$ from Lemma 5, and since $E[(AB)_{ij}(CR)_{ij}] = (AB)_{ij}^2$, we have

\[ E[(AB - CR)_{ij}^2] = E[(CR)_{ij}^2] - (AB)_{ij}^2 = E[(CR)_{ij}^2] - E[(CR)_{ij}]^2 = \mathrm{Var}[(CR)_{ij}]. \]

Therefore we have $E[\|AB - CR\|_F^2] = \sum_{i=1}^{m} \sum_{j=1}^{p} \mathrm{Var}[(CR)_{ij}]$. Using the result from Lemma 5 again for the variance, we then directly obtain the result by using that $\sum_i A_{ik}^2 = \|A^k\|_2^2$ and $\sum_j B_{kj}^2 = \|B_k\|_2^2$.
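For very small instances, Eq. 4.7 can be verified exactly by enumerating all $n^c$ possible index sequences and using rational arithmetic. The sketch below does this for hypothetical toy matrices. The function names are ours, and the enumeration is only feasible for tiny n and c.

```python
from fractions import Fraction
from itertools import product


def exact_expected_error(A, B, probs, c):
    """E||AB - CR||_F^2 computed exactly by summing over all n^c index sequences."""
    m, n, p = len(A), len(B), len(B[0])
    AB = [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(p)] for i in range(m)]
    total = Fraction(0)
    for idx in product(range(n), repeat=c):
        weight = Fraction(1)
        for k in idx:
            weight *= probs[k]                    # probability of this sequence of trials
        err = Fraction(0)
        for i in range(m):
            for j in range(p):
                cr_ij = sum(Fraction(A[i][k] * B[k][j]) / (c * probs[k]) for k in idx)
                err += (AB[i][j] - cr_ij) ** 2
        total += weight * err
    return total


def lemma6_value(A, B, probs, c):
    """Right-hand side of Eq. 4.7."""
    m, n, p = len(A), len(B), len(B[0])
    AB = [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(p)] for i in range(m)]
    first = sum(
        Fraction(sum(A[i][k] ** 2 for i in range(m)) * sum(B[k][j] ** 2 for j in range(p)))
        / (c * probs[k])
        for k in range(n)
    )
    return first - Fraction(sum(x * x for row in AB for x in row), c)


A = [[1, 2, 0], [0, 1, 3]]                        # integer toy data (illustrative)
B = [[1, 0], [2, 1], [1, 1]]
probs = [Fraction(1, 2), Fraction(1, 4), Fraction(1, 4)]
```

Because both sides are computed with `Fraction`, the comparison is an exact identity rather than a statistical estimate, and doubling c halves the expected squared error, as the 1/c prefactor in Eq. 4.7 predicts.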

The sampling probabilities that we have used here are optimal in the sense that they minimize the expected error. To show this we can define the function

\[ f(\{p_i\}_{i=1}^n) = \sum_{k=1}^{n} \frac{\|A^k\|_2^2 \|B_k\|_2^2}{p_k}, \]

which captures the $p_k$-dependent part of the error. In order to find the optimal probabilities we minimise this under the constraint that $\sum_k p_k = 1$, i.e., we define the function $g = f(\{p_i\}_{i=1}^n) + \lambda \left( \sum_k p_k - 1 \right)$. Setting the derivative of this function w.r.t. the $p_k$ to zero gives $-\|A^k\|_2^2 \|B_k\|_2^2 / p_k^2 + \lambda = 0$, i.e., $p_k = \|A^k\|_2 \|B_k\|_2 / \sqrt{\lambda}$, and normalising correctly then gives the probabilities in Lemma 6. Note a few important points about this result.

• Note that the result depends strongly on $\{p_k\}_{k=1}^n$. In particular, we need a strategy to obtain these probabilities.

• With Markov's inequality we can remove the expectation (using the assumption that $p_k \ge \beta\, p_k^{\mathrm{optimal}}$). To do this we reformulate Eq. 4.8, so that for some $1 \ge \beta > 0$, and hence nearly optimal probabilities for β sufficiently close to 1, we obtain

\[ E[\|AB - CR\|_F^2] \le \frac{\left( \sum_{k=1}^{n} \|A^k\|_2 \|B_k\|_2 \right)^2}{\beta c} - \frac{1}{c} \|AB\|_F^2 \le \frac{1}{\beta c} \|A\|_F^2 \|B\|_F^2, \tag{4.9} \]

where we neglect the last term and use the Cauchy-Schwarz inequality, i.e.,

\[ \left( \sum_{k=1}^{n} \|A^k\|_2 \|B_k\|_2 \right)^2 \le \sum_{k=1}^{n} \|A^k\|_2^2 \sum_{k=1}^{n} \|B_k\|_2^2. \]

Now we can apply Jensen's inequality such that $E[\|AB - CR\|_F] \le \frac{1}{\sqrt{\beta c}} \|A\|_F \|B\|_F$. Hence if we take the number of samples $c \ge 1/(\beta \varepsilon^2)$, then we obtain a bound on the error of $E[\|AB - CR\|_F] \le \varepsilon \|A\|_F \|B\|_F$. We can now use Markov's inequality to remove the expectation from this bound, and in some cases this will be good enough. Let

\[ \delta = P\left[ \|AB - CR\|_F > \frac{\alpha}{\sqrt{\beta c}} \|A\|_F \|B\|_F \right]; \]

using Markov's inequality we obtain

\[ \delta \le \frac{E[\|AB - CR\|_F]}{\frac{\alpha}{\sqrt{\beta c}} \|A\|_F \|B\|_F} \le \frac{1}{\alpha}. \]

Hence, choosing $\alpha = 1/\delta$, with probability $\ge 1 - \delta$ and for $c \ge 1/(\beta \delta^2 \varepsilon^2)$ we have that

\[ \|AB - CR\|_F \le \varepsilon \|A\|_F \|B\|_F. \]
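To get a feeling for how quickly the Markov-based sample count grows, the condition obtained by solving $\alpha/\sqrt{\beta c} \le \varepsilon$ with $\alpha = 1/\delta$, i.e., $c \ge 1/(\beta \delta^2 \varepsilon^2)$, can be evaluated directly. The helper below is a hypothetical convenience function, not part of the cited algorithms.

```python
import math


def markov_sample_count(eps, delta, beta=1.0):
    """Smallest integer c with c >= 1 / (beta * delta^2 * eps^2)."""
    if not (eps > 0 and 0 < delta <= 1 and 0 < beta <= 1):
        raise ValueError("need eps > 0, 0 < delta <= 1 and 0 < beta <= 1")
    return math.ceil(1.0 / (beta * delta ** 2 * eps ** 2))


if __name__ == "__main__":
    for delta in (0.5, 0.25, 0.125):
        # Halving the failure probability quadruples the number of samples.
        print(delta, markov_sample_count(0.5, delta))
```

The quadratic dependence on $1/\delta$ is what makes this bound impractical for small failure probabilities.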

While the above bounds are already good, if we desire a small failure probability δ, the number of samples grows too rapidly to be useful in practice. However, the above results can be exponentially improved using much stronger Chernoff-type bounds. These allow us to reduce the dependence of the number of samples on the failure probability to O(log(1/δ)), rather than poly(1/δ), and therefore lead to much more practically useful bounds. We do not go into much detail here, as we believe the main ideas are covered above and the details are beyond the scope of this thesis. Next, we first use the quantum random access memory from Def. 2 to derive a fast quantum algorithm for Hamiltonian simulation, and then show how randomised numerical linear algebra can be used to design fast classical algorithms for the same task.

4.4 Hamiltonian Simulation

In this section, we derive two results. First, we derive a quantum algorithm for Hamiltonian simulation, which is based on the qRAM model that we described above in Def. 2. Next, we design a classical algorithm for Hamiltonian simulation whose complexity is independent of the dimensionality and depends instead on the Frobenius norm, and therefore on the spectral norm and the rank of the input Hamiltonian. The classical algorithm relies on the classical data structure of Def. 3.

4.4.1 Introduction

Hamiltonian simulation is the problem of simulating the dynamics of quantum systems, i.e., how a quantum system evolves over time. Using quantum computers to describe these dynamics was Feynman's original motivation for quantum computers [76, 77]. It has been shown that quantum simulation is BQP-hard, and therefore it was conjectured that no classical algorithm can solve it in polynomial time, since such an algorithm would be able to efficiently solve any problem which can be solved efficiently with a quantum algorithm, including integer factorization [78].

One important distinction between the quantum algorithm which we present in this thesis and quantum algorithms which have been traditionally developed is the input model. A variety of input models have been considered in previous quantum algorithms for simulating Hamiltonian evolution. The local Hamiltonian model is defined by a number of local terms of a given Hamiltonian. On the other hand, the sparse-access model for a Hamiltonian H with sparsity s, i.e., the number of non-zero entries per row or column, is specified by the following two oracles:

\[ O_S\, |i,j\rangle |z\rangle \mapsto |i,j\rangle |z \oplus S_{i,j}\rangle, \text{ and} \tag{4.10} \]

\[ O_H\, |i,j\rangle |z\rangle \mapsto |i,j\rangle |z \oplus H_{i,j}\rangle, \tag{4.11} \]

for $i, j \in [N]$, where $S_{i,j}$ is the j-th nonzero entry of the i-th row of $H \in \mathbb{R}^{N\times N}$ and $\oplus$ denotes the bit-wise XOR. Note that H is a Hermitian matrix, and therefore square. The linear combination of unitaries (LCU) model decomposes the Hamiltonian into a linear combination of unitary matrices, and we are given the coefficients and access to an implementation of each unitary.

The first proposal for an implementation of Hamiltonian simulation on a quantum computer came from Lloyd [8], and was based on local Hamiltonians. Later, Aharonov and Ta-Shma described an efficient algorithm for an arbitrary sparse Hamiltonian [79], which was only dependent on the sparsity instead of the dimension N. Subsequently, a wide range of algorithms have been proposed, each improving the runtime [80, 81, 82, 83, 84, 85, 86, 87, 88, 89]. Most of these algorithms have been defined in the sparse-access model, and have led to optimal dependence on all (or nearly all) parameters for sparse Hamiltonians over recent years [90, 91, 92].

The above-mentioned input models are highly relevant when we want to simulate a physical system, for example to obtain the energy of a small molecule. However, these models are not generally used in modern machine learning or numerical linear algebra algorithms, as we have seen above. Here it can be more convenient to work with access to a quantum random access memory (qRAM) model, which we have described in Def. 2. In this model, we assume that the entries of a Hamiltonian are stored in a binary tree data structure [16], and that we have quantum access to the memory. Here, quantum access implies that the qRAM is able to efficiently prepare quantum states corresponding to the input data. The use of the qRAM model has been successfully demonstrated in many applications such as quantum principal component analysis [13], quantum support vector machines [14], and quantum recommendation systems [16], among many other quantum machine learning algorithms.

In contrast to prior work, we consider the qRAM model in order to simulate Hamiltonians that are not necessarily sparse. The qRAM allows us to efficiently prepare states that encode the rows of the Hamiltonian. Combining this ability to prepare states with a quantum walk [90], we derive the first Hamiltonian simulation algorithm in the qRAM model whose time complexity has $\tilde{O}(\sqrt{N})$ dependence for non-sparse Hamiltonians of dimensionality N, where $\tilde{O}(\cdot)$ hides all poly-logarithmic factors. Our results immediately imply [93] a quantum linear system algorithm in the qRAM model with square-root dependence on the dimension and poly-logarithmic dependence on the precision, which exponentially improves the precision dependence of the quantum linear systems algorithm by [94].

The main hurdle in quantum-walk based Hamiltonian simulation is to efficiently prepare the states which allow for a quantum walk corresponding to $e^{-iH/\|H\|_1}$. These states are substantially different from those which have been used in previous quantum algorithms [16, 81, 90]. Concretely, the states required in previous algorithms, such as [90], allow for a quantum walk corresponding to $e^{-iH/(s\|H\|_{\max})}$, where s is the row-sparsity of H. These states can be prepared with O(1) queries to the sparse-access oracle. However, the states we are required to prepare for the non-sparse Hamiltonian simulation algorithm cannot use structural features such as sparsity of the Hamiltonian, and it is not known how to prepare such states efficiently in the sparse-access model. In the qRAM model, on the other hand, we are able to prepare such states with time complexity (circuit depth) of O(polylog(N)), as we will demonstrate below.

Using the efficient state preparation procedure in combination with a linear combination of quantum walks, we are able to simulate the time evolution for non-sparse Hamiltonians with only poly-logarithmic dependency on the precision. The main result of this chapter is summarised in the following theorem, which we prove in Sec. 4.4.4.

Theorem 6 (Non-sparse Hamiltonian Simulation). Let $H \in \mathbb{C}^{N\times N}$ (with $N = 2^n$ for an n-qubit system) be a Hermitian matrix stored in the data structure as specified in Definition 2. There exists a quantum algorithm for simulating the evolution of H for time t and error ε with time complexity (circuit depth)

\[ O\!\left( t \|H\|_1\, n^2 \log^{5/2}\!\left( t \|H\|_1 / \varepsilon \right) \frac{\log(t \|H\| / \varepsilon)}{\log\log(t \|H\| / \varepsilon)} \right). \tag{4.12} \]

Here $\|\cdot\|_1$ denotes the induced 1-norm (i.e., maximum absolute row-sum norm), defined as $\|H\|_1 = \max_j \sum_{k=0}^{N-1} |H_{jk}|$, and $\|\cdot\|$ denotes the spectral norm. In the following we will also need the max norm $\|\cdot\|_{\max}$, defined by $\|H\|_{\max} = \max_{j,k} |H_{jk}|$. Since it holds that $\|H\|_1 \le \sqrt{N} \|H\|$ (see [95]), we immediately obtain the following corollary.

Corollary 7. Let $H \in \mathbb{C}^{N\times N}$ (where $N = 2^n$) be a Hermitian matrix stored in the data structure as specified in Definition 2. There exists a quantum algorithm for simulating the evolution of H for time t and error ε with time complexity (circuit depth)

\[ O\!\left( t \sqrt{N} \|H\|\, n^2 \log^{5/2}\!\left( t \sqrt{N} \|H\| / \varepsilon \right) \frac{\log(t \|H\| / \varepsilon)}{\log\log(t \|H\| / \varepsilon)} \right). \tag{4.13} \]

Remarks:

1. As we can see from Corollary 7, the circuit depth scales as $\tilde{O}(\sqrt{N})$. However, the gate complexity can in principle scale as $O(N^{2.5}\log^2(N))$ due to the addressing scheme of the qRAM, as defined in Definition 2. For structured Hamiltonians $H$, the addressing scheme can of course be implemented more efficiently.

2. If the Hamiltonian $H$ is $s$-sparse, i.e., $H$ has at most $s$ non-zero entries in each row or column, then the time complexity (circuit depth) of the algorithm is given by

\[ O\left( t\sqrt{s}\,\|H\|\, n^2 \log^{5/2}\!\left(t\sqrt{s}\,\|H\|/\varepsilon\right) \frac{\log(t\,\|H\|/\varepsilon)}{\log\log(t\,\|H\|/\varepsilon)} \right). \tag{4.14} \]

To show this, we need the following proposition.

Proposition 1. If $A \in \mathbb{C}^{N\times N}$ has at most $s$ non-zero entries in any row, it holds that $\|A\|_1 \leq \sqrt{s}\,\|A\|$.

Proof. First observe that $\|A\|^2 \leq \sum_i \lambda_i(A^H A) = \mathrm{Tr}\left(A^H A\right) = \|A\|_F^2$, and furthermore we have that $\sum_{ij} |a_{ij}|^2 \leq s \max_{j\in[N]} \sum_i |a_{ij}|^2$. From this we have that $\|A\| \leq \sqrt{s}\,\|A\|_1$ for an $s$-sparse $A$. By [96, Theorem 5.6.18], we have that $\|A\|_1 \leq C_M(1,*)\,\|A\|$ for $C_M(1,*) = \max_{A\neq 0} \frac{\|A\|_1}{\|A\|}$, and using the above we have $C_M(1,*) \leq \sqrt{s}$. Therefore we find that $\|A\|_1 \leq \sqrt{s}\,\|A\|$ as desired.

The result then immediately follows since $\|A\|_1 \leq \sqrt{N}\,\|A\|$ for dense $A$, i.e., using $\|H\|_1 \leq \sqrt{s}\,\|H\|$ from Proposition 1 with $s = N$, and the result from Theorem 6.
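As a quick numerical sanity check of Proposition 1, the following sketch compares the maximum absolute row sum with $\sqrt{s}$ times the spectral norm for a small hand-picked matrix. The helper names and the test matrix are illustrative assumptions, and the spectral norm is only estimated by power iteration.

```python
import math

def induced_one_norm(a):
    """Maximum absolute row sum, the norm the text writes as ||A||_1."""
    return max(sum(abs(x) for x in row) for row in a)

def spectral_norm(a, iters=500):
    """Estimate ||A|| by power iteration on A^H A (converges from below)."""
    n_rows, n_cols = len(a), len(a[0])
    v = [1.0 + 0.1 * k for k in range(n_cols)]  # generic start vector
    est = 0.0
    for _ in range(iters):
        norm_v = math.sqrt(sum(abs(x) ** 2 for x in v))
        if norm_v == 0.0:
            return est
        v = [x / norm_v for x in v]
        av = [sum(a[i][k] * v[k] for k in range(n_cols)) for i in range(n_rows)]
        est = math.sqrt(sum(abs(x) ** 2 for x in av))
        # Apply A^H to map back to the input space.
        v = [sum(complex(a[i][k]).conjugate() * av[i] for i in range(n_rows))
             for k in range(n_cols)]
    return est

# Hypothetical example: every row has exactly s = 2 non-zero entries.
A = [[2, 1, 0, 0],
     [1, 2, 0, 0],
     [0, 0, 1, 3],
     [0, 0, 3, 1]]
s = max(sum(1 for x in row if x != 0) for row in A)
lhs = induced_one_norm(A)
rhs = math.sqrt(s) * spectral_norm(A)
print(f"s={s}, ||A||_1={lhs}, sqrt(s)*||A||={rhs:.4f}")
```

For this matrix both norms equal 4, so the bound holds with a factor $\sqrt{2}$ to spare.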

3. We note that, instead of qRAM, we could also rely on the sparse-access model in order to prepare the states given in Eq. (4.17). The time complexity of the state preparation for the states in Eq. (4.17) in the sparse-access model results in an additional $O(s)$ factor (for computing $\sigma_j$) compared to the qRAM model, and therefore in a polynomial slowdown. Therefore, in order to simulate $H$ for time $t$ in the sparse-access model, the time complexity of the algorithm in terms of $t$, $s$, and $\|H\|$ is $O(t s^{1.5} \|H\|)$, as $\|H\|_1 \leq \sqrt{s}\,\|H\|$, which implies that the methods presented here do not give any advantage over previous results in the sparse-access model.

Similarly to the quantum algorithm, we have also designed a fast classical algorithm for Hamiltonian simulation which indeed also depends only on the sparsity of the matrix and the norm. Our classical algorithm requires $H$ to be row-searchable and row-computable, which we will define later. In short, these requirements are both fulfilled if we are given access to the memory structure described in Definition 3. Informally, our results can be summarised as follows:

Theorem 8 (Hamiltonian Simulation With The Nyström Method (Informal version of Theorem 11)). Let $H \in \mathbb{C}^{N\times N}$ be a Hermitian matrix which is stored in the memory structure given in Definition 3 and has at most $s$ non-zero entries per row, and let $\psi \in \mathbb{C}^N$ be an $n$-qubit quantum state with at most $q$ non-zero entries. Then, there exists an algorithm that, with probability of at least $1-\delta$, approximates any chosen amplitude of the state $e^{iHt}\psi$ in time

\[ O\left( sq + \frac{t^9\,\|H\|_F^4\,\|H\|^7}{\varepsilon^4} \left( n + \log\frac{1}{\delta} \right)^2 \right), \]

up to error $\varepsilon$ in spectral norm, where $\|\cdot\|_F$ is the Frobenius norm, $\|\cdot\|$ is the spectral norm, $s$ is the maximum number of non-zero elements in the rows or columns of $H$, and $q$ is the number of non-zero elements in $\psi$.

Our algorithm is efficient in the low-rank and sparse regime, so if we compare our algorithm to the best quantum algorithm in the sparse-input model or the black-box Hamiltonian simulation model, only a polynomial slowdown occurs. A recently published result [21] is efficient when $H$ is only low-rank; in comparison, its time complexity scales as $\mathrm{poly}(t, \|H\|_F, 1/\varepsilon)$, where $\|H\|_F$ is the Frobenius norm of $H$. Our algorithm therefore has stricter requirements, namely sparsity of the Hamiltonian and of the input state $|\psi\rangle$, but achieves much lower polynomial dependencies. By analysing the dependency of the runtime on the Frobenius norm, we can determine under which conditions we obtain efficient Hamiltonian simulation, i.e., a runtime polylogarithmic in the dimension. Informally, we obtain:

Corollary 9 (Informal). If $H$ is a Hamiltonian on $n$ qubits with at most $s = O(\mathrm{polylog}(N))$ entries per row such that $\|H\|_F^2 - \frac{1}{N}\mathrm{Tr}(H)^2 \leq O(\mathrm{polylog}(N))$, and if $\psi$ is an $n$-qubit quantum state with at most $q = O(\mathrm{polylog}(N))$ entries, then there exists an efficient algorithm that approximates any chosen amplitude of the state $e^{iHt}\psi$.

While these results rule out the possibility of exponential speedups of our quantum algorithm in the low-rank regime, we note that the complexity of the quantum algorithm has a much lower polynomial degree than that of the classical algorithms. The quantum algorithm hence still has a large polynomial speedup over the classical algorithm (for low-rank Hamiltonians and dense Hamiltonians).

4.4.2 Related work

As already discussed earlier, there is a range of previous results. Many of these are given in different data access models, so in the following we briefly discuss the main results in the respective access models.

Hamiltonian simulation with $\|H\|$ dependence. The black-box model for Hamiltonian simulation is a special case of the sparse-access model. It is suitable for non-sparse Hamiltonians and allows the algorithm to query the oracle with an index-pair $|i, j\rangle$, which then returns the corresponding entry of the Hamiltonian $H$, i.e., $O_H$ defined in Eq. (4.11). Access to a Hamiltonian in the black-box model allows us to simulate $H$ with error $\varepsilon$ with query complexity

\[ O\left( (\|H\| t)^{3/2} N^{3/4} / \sqrt{\varepsilon} \right) \]

for dense Hamiltonians [81]. Empirically, the authors of [81] additionally observed that, for several classes of Hamiltonians, the actual number of queries required for simulation even reduces to

\[ O\left( \sqrt{N} \log(N) \right). \]

However, this $\tilde{O}(\sqrt{N})$ dependence does not hold provably for all Hamiltonians, and was therefore left as an open problem. After the first version of this work was made public, this open problem was almost resolved by Low [92], who proposed a quantum algorithm for simulating black-box Hamiltonians with time complexity

\[ O\left( (t\sqrt{N}\,\|H\|)^{1+o(1)} / \varepsilon^{o(1)} \right). \]

We note that the qRAM model is stronger than the black-box model, and therefore the improvements could indeed stem from the input model. However, our model to date gives the best performance for quantum machine learning applications, and therefore might overall allow us to achieve superior performance in terms of computational complexity.

Hamiltonian simulation with $\|H\|_{\max}$ dependence. As previously noted, the qRAM model is stronger than the black-box model and the sparse-access model. One immediate implication of this is that prior quantum algorithms such as [81, 90] can immediately be used to simulate Hamiltonians in the qRAM model. Using black-box Hamiltonian simulation for an $s$-sparse Hamiltonian, the circuit depth is then given by $\tilde{O}(t s \|H\|_{\max})$ [81, 90]. For non-sparse $H$, this implies a scaling of $\tilde{O}(t N \|H\|_{\max})$. Since $\|H\| \leq \sqrt{N}\,\|H\|_{\max}$ implies $\tilde{O}(t\sqrt{N}\,\|H\|) = \tilde{O}(t N \|H\|_{\max})$, our results do not give an advantage if the computational complexity is expressed in terms of $\|H\|_{\max}$. However, we can express the computational complexity in terms of $\|H\|$, which plays a crucial role in solving linear systems [36, 93] and in black-box unitary implementation [81]. In this setting, we achieve a quadratic improvement w.r.t. the dimensionality dependence, as the inequality $\|H\|_{\max} \leq \|H\|$ implies $\tilde{O}(t N \|H\|_{\max}) = \tilde{O}(t N \|H\|)$.

Hamiltonian simulation in the qRAM model. Shortly after the first version of our results was made available, Chakraborty, Gilyén, and Jeffery [37] independently proposed a quantum algorithm for simulating non-sparse Hamiltonians. Their algorithm makes use of a qRAM input model similar to the one proposed by Kerenidis [16], and achieves the same computational complexity as our result. However, their work also generalises our results, since it is framed in a more general input model, namely the block-encoding model. This model was first proposed in [87], and it assumes that we are given a unitary $\begin{pmatrix} H/\alpha & \cdot \\ \cdot & \cdot \end{pmatrix}$ that contains the Hamiltonian $H/\alpha$ (where we want to simulate $H$) in its upper-left block. The time evolution $e^{-iHt}$ is then performed in $\tilde{O}(\alpha \|H\| t)$ time. Using the sparse-access model for an $s$-sparse Hamiltonian $H$, a block-encoding of $H$ with $\alpha = s$ can be efficiently implemented, implying an algorithm for Hamiltonian simulation with time complexity [87]

\[ \tilde{O}(s \|H\| t). \]

The main result for Hamiltonian simulation described in [37] then takes the qRAM model for an $s$-sparse Hamiltonian $H$, and shows that a block-encoding with $\alpha = \sqrt{s}$ can be efficiently implemented. This yields an algorithm for Hamiltonian simulation with time complexity

\[ \tilde{O}(\sqrt{s}\,\|H\| t). \]

We note that the techniques introduced in [37] have been generalised in [97] to a quantum framework for implementing singular value transformation of matrices. Furthermore, [37] gives a detailed analysis which applies their results to the quantum linear systems algorithm, which we have omitted in our analysis.

Quantum-inspired classical algorithms for Hamiltonian simulation. The problem of applying functions of matrices has been studied intensively in numerical linear algebra. The exponential function is one application which has received particular interest in the literature [98, 99, 100, 101]. For arbitrary Hermitian matrices, no algorithm is currently known that exhibits a runtime logarithmic in the dimension of the input matrix. However, such runtimes are required if we truly want to simulate the time evolution of quantum systems, as the dimension of the matrix that governs the evolution scales exponentially with the number of quantum objects (such as atoms or orbitals) in the system.

More recently, the field of randomised numerical linear algebra, from which we have previously seen the example of randomised matrix multiplication (cf. Section 4.3), has enabled new approaches to such problems. These methods, along with results from spectral graph theory, have culminated in a range of new classical algorithms for matrix functions. In particular, they have also recently given new algorithms for approximate matrix exponentials [70, 102, ?, 69, 55]. Previous results typically hold for matrices which offer some form of structure. For example, Orecchia et al. [103] combined function approximations with the spectral sparsifiers of Spielman and Teng [104, 105] into a new algorithm that can approximate exponentials of strictly diagonally dominant matrices in time almost linear in the number of non-zero entries of $H$. A standard approach is to calculate low-rank approximations of matrices, and then use these within function approximations to implement fast algorithms for matrix functions [106]. Although these methods have many practical applications, they are not suitable for the application at hand, i.e., Hamiltonian simulation, since they produce sketches that do not generally preserve the given symmetries of the input matrix.

For problems where the symmetry of the sketched matrix is preserved, alternative methods such as the Nyström method have been proposed. The Nyström method was originally developed for the approximation of kernel matrices in statistical learning theory. Informally, Nyström methods construct a lower-dimensional, symmetric, positive semidefinite approximation of the input matrix by sampling from the input columns. More specifically, let $A \in \mathbb{R}^{N\times N}$ be a symmetric, rank-$r$, positive semidefinite matrix, $A_j$ the $j$-th column vector of $A$, and $A^i$ the $i$-th row vector of $A$, with singular value decomposition $A = U\Sigma U^H$ ($\Sigma = \mathrm{diag}(\sigma_1,\dots,\sigma_r)$), and Moore-Penrose pseudoinverse $A^{+} = \sum_{t=1}^{r} \sigma_t^{-1} U_t (U_t)^H$, where $U_t$ is the $t$-th column of $U$. The Nyström method then finds a low-rank approximation for the input matrix $A$ which is close to $A$ in spectral norm (or Frobenius norm), and which also preserves the symmetry and positive semi-definiteness of the matrix. Let $C$ denote the $N \times l$ matrix formed by (uniformly) sampling $l \ll N$ columns of $A$, $W$ denote the $l \times l$ matrix consisting of the intersection of these $l$ columns with the corresponding $l$ rows of $A$, and $W_k$ denote the best rank-$k$ approximation of $W$, i.e.,

\[ W_k = \operatorname*{argmin}_{V \in \mathbb{R}^{l\times l},\ \mathrm{rank}(V) = k} \|V - W\|_F. \]

The Nyström method therefore returns a rank-$k$ approximation $\tilde{A}_k$ of $A$ for $k < N$ defined by:

\[ \tilde{A}_k = C W_k^{+} C^{\top} \approx A. \]

The running time of the algorithm is $O(Nkl)$ [107]. There exist many ways of sampling from the initial matrix $A$ in order to create the approximation $\tilde{A}_k$, and in particular non-uniform sampling schemes enable us to improve the performance, see for example Theorem 3 in [108]. The Nyström method results in particularly good approximations for matrices which are approximately low rank. The first applications it was developed for were regression and classification problems based on Gaussian processes, for which Williams and Seeger developed a sampling-based algorithm [54, 109]. Since the technique exhibits similarities to a method for solving linear integral equations which was developed by Nyström [110], they called their result the Nyström method. Other methods which have been developed more recently include the Nyström extension, which has found application in large-scale machine learning problems, statistics, and signal processing [54, 109, 111, 112, 113, 107, 114, 115, 116, 117, 118, 119].
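The construction $\tilde{A}_k = C W_k^{+} C^{\top}$ can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the implementation analysed in [107]: columns are sampled uniformly without replacement, $W_k$ is obtained from an eigendecomposition of $W$, and the function name and tolerance are hypothetical.

```python
import numpy as np

def nystrom_approximation(a, num_cols, rank, rng):
    """Rank-`rank` Nystrom approximation C W_k^+ C^T of a symmetric PSD matrix."""
    n = a.shape[0]
    # Assumption: uniform sampling of l column indices without replacement.
    idx = rng.choice(n, size=num_cols, replace=False)
    c = a[:, idx]   # N x l matrix of sampled columns
    w = c[idx, :]   # l x l intersection with the corresponding rows
    # For symmetric PSD W the best rank-k approximation keeps the top-k eigenpairs.
    vals, vecs = np.linalg.eigh(w)
    order = np.argsort(vals)[::-1][:rank]
    vals_k, vecs_k = vals[order], vecs[:, order]
    # Pseudoinverse of W_k: invert only eigenvalues that are numerically non-zero.
    keep = vals_k > 1e-10 * vals_k.max()
    w_k_pinv = (vecs_k[:, keep] / vals_k[keep]) @ vecs_k[:, keep].T
    return c @ w_k_pinv @ c.T

# Usage on a synthetic, exactly rank-3 PSD matrix.
rng = np.random.default_rng(0)
b = rng.standard_normal((40, 3))
a = b @ b.T
approx = nystrom_approximation(a, num_cols=8, rank=3, rng=rng)
print(np.linalg.norm(a - approx) / np.linalg.norm(a))
```

For a matrix of exact rank $r$, any $l \geq r$ sampled columns whose intersection block $W$ has rank $r$ reproduce $A$ up to rounding, which is why the example uses a synthetic low-rank input; for approximately low-rank matrices the error depends on the sampling scheme.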

After the quantum algorithm presented here and our Nyström algorithm for Hamiltonian simulation, which can use the memory model from Def. 3, were made public, a general framework was proposed for such quantum-inspired classical algorithms based on the classical memory structure from Def. 3 [21], namely sampling and query access. Notably, this memory structure was first proposed by Tang [17]. Although their framework generalises all quantum-inspired algorithms using the input model from Def. 3, our classical algorithm does not require the specified input model. Furthermore, our algorithm achieves significantly lower polynomial dependencies: while the time complexity of their algorithm has a 36-th power dependency on the Frobenius norm of the Hamiltonian, our algorithm scales with a 4-th power. This, however, comes at the cost of a sparsity requirement on the Hamiltonian and the input state, which can be restrictive in practical cases, i.e., in non-sparse but low-rank cases. Therefore, the results of [21] might generally be better suited to dense Hamiltonians. A big difference is that the algorithms of Tang and others solve the same problem as the quantum algorithm that we developed, i.e., they allow us to sample from the output distribution of the time evolution. Taking this into account, we believe our classical Nyström-based algorithm can be translated into this framework as well by using rejection sampling approaches similar to the ones introduced in [17]. However, we leave this as an open question for future work.

Summary of related results. To summarise this subsection, we provide Table 4.3 for state-of-the-art algorithms for Hamiltonian simulation which include quantum and classical ones.

| Model | State-of-the-art | Advantage of our quantum algorithm |
|---|---|---|
| Sparse-access with $\|H\|_{\max}$ dependence | $\tilde{O}(t s \|H\|_{\max})$ [90] | No advantage |
| Sparse-access with $\|H\|$ dependence | $O((t\sqrt{s}\,\|H\|)^{1+o(1)}/\varepsilon^{o(1)})$ [92] | Subpolynomial improvement in $t$, $s$; exponential improvement in $\varepsilon$ |
| qRAM | $\tilde{O}(t\sqrt{s}\,\|H\|)$ [37] | Same result |
| Classical sampling and query access | $\mathrm{poly}(t, \|H\|_F, 1/\varepsilon)$ [21] | Polynomial speedup |
| Classical sampling and query access, sparse input | $\mathrm{poly}(s, t, \|H\|_F, 1/\varepsilon)$ [Thm. 8] | Polynomial speedup |

Figure 4.3: Comparing our result $O(t\sqrt{s}\,\|H\|\,\mathrm{polylog}(t, s, \|H\|, 1/\varepsilon))$ with other quantum and classical algorithms for different models. Since the qRAM model is stronger than the sparse-access model and the classical sampling and query access model, we consider the advantage of our algorithm against others when they are directly applied to the qRAM model.

4.4.3 Applications

In the following, we also lay out a few applications of Hamiltonian simulation. Since the main focus of this thesis is quantum machine learning, we in particular refer to applications in this area. For this, we will discuss the quantum linear systems algorithm, which is a key subroutine in most of the existing quantum machine learning algorithms with a claimed exponential speedup. Of course, other applications exist, such as the estimation of properties like ground state energies of chemical systems [120].

Unitary implementation. One application which follows directly from the ability to simulate non-sparse Hamiltonians is the ability to approximately implement an arbitrary unitary matrix, the so-called unitary implementation problem: given access to the entries of a unitary $U$, construct a quantum circuit $\tilde{U}$ such that $\|U - \tilde{U}\| \leq \varepsilon$ for some fixed $\varepsilon$. Unitary implementation can be reduced to Hamiltonian simulation as shown by Berry and others [81, 121]. For this, consider the Hamiltonian of the form

\[ H = \begin{pmatrix} 0 & U \\ U^H & 0 \end{pmatrix}. \tag{4.15} \]

By performing the time evolution according to $e^{-iH\pi/2}$ on the state $|1\rangle|\psi\rangle$, we are able to implement the operation $e^{-iH\pi/2}|1\rangle|\psi\rangle = -i|0\rangle U|\psi\rangle$, which means that we can apply $U$ to $|\psi\rangle$. If the entries of $U$ are stored in a data structure similar to Definition 2, then using our Hamiltonian simulation algorithm we can implement the unitary $U$ with time complexity (circuit depth) $O(\sqrt{N}\,\mathrm{polylog}(N, 1/\varepsilon))$.
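The identity $e^{-iH\pi/2}|1\rangle|\psi\rangle = -i|0\rangle U|\psi\rangle$ follows from $H^2 = I$, which gives $e^{-iH\pi/2} = -iH$. The sketch below checks it numerically for a hypothetical single-qubit unitary; the matrix exponential is a plain truncated Taylor series and the chosen $U$ and $\psi$ are illustrative assumptions.

```python
import math

def matmul(a, b):
    """Dense matrix product of two lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def expm(m, terms=40):
    """Matrix exponential via a truncated Taylor series (adequate for small norms)."""
    n = len(m)
    result = [[complex(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for t in range(1, terms):
        term = [[x / t for x in row] for row in matmul(term, m)]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return result

# Hypothetical 2x2 unitary U and its conjugate transpose.
r = 1 / math.sqrt(2)
U = [[r, 1j * r], [1j * r, r]]
U_dag = [[complex(U[j][i]).conjugate() for j in range(2)] for i in range(2)]

# H = [[0, U], [U^H, 0]] as in Eq. (4.15); the first qubit labels the block.
H = [[0, 0] + U[0], [0, 0] + U[1], U_dag[0] + [0, 0], U_dag[1] + [0, 0]]
evolution = expm([[-1j * math.pi / 2 * x for x in row] for row in H])

psi = [0.6, 0.8j]
state = [0, 0] + psi  # the state |1>|psi>
out = [sum(evolution[i][k] * state[k] for k in range(4)) for i in range(4)]
expected = [-1j * sum(U[i][k] * psi[k] for k in range(2)) for i in range(2)] + [0, 0]
print(max(abs(x - y) for x, y in zip(out, expected)))
```

The printed deviation is at the level of floating-point rounding, so the output register indeed holds $-iU|\psi\rangle$ flagged by $|0\rangle$.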

Quantum linear systems solver. As previously mentioned, one of the major subroutines of the quantum linear systems algorithm is Hamiltonian simulation. Hamiltonian simulation, in particular in combination with the phase estimation algorithm [122], is used to retrieve the eigenvalues of the input matrix, which can then be inverted to obtain the matrix inverse. Assuming here for simplicity that $|b\rangle$ is entirely in the column-space of $A$, the linear systems algorithm solves a linear system of the form $Ax = b$ and outputs an approximation to the normalised solution $|x\rangle = |A^{-1}b\rangle$. For $s$-sparse matrices, [93] showed that $A^{-1}$ can be approximated as a linear combination of unitaries of the form $e^{-iAt}$. Notably, for non-Hermitian $A$, we can just apply the same ideas to the Hamiltonian

\[ H = \begin{pmatrix} 0 & A \\ A^H & 0 \end{pmatrix}, \tag{4.16} \]

which is Hermitian by construction and, owing to the logarithmic runtime, results in only a factor 2 overhead. In order to implement these unitaries we can then use any Hamiltonian simulation algorithm, such as [90], which results in an overall (gate) complexity of $O(s\kappa^2\,\mathrm{polylog}(N, \kappa/\varepsilon))$. For a non-sparse input matrix $A$, the algorithm however scales as $\tilde{O}(N)$ (again ignoring logarithmic factors). In some of our earlier results, we used a data structure similar to Definition 2 to design a quantum algorithm for solving linear systems for non-sparse matrices [94], which resulted in a time complexity (circuit depth) of $O(\kappa^2\sqrt{N}\,\mathrm{polylog}(N)/\varepsilon)$. Using again this data structure, similarly to [16, 94], our new Hamiltonian simulation algorithm, and the linear combination of unitaries (LCU) decompositions from [93], we can describe a quantum algorithm for solving linear systems for non-sparse matrices with time complexity (circuit depth)

\[ O\left( \kappa^2 \sqrt{N}\,\mathrm{polylog}(\kappa/\varepsilon) \right), \]

which is an exponential improvement in the error dependence compared to our previous result [94]. Notably, one drawback of our implementation, which has been resolved by [37] and subsequent works, is the high condition number dependency of our algorithm, which seems restrictive for practical applications.

4.4.4 Hamiltonian Simulation for dense matrices

We now show how we can use the data structure from Definition 2 to derive a fast quantum algorithm for non-sparse Hamiltonian simulation. In particular, we will prove Theorem 6. We prove the result in multiple steps. First, we show how to use the data structure to prepare a certain state. Next, we use the state preparation to perform a quantum walk. In the final step, we show how the quantum walk can be used to implement Hamiltonian simulation. This then leads to the main result.

State Preparation. Using the data structure from Definition 2, we can efficiently perform the mapping described in the following technical lemma for efficient state preparation.

Lemma 7 (State Preparation). Let $H \in \mathbb{C}^{N\times N}$ be a Hermitian matrix (where $N = 2^n$ for an $n$-qubit Hamiltonian) stored in the data structure as specified in Definition 2. Each entry $H_{jk}$ is represented with $b$ bits of precision. Then the following holds:

1. Let $\|H\|_1 = \max_j \sum_{k=0}^{N-1} |H_{jk}|$ as before. A quantum computer that has access to the data structure can perform the following mapping for $j \in \{0,\dots,N-1\}$,

\[ |j\rangle|0^{\log N}\rangle|0\rangle \mapsto \frac{1}{\sqrt{\|H\|_1}}\, |j\rangle \sum_{k=0}^{N-1} |k\rangle \left( \sqrt{H_{jk}^{*}}\,|0\rangle + \sqrt{\frac{\|H\|_1 - \sigma_j}{N}}\,|1\rangle \right), \tag{4.17} \]

with time complexity (circuit depth) $O(n^2 b^{5/2} \log b)$, where $\sigma_j = \sum_k |H_{jk}|$, and the square-root satisfies $\sqrt{H_{jk}} \left( \sqrt{H_{jk}^{*}} \right)^{*} = H_{jk}$.

2. The size of the data structure containing all $N^2$ complex entries is $O(N^2 \log^2(N))$.

In order to perform this mapping, we will need the following Lemma. We will use it in order to efficiently implement the conditional rotations of the qubits with complex numbers.

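Lemma 8. Let $\theta, \phi_0, \phi_1 \in \mathbb{R}$ and let $\tilde{\theta}, \tilde{\phi}_0, \tilde{\phi}_1$ be the $b$-bit finite precision representations of $\theta$, $\phi_0$, and $\phi_1$, respectively. Then there exists a unitary $U$ that performs the following mapping:

\[ U : |\tilde{\phi}_0\rangle|\tilde{\phi}_1\rangle|\tilde{\theta}\rangle|0\rangle \mapsto |\tilde{\phi}_0\rangle|\tilde{\phi}_1\rangle|\tilde{\theta}\rangle \left( e^{i\tilde{\phi}_0} \cos(\tilde{\theta})\,|0\rangle + e^{i\tilde{\phi}_1} \sin(\tilde{\theta})\,|1\rangle \right). \tag{4.18} \]

Moreover, $U$ can be implemented with $O(b)$ 1- and 2-qubit gates.

Proof. Define $U$ as

\[ U = \left( \sum_{\tilde{\phi}_0 \in \{0,1\}^b} |\tilde{\phi}_0\rangle\langle\tilde{\phi}_0| \otimes e^{i|0\rangle\langle 0|\tilde{\phi}_0} \right) \left( \sum_{\tilde{\phi}_1 \in \{0,1\}^b} |\tilde{\phi}_1\rangle\langle\tilde{\phi}_1| \otimes e^{i|1\rangle\langle 1|\tilde{\phi}_1} \right) \left( \sum_{\tilde{\theta} \in \{0,1\}^b} |\tilde{\theta}\rangle\langle\tilde{\theta}| \otimes e^{-iY\tilde{\theta}} \right), \tag{4.19} \]

where $Y = \begin{pmatrix} 0 & -i \\ i & 0 \end{pmatrix}$ is the Pauli $Y$ matrix. To implement the operator $\sum_{\tilde{\theta} \in \{0,1\}^b} |\tilde{\theta}\rangle\langle\tilde{\theta}| \otimes e^{-iY\tilde{\theta}}$, we use one rotation controlled on each qubit of the first register, with the rotation angles halved for each successive bit. The other two factors of $U$ can be implemented in a similar way. Therefore, $U$ can be implemented with $O(b)$ 1- and 2-qubit gates.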
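The bitwise decomposition used in the proof can be illustrated classically: applying one rotation per set bit, with the angle halved for each successive bit, composes to $e^{-iY\tilde{\theta}}$, and the two phase factors are built up the same way. The sketch below assumes, purely for illustration, that a $b$-bit register encodes the binary fraction $0.b_1 b_2 \dots b_b$ in radians; the function names are hypothetical.

```python
import cmath
import math

def binary_fraction(bits):
    """Value 0.b1 b2 ... bb of a bit list (most significant bit first)."""
    return sum(bit * 2.0 ** -(l + 1) for l, bit in enumerate(bits))

def conditional_rotation(phi0_bits, phi1_bits, theta_bits):
    """Apply the three controlled factors of Eq. (4.19) to |0>, one gate per control bit."""
    amp0, amp1 = 1.0 + 0j, 0j
    gates = 0
    for l, bit in enumerate(theta_bits):   # factor exp(-i Y theta)
        gates += 1
        if bit:
            angle = 2.0 ** -(l + 1)
            c, s = math.cos(angle), math.sin(angle)
            amp0, amp1 = c * amp0 - s * amp1, s * amp0 + c * amp1
    for l, bit in enumerate(phi1_bits):    # phase on |1>
        gates += 1
        if bit:
            amp1 *= cmath.exp(1j * 2.0 ** -(l + 1))
    for l, bit in enumerate(phi0_bits):    # phase on |0>
        gates += 1
        if bit:
            amp0 *= cmath.exp(1j * 2.0 ** -(l + 1))
    return (amp0, amp1), gates

theta_bits, phi0_bits, phi1_bits = [1, 0, 1, 1], [0, 1, 1, 0], [1, 1, 0, 1]
(amp0, amp1), gates = conditional_rotation(phi0_bits, phi1_bits, theta_bits)
theta = binary_fraction(theta_bits)
target0 = cmath.exp(1j * binary_fraction(phi0_bits)) * math.cos(theta)
target1 = cmath.exp(1j * binary_fraction(phi1_bits)) * math.sin(theta)
print(gates, abs(amp0 - target0), abs(amp1 - target1))
```

The gate counter grows as $3b$, matching the $O(b)$ count stated in the lemma.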

Before proving Lemma 7, we first describe the construction and the size of the data structure. Readers may refer to [16] for more details.

• The data structure is built from $N$ binary trees $D_i$, $i \in \{0,\dots,N-1\}$, and each tree is initially empty.

• When a new entry $(i, j, H_{ij})$ arrives, we create or update the leaf node $j$ in the tree $D_i$. Adding the entry takes $O(\log(N))$ time, since the depth of the tree for $H \in \mathbb{C}^{N\times N}$ is at most $\log(N)$. Since the path from the root to the leaf is of length at most $\log(N)$ (under the assumption that $N = 2^n$), we furthermore have to update at most $\log(N)$ nodes, which can be done in $O(\log(N))$ time if we store an ordered list of the levels in the tree.

• The total time for updating the tree with a new entry is given by $\log(N) \times \log(N) = \log^2(N)$.

• The memory requirements for $k$ entries are given by $O(k\log^2(N))$, as for every entry $(j, k, H_{jk})$ at most $\log(N)$ nodes are added and each node requires at most $O(\log(N))$ bits.
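The construction above can be sketched as a small classical data structure. This is an illustrative sketch only: the class name is hypothetical, nodes are kept in a dictionary keyed by (level, index), and it is assumed that each internal node stores the sum of $|H_{jk}|$ over the leaves beneath it, so that the root holds $\sigma_j$.

```python
class RowTree:
    """Binary tree D_j for one row of H with N = 2**n_qubits leaves."""

    def __init__(self, n_qubits):
        self.depth = n_qubits
        self.leaf = {}    # column index k -> stored entry H_jk
        self.weight = {}  # (level, index) -> sum of |H_jk| below that node

    def update(self, k, value):
        """Create or overwrite leaf k and refresh the log(N) nodes on its path."""
        delta = abs(value) - abs(self.leaf.get(k, 0))
        self.leaf[k] = value
        for level in range(self.depth, -1, -1):  # leaves (level = depth) up to root (level 0)
            node = (level, k >> (self.depth - level))
            self.weight[node] = self.weight.get(node, 0.0) + delta

    def sigma(self):
        """Row sum sigma_j = sum_k |H_jk|, read off the root."""
        return self.weight.get((0, 0), 0.0)

tree = RowTree(n_qubits=2)
for k, value in enumerate([0.5, -1.5, 2j, 1.0]):
    tree.update(k, value)
print(tree.sigma(), tree.weight[(1, 0)], tree.weight[(1, 1)])
```

Only nodes on the path of an inserted entry are ever created, which mirrors the memory argument in the last bullet: each new entry touches at most one node per level.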

Now we are ready to prove Lemma 7.

Proof of Lemma 7. With this data structure, we can perform the mapping specified in Eq. (4.17) with the following steps. For each $j$, we start from the root of $D_j$. Starting with the initial state $|j\rangle|0^{\log N}\rangle|0\rangle$, first apply the rotation (according to the value stored in the root node and calculating the normalisation in one query) on the last register to obtain the state

\[ \frac{1}{\sqrt{\|H\|_1}}\, |j\rangle|0^{\log N}\rangle \left( \sqrt{\sum_{k=0}^{N-1} |H_{jk}^{*}|}\,|0\rangle + \sqrt{\|H\|_1 - \sigma_j}\,|1\rangle \right), \tag{4.20} \]

which is normalised since $\sigma_j = \sum_k |H_{jk}|$, and we have by definition that $\left( \sqrt{\sum_{k=0}^{N-1} |H_{jk}^{*}|} \right)^{*} \cdot \sqrt{\sum_{k=0}^{N-1} |H_{jk}^{*}|} = \sum_k |H_{jk}| = \sigma_j$. Then a sequence of conditional rotations is applied on each qubit of the second register to obtain the state as in Eq. (4.17). At level $\ell$ of the binary tree $D_j$, a query to the data structure is made to load the data $c$ (stored in the node) into a register in superposition; the rotation to perform is proportional to $\left( \sqrt{c},\ \sqrt{(\|H\|_1 - \sigma_j)/2^{\ell}} \right)$ (assuming $\ell = 0$ at the root and $\ell = \log N$ at the leaves). Then the rotation angles are determined by calculating the square root and trigonometric functions on the output of the query: this can be implemented with $O(b^{5/2})$ 1- and 2-qubit gates using simple techniques based on Taylor series and long multiplication as in [90], where the error is smaller than that caused by truncating to $b$ bits. Then the conditional rotation is applied by the circuit described in Lemma 8, and the cost for the conditional rotation is $O(b)$. There are $n = \log(N)$ levels, so the cost excluding the implementation of the oracle is $O(n b^{5/2})$. To obtain quantum access to the classical data structure, a quantum addressing scheme is required. One addressing scheme described in [6] can be used. Although the circuit size of this addressing scheme is $\tilde{O}(N)$ for each $D_j$, its circuit depth is $O(n)$. Therefore, the time complexity (circuit depth) for preparing the state in Eq. (4.17) is $O(n^2 b^{5/2} \log n)$.

We use the following rules to determine the sign of the square-root of a complex number: if $H_{jk}$ is not a negative real number, we write $H_{jk} = r e^{i\varphi}$ (for $r \geq 0$ and $-\pi \leq \varphi \leq \pi$) and take $\sqrt{H_{jk}^{*}} = \sqrt{r}\, e^{-i\varphi/2}$; when $H_{jk}$ is a negative real number, we take $\sqrt{H_{jk}^{*}} = \mathrm{sign}(j-k)\, i \sqrt{|H_{jk}|}$ to avoid the sign ambiguity. With this recipe, we have $\sqrt{H_{jk}} \left( \sqrt{H_{jk}^{*}} \right)^{*} = H_{jk}$.

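In order to better convey the working of the data structure, we give in the following an example of the state preparation procedure based on the data structure in Fig. 4.1. For the sake of comprehensibility and simplicity, we only take an example with 4 leaves $\{c_0, c_1, c_2, c_3\}$, and hence $H = [c_0\ c_1\ c_2\ c_3]^{t}$. The initial state (omitting the first register) is $|00\rangle|0\rangle$. Let $\sigma = |c_0| + |c_1| + |c_2| + |c_3|$. Applying the first rotation, we obtain the state

\[ \begin{aligned} &\frac{1}{\sqrt{\|H\|_1}}\, |00\rangle \left( \sqrt{|c_0| + |c_1| + |c_2| + |c_3|}\,|0\rangle + \sqrt{\|H\|_1 - \sigma}\,|1\rangle \right) \\ &= \frac{1}{\sqrt{\|H\|_1}}\, |00\rangle \left( \sqrt{\sigma}\,|0\rangle + \sqrt{\|H\|_1 - \sigma}\,|1\rangle \right) \\ &= \frac{1}{\sqrt{\|H\|_1}} \left( \sqrt{\sigma}\,|00\rangle|0\rangle + \sqrt{\|H\|_1 - \sigma}\,|00\rangle|1\rangle \right). \end{aligned} \tag{4.21} \]

Then, applying a rotation on the first qubit of the first register conditioned on the last register, we obtain the state

\[ \frac{1}{\sqrt{\|H\|_1}} \left( \left( \sqrt{|c_0| + |c_1|}\,|00\rangle + \sqrt{|c_2| + |c_3|}\,|10\rangle \right) |0\rangle + \sqrt{\frac{\|H\|_1 - \sigma}{2}} \left( |00\rangle + |10\rangle \right) |1\rangle \right). \tag{4.22} \]

Next, applying a rotation on the second qubit of the first register conditioned on the first qubit of the first register and the last register, we obtain the desired state:

\[ \begin{aligned} &\frac{1}{\sqrt{\|H\|_1}} \Bigg( \sqrt{c_0}\,|00\rangle|0\rangle + \sqrt{c_1}\,|01\rangle|0\rangle + \sqrt{c_2}\,|10\rangle|0\rangle + \sqrt{c_3}\,|11\rangle|0\rangle \\ &\qquad + \sqrt{\frac{\|H\|_1 - \sigma}{4}} \Big( |00\rangle|1\rangle + |01\rangle|1\rangle + |10\rangle|1\rangle + |11\rangle|1\rangle \Big) \Bigg) \\ &= \frac{1}{\sqrt{\|H\|_1}} \Bigg( |00\rangle \left( \sqrt{c_0}\,|0\rangle + \sqrt{\frac{\|H\|_1 - \sigma}{4}}\,|1\rangle \right) + |01\rangle \left( \sqrt{c_1}\,|0\rangle + \sqrt{\frac{\|H\|_1 - \sigma}{4}}\,|1\rangle \right) \\ &\qquad + |10\rangle \left( \sqrt{c_2}\,|0\rangle + \sqrt{\frac{\|H\|_1 - \sigma}{4}}\,|1\rangle \right) + |11\rangle \left( \sqrt{c_3}\,|0\rangle + \sqrt{\frac{\|H\|_1 - \sigma}{4}}\,|1\rangle \right) \Bigg). \end{aligned} \tag{4.23} \]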
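The three states of this example can be traced numerically. The sketch below is a hedged illustration that tracks amplitude magnitudes only (phases of the $\sqrt{c_k}$ are ignored); the leaf values and the value of $\|H\|_1$ are hypothetical, and the function name is not from the source.

```python
import math

def level_amplitudes(leaves, norm1, level):
    """Amplitude magnitudes after `level` conditional rotations.

    Returns a dict mapping (node index at this level, flag qubit) to a magnitude,
    where flag 0 carries the tree weights and flag 1 the uniform padding.
    """
    n = len(leaves)
    sigma = sum(abs(c) for c in leaves)
    block = n >> level  # number of leaves below each node at this level
    amps = {}
    for node in range(2 ** level):
        weight = sum(abs(c) for c in leaves[node * block:(node + 1) * block])
        amps[(node, 0)] = math.sqrt(weight / norm1)
        amps[(node, 1)] = math.sqrt((norm1 - sigma) / (2 ** level * norm1))
    return amps

leaves = [0.5, 1.5, 2.0, 1.0]  # hypothetical c_0, ..., c_3 with sigma = 5
norm1 = 8.0                    # hypothetical ||H||_1 >= sigma
for level in range(3):         # corresponds to Eqs. (4.21), (4.22), (4.23)
    amps = level_amplitudes(leaves, norm1, level)
    total = sum(a ** 2 for a in amps.values())
    print(level, round(total, 12), amps)
```

At every level the squared magnitudes sum to one, and at the last level the flag-1 magnitude equals $\sqrt{(\|H\|_1 - \sigma)/4}/\sqrt{\|H\|_1}$ for each leaf.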

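The Quantum Walk Operator. Based on the data structure specified in Definition 2 and the efficient state preparation in Lemma 7, we construct a quantum walk operator for $H$ as follows. First define the isometry $T$ as

\[ T = \sum_{j=0}^{N-1} \sum_{b \in \{0,1\}} \left( |j\rangle\langle j| \otimes |b\rangle\langle b| \right) \otimes |\varphi_{jb}\rangle, \tag{4.24} \]

with $|\varphi_{j1}\rangle = |0\rangle|1\rangle$ and

\[ |\varphi_{j0}\rangle = \frac{1}{\sqrt{\|H\|_1}} \sum_{k=0}^{N-1} |k\rangle \left( \sqrt{H_{jk}^{*}}\,|0\rangle + \sqrt{\frac{\|H\|_1 - \sigma_j}{N}}\,|1\rangle \right), \tag{4.25} \]

where $\sigma_j = \sum_{k=0}^{N-1} |H_{jk}|$. Let $S$ be the swap operator that maps $|j_0\rangle|b_0\rangle|j_1\rangle|b_1\rangle$ to $|j_1\rangle|b_1\rangle|j_0\rangle|b_0\rangle$, for all $j_0, j_1 \in \{0,\dots,N-1\}$ and $b_0, b_1 \in \{0,1\}$. We observe that

\[ \langle j|\langle 0|\, T^H S T\, |k\rangle|0\rangle = \frac{\sqrt{H_{jk}} \left( \sqrt{H_{jk}^{*}} \right)^{*}}{\|H\|_1} = \frac{H_{jk}}{\|H\|_1}, \tag{4.26} \]

where the second equality is ensured by the choice of the square-root as in the proof of Lemma 7. This implies that

\[ (I \otimes \langle 0|)\, T^H S T\, (I \otimes |0\rangle) = \frac{H}{\|H\|_1}. \tag{4.27} \]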
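Equation (4.26) can be checked entry by entry on a small example. The following sketch builds the vectors $T|j\rangle|0\rangle$ explicitly, applies the swap, and compares the overlap with $H_{jk}/\|H\|_1$, using the square-root rule from the proof of Lemma 7. The helper names and the test matrix are illustrative assumptions; in particular, the sketch assumes a non-negative diagonal, because $\mathrm{sign}(j-k)$ is not defined by the stated rule when $j = k$.

```python
import cmath
import math

def sqrt_conj(h, j, k):
    """Branch of sqrt(H_jk^*) chosen by the sign rule in the proof of Lemma 7."""
    h = complex(h)
    if h.imag == 0 and h.real < 0:
        sign = (j > k) - (j < k)  # assumption: negative entries occur only for j != k
        return sign * 1j * math.sqrt(-h.real)
    r, phi = cmath.polar(h)
    return math.sqrt(r) * cmath.exp(-1j * phi / 2)

def walk_state(h, j, norm1):
    """T|j>|0> as a dict keyed by basis labels (j0, b0, j1, b1)."""
    n = len(h)
    sigma = sum(abs(x) for x in h[j])
    pad = math.sqrt(max(norm1 - sigma, 0.0) / n)
    state = {}
    for k in range(n):
        state[(j, 0, k, 0)] = sqrt_conj(h[j][k], j, k) / math.sqrt(norm1)
        state[(j, 0, k, 1)] = pad / math.sqrt(norm1)
    return state

def swap(state):
    """Swap operator S exchanging the two (index, flag) register pairs."""
    return {(j1, b1, j0, b0): amp for (j0, b0, j1, b1), amp in state.items()}

def block_entry(h, j, k, norm1):
    """<j|<0| T^H S T |k>|0> computed as an explicit inner product."""
    left = walk_state(h, j, norm1)
    right = swap(walk_state(h, k, norm1))
    return sum(complex(amp).conjugate() * right.get(key, 0) for key, amp in left.items())

# Hypothetical 4x4 Hermitian matrix with negative and complex off-diagonal entries.
H = [[1, -2, 1 + 1j, 0],
     [-2, 0.5, 3j, 0],
     [1 - 1j, -3j, 2, -1],
     [0, 0, -1, 0]]
norm1 = max(sum(abs(x) for x in row) for row in H)
worst = max(abs(block_entry(H, j, k, norm1) - H[j][k] / norm1)
            for j in range(4) for k in range(4))
print(worst)
```

The printed maximum deviation is at the level of rounding error, so the top-left block of $T^H S T$ reproduces $H/\|H\|_1$ for this example.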

The quantum walk operator U is defined as

\[ U = iS\left( 2TT^H - I \right). \tag{4.28} \]

A more general characterization of the eigenvalues of quantum walks is presented in [123]. Here we give a specific proof on the relationship between the eigenvalues of U and H as follows.

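Lemma 9. Let the unitary operator $U$ be defined as in Eq. (4.28), and let $\lambda$ be an eigenvalue of $H$ with eigenstate $|\lambda\rangle$. It holds that

\[ U|\mu_\pm\rangle = \mu_\pm |\mu_\pm\rangle, \tag{4.29} \]

where

\[ |\mu_\pm\rangle = (T + i\mu_\pm S T)\,|\lambda\rangle|0\rangle, \quad \text{and} \tag{4.30} \]

\[ \mu_\pm = \pm e^{\pm i \arcsin(\lambda/\|H\|_1)}. \tag{4.31} \]

Proof. By the fact that $T^H T = I$, $(I \otimes \langle 0|)\, T^H S T\, (I \otimes |0\rangle) = H/\|H\|_1$, and $(I \otimes \langle 1|)\, T^H S T\, (I \otimes |0\rangle) = 0$, we have

\[ U|\mu_\pm\rangle = \mu_\pm T|\lambda\rangle|0\rangle + i\left( 1 + \frac{2\lambda i}{\|H\|_1}\mu_\pm \right) S T|\lambda\rangle|0\rangle. \tag{4.32} \]

In order for this state to be an eigenstate, it must hold that

\[ 1 + \frac{2\lambda i}{\|H\|_1}\mu_\pm = \mu_\pm^2, \tag{4.33} \]

and the solution is

\[ \mu_\pm = \frac{\lambda i}{\|H\|_1} \pm \sqrt{1 - \frac{\lambda^2}{\|H\|_1^2}} = \pm e^{\pm i \arcsin(\lambda/\|H\|_1)}. \tag{4.34} \]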
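As a small worked check of Eqs. (4.33) and (4.34), the sketch below evaluates both branches $\mu_\pm$ for a few hypothetical values of $\lambda$ and $\|H\|_1$ and confirms that they satisfy the quadratic condition as well as the identity $\mu_\pm - 1/\mu_\pm = 2i\lambda/\|H\|_1$, which is what makes the Bessel generating function produce $e^{iz\lambda/\|H\|_1}$. The function name is an assumption for illustration.

```python
import cmath
import math

def walk_eigenvalues(lam, norm1):
    """Both eigenvalue branches mu_+ and mu_- of Eq. (4.31)."""
    theta = math.asin(lam / norm1)
    return cmath.exp(1j * theta), -cmath.exp(-1j * theta)

residuals = []
for lam in (-0.9, 0.0, 0.4, 1.2):
    norm1 = 1.5  # hypothetical value of ||H||_1 with |lam| <= norm1
    nu = lam / norm1
    for mu in walk_eigenvalues(lam, norm1):
        quadratic = abs(mu ** 2 - (1 + 2j * nu * mu))  # Eq. (4.33)
        generating = abs((mu - 1 / mu) / 2 - 1j * nu)  # (mu - 1/mu)/2 = i * nu
        residuals.append(max(quadratic, generating))
print(max(residuals))
```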

Linear combination of unitaries and Hamiltonian simulation. Next, we need to convert the quantum walk operator $U = iS(2TT^H - I)$ into an operator for Hamiltonian simulation. For this, we consider the generating functions for the Bessel functions, denoted by $J_m(\cdot)$. From [124, (9.1.41)], we know that it holds that

\[ \sum_{m=-\infty}^{\infty} J_m(z)\,\mu_\pm^m = \exp\left( \frac{z}{2}\left( \mu_\pm - \frac{1}{\mu_\pm} \right) \right) = e^{iz\lambda/\|H\|_1}, \tag{4.35} \]

where the second equality follows from Eq. (4.31) and the fact that $\sin(x) = (e^{ix} - e^{-ix})/2i$. This leads to the following linear combination of unitaries:

\[ V_\infty = \sum_{m=-\infty}^{\infty} \frac{J_m(z)}{\sum_{j=-\infty}^{\infty} J_j(z)}\, U^m = \sum_{m=-\infty}^{\infty} J_m(z)\, U^m = e^{izH/\|H\|_1}, \tag{4.36} \]

where the second equality follows from the fact that $\sum_{j=-\infty}^{\infty} J_j(z) = 1$.

Since we cannot in practice implement the infinite sum, we will in the following find an approximation to $e^{izH/\|H\|_1}$ by truncating the sum in Eq. (4.36):

\[ V_k = \sum_{m=-k}^{k} \frac{J_m(z)}{\sum_{j=-k}^{k} J_j(z)}\, U^m. \tag{4.37} \]

Here the coefficients are normalised by $\sum_{j=-k}^{k} J_j(z)$ so that they sum to 1. This minimises the approximation error (see the proof of Lemma 10); the normalisation trick originated in [90]. The eigenvalues of $V_k$ are

\[ \sum_{m=-k}^{k} \frac{J_m(z)}{\sum_{j=-k}^{k} J_j(z)}\, \mu_\pm^m. \tag{4.38} \]

Note that each eigenvalue of $V_k$ does not depend on $\pm$, as $J_{-m}(z) = (-1)^m J_m(z)$. To bound the error in this approximation, we require the following technical lemma.

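Lemma 10. Let $V_k$ and $V_\infty$ be defined as above. There exists a positive integer $k$ satisfying $k \geq |z|$ and

\[ k = O\left( \frac{\log(\|H\|/(\|H\|_1 \varepsilon))}{\log\log(\|H\|/(\|H\|_1 \varepsilon))} \right), \tag{4.39} \]

such that

\[ \|V_k - V_\infty\| \leq \varepsilon. \tag{4.40} \]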

Proof. The proof outlined here follows closely the proof of Lemma 8 in [90]. Re- calling the definition of Vk and V∞, we define the weights in Vk by

Jm(z) αm := , (4.41) Ck

k where Ck = ∑l=−k Jl(z). The normalisation here is chosen so that ∑m am = 1 which will give the best result [90]. Since ∞ k ∑ Jm(z) = ∑ Jm(z) + ∑ Jm(z) = 1, (4.42) m=−∞ m=[−∞:−k−1;k+1:∞] m=−k observe that we have two error sources. The first one comes from the truncation of the series, and the second one comes from the different renormalisation of the terms which introduces an error in the first |m| ≤ k terms in the sum. We therefore start by bounding the normalisation factor Ck. For the Bessel-functions for all m it holds 1 z |m| m that |Jm(z)| ≤ |m|! 2 , since J−m(z) = (−1) Jm(z) [124, (9.1.5)]. For |m| ≤ k we can hence find the following bound on the truncated part

∞ ∞ |z/2|m |J (z)| = 2 |J (z)| ≤ 2 ∑ m ∑ m ∑ m m=[−∞:−k−1;k+1:∞] m=k+1 m=k+1 ! |z/2|k+1  |z/2| |z/2|2  = 2 1 + + + ··· (k + 1)! k + 2 (k + 2)(k + 3) |z/2|k+1 ∞ 1m−(k+1) 4|z/2|k+1 < 2 ∑ = . (4.43) (k + 1)! m=k+1 2 (k + 1)! 4.4. Hamiltonian Simulation 96

Since ∑m Jm(z) = 1, based on the normalisation, we hence find that

k  4|z/2|k+1  ∑ Jm(z) ≥ 1 − , (4.44) m=−k (k + 1)!

Jm(z) which is a lower bound on the normalisation factor C . Since am = , the cor- k Ck rection is small, which implies that

 |z/2|k+1  a = J (z) 1 + O , (4.45) m m (k + 1)! and we have a multiplicative error based on the renormalisation.

Next we want to bound the error in the truncation before we join the two error sources. From Eq. (4.35) we know that

$$e^{iz\lambda/\Lambda} - 1 = \sum_{m=-\infty}^{\infty} J_m(z)\,(\mu_\pm^m - 1), \tag{4.46}$$

by the normalisation of $\sum_m J_m(z)$. From this we can obtain a bound on the truncated sum over $J_m(z)$ as follows.

$$\sum_{m=-k}^{k} J_m(z)\,(\mu_\pm^m - 1) = e^{iz\lambda/\Lambda} - 1 - \sum_{m=[-\infty:-(k+1);\,(k+1):\infty]} J_m(z)\,(\mu_\pm^m - 1). \tag{4.47}$$

Therefore we can upper bound the left-hand side in terms of the exact value of $V_\infty$, i.e., $e^{iz\lambda/\Lambda}$, if we can bound the right-most term in Eq. (4.47). Using furthermore the bound in Eq. (4.45) we obtain

$$\sum_{m=-k}^{k} \alpha_m\,(\mu_\pm^m - 1) = \left(e^{iz\lambda/\Lambda} - 1 - \sum_{m=[-\infty:-(k+1);\,(k+1):\infty]} J_m(z)\,(\mu_\pm^m - 1)\right)\left(1 + O\!\left(\frac{|z/2|^{k+1}}{(k+1)!}\right)\right), \tag{4.48}$$

which reduces with $|e^{iz\lambda/\Lambda} - 1| \le |z\lambda/\Lambda|$ and $|z| \le k$ to

$$\sum_{m=-k}^{k} \alpha_m\,(\mu_\pm^m - 1) = e^{iz\lambda/\Lambda} - 1 - O\!\left(\sum_{m=[-\infty:-(k+1);\,(k+1):\infty]} J_m(z)\,(\mu_\pm^m - 1)\right). \tag{4.49}$$

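We can then obtain the desired bound on $\|V_\infty - V_k\|$ by reordering the above equation and using that $\sum_{m=-k}^{k} \alpha_m = 1$, such that we have

$$\|V_\infty - V_k\| = \left|\sum_{m=-k}^{k} \alpha_m \mu_\pm^m - e^{iz\lambda/\Lambda}\right| = O\!\left(\left|\sum_{m=[-\infty:-(k+1);\,(k+1):\infty]} J_m(z)\,(\mu_\pm^m - 1)\right|\right). \tag{4.50}$$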

We hence only need to bound the right-hand side. For $\mu_+$ we can use that $|\mu_+^m - 1| \le 2|m\lambda/\Lambda| =: 2|m\nu|$ and obtain the bound $\frac{2|\nu|}{k!}\left|\frac{z}{2}\right|^{k+1}$ [90]. For the $\mu_-$ case we need to refine the analysis and will show that the bound remains the same. Let $\nu := \lambda/\Lambda$ as above. First observe that $J_m(z)\mu_-^m + J_{-m}(z)\mu_-^{-m} = J_{-m}(z)\mu_+^{-m} + J_m(z)\mu_+^m$, and it follows that

$$\sum_{m=-\infty}^{-(k+1)} J_m(z)\,(\mu_-^m - 1) + \sum_{m=k+1}^{\infty} J_m(z)\,(\mu_-^m - 1) = \sum_{m=-\infty}^{-(k+1)} J_m(z)\,(\mu_+^m - 1) + \sum_{m=k+1}^{\infty} J_m(z)\,(\mu_+^m - 1). \tag{4.51}$$

Therefore we only need to treat the µ+ case.

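$$\begin{aligned}
\left|\sum_{m=[-\infty:-(k+1);\,(k+1):\infty]} J_m(z)\,(\mu_+^m - 1)\right| &\le 2\left|\sum_{m=k+1}^{\infty} J_m(z)\,(\mu_+^m - 1)\right| \\
&\le 2\sum_{m=k+1}^{\infty} |J_m(z)|\,|\mu_+^m - 1| \\
&\le 2\sum_{m=k+1}^{\infty} \frac{1}{|m|!}\left|\frac{z}{2}\right|^{|m|} |\mu_+^m - 1| \\
&\le 4\sum_{m=k+1}^{\infty} \frac{1}{|m|!}\left|\frac{z}{2}\right|^{|m|} m|\nu| \\
&< \frac{8|\nu|}{(k+1)!}\left|\frac{z}{2}\right|^{k+1}(k+2).
\end{aligned} \tag{4.52}$$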

Using this bound, we hence obtain from Eq. (4.50),

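$$\|V_\infty - V_k\| = \left|\sum_{m=-k}^{k} \alpha_m \mu_\pm^m - e^{iz\lambda/\Lambda}\right| \le O\!\left(\frac{\lambda}{k!\,\Lambda}\left(\frac{z}{2}\right)^{k+1}\right) = O\!\left(\frac{\|H\|\,(z/2)^{k+1}}{\Lambda\,k!}\right). \tag{4.53}$$

For the right-hand side to be upper bounded by $\varepsilon$, it suffices to choose some $k$ that is upper bounded as claimed.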
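The two ingredients of this proof, the normalisation $\sum_m J_m(z) = 1$ and the tail bound of Eq. (4.43), can be checked numerically. The following is a minimal sketch using only the Python standard library; the helper names `bessel_j`, `truncated_sum` and `tail_bound` are illustrative and do not appear in the text, and the Bessel functions are evaluated from their power series, which is adequate for the small $|z|$ used here.

```python
import math

def bessel_j(m: int, z: float, terms: int = 40) -> float:
    """Bessel function of the first kind J_m(z) for integer m, from its power series."""
    if m < 0:
        # Reflection formula J_{-m}(z) = (-1)^m J_m(z).
        return (-1) ** (-m) * bessel_j(-m, z, terms)
    return sum(
        (-1) ** l / (math.factorial(l) * math.factorial(m + l)) * (z / 2) ** (2 * l + m)
        for l in range(terms)
    )

def truncated_sum(k: int, z: float) -> float:
    """Normalisation factor C_k = sum_{m=-k}^{k} J_m(z) from Eq. (4.41)."""
    return sum(bessel_j(m, z) for m in range(-k, k + 1))

def tail_bound(k: int, z: float) -> float:
    """Right-hand side of Eq. (4.43): 4 |z/2|^{k+1} / (k+1)!."""
    return 4 * abs(z / 2) ** (k + 1) / math.factorial(k + 1)

z = -0.5
for k in range(1, 6):
    # Eq. (4.44) states that C_k >= 1 - tail_bound.
    print(k, truncated_sum(k, z), 1 - tail_bound(k, z))
```

The printed pairs show how quickly $C_k$ approaches 1 as $k$ grows, which is what makes the renormalisation error in Eq. (4.45) small.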

As we can see from the above discussion, we can hence use the operator $V_k$ to implement the time evolution according to $e^{-izH/\|H\|_1}$. We can also immediately see from this that $V_k$ is a linear combination of unitaries (LCU). We proceed now to implement this LCU by using some well-known results.

In the following, we provide technical lemmas for implementing linear combinations of unitaries. Suppose we are given the implementations of unitaries $U_0, U_1, \ldots, U_{m-1}$, and coefficients $\alpha_0, \alpha_1, \ldots, \alpha_{m-1}$. Then the operator

$$V = \sum_{j=0}^{m-1} \alpha_j U_j \tag{4.54}$$

can be implemented probabilistically by the technique called linear combination of unitaries (LCU) [125]. Provided $\sum_{j=0}^{m-1} |\alpha_j| \le 2$, $V$ can be implemented with success probability $1/4$. To achieve this, we define the multiplexed-$U$ operation, which is denoted by multi-$U$, as

$$\text{multi-}U\,|j\rangle|\psi\rangle = |j\rangle U_j|\psi\rangle. \tag{4.55}$$

The probabilistic implementation of V is summarised in the following lemma.

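Lemma 11. Let multi-$U$ be defined as above. If $\sum_{j=0}^{m-1} |\alpha_j| \le 2$, then there exists a quantum circuit that maps $|0\rangle|0\rangle|\psi\rangle$ to the state

$$\frac{1}{2}\,|0\rangle|0\rangle\left(\sum_{j=0}^{m-1} \alpha_j U_j|\psi\rangle\right) + \frac{\sqrt{3}}{2}\,|\Phi^\perp\rangle, \tag{4.56}$$

where $(|0\rangle\langle 0| \otimes |0\rangle\langle 0| \otimes I)|\Phi^\perp\rangle = 0$. Moreover, this quantum circuit uses $O(1)$ applications of multi-$U$ and $O(m)$ 1- and 2-qubit gates.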

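Proof. Let $s = \sum_{j=0}^{m-1} |\alpha_j|$. We first define the unitary operator $B$ to prepare the coefficients:

$$B|0\rangle|0\rangle = \left(\sqrt{\frac{s}{2}}\,|0\rangle + \sqrt{1 - \frac{s}{2}}\,|1\rangle\right) \otimes \frac{1}{\sqrt{s}}\sum_{j=0}^{m-1} \sqrt{\alpha_j}\,|j\rangle. \tag{4.57}$$

Define the unitary operator $W$ as $W = (B^H \otimes I)(I \otimes \text{multi-}U)(B \otimes I)$. We claim that $W$ performs the desired mapping, as

$$\begin{aligned}
W|0\rangle|0\rangle|\psi\rangle &= (B^H \otimes I)(I \otimes \text{multi-}U)(B \otimes I)|0\rangle|0\rangle|\psi\rangle \\
&= \frac{1}{\sqrt{2}}(B^H \otimes I)|0\rangle \sum_{j=0}^{m-1} \sqrt{\alpha_j}\,|j\rangle U_j|\psi\rangle + \sqrt{\frac{2-s}{2s}}\,(B^H \otimes I)|1\rangle \sum_{j=0}^{m-1} \sqrt{\alpha_j}\,|j\rangle U_j|\psi\rangle \\
&= \frac{1}{2}\,|0\rangle|0\rangle \sum_{j=0}^{m-1} \alpha_j U_j|\psi\rangle + \sqrt{\gamma}\,|\Phi^\perp\rangle,
\end{aligned} \tag{4.58}$$

where $|\Phi^\perp\rangle$ is a state satisfying $(|0\rangle\langle 0| \otimes |0\rangle\langle 0| \otimes I)|\Phi^\perp\rangle = 0$, and $\gamma$ is some normalisation factor. The number of applications of multi-$U$ is constant, as in the definition of $W$.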

To implement the unitary operator B, O(m) 1- and 2-qubit gates suffice.

Let $W$ be the quantum circuit in Lemma 11, and let $P$ be the projector defined as $P = |0\rangle\langle 0| \otimes |0\rangle\langle 0| \otimes I$. We have

$$PW|0\rangle|0\rangle|\psi\rangle = \frac{1}{2}\,|0\rangle|0\rangle \sum_{j=0}^{m-1} \alpha_j U_j|\psi\rangle. \tag{4.59}$$

If $\sum_{j=0}^{m-1} \alpha_j U_j$ is a unitary operator, one application of the oblivious amplitude amplification operator $-W(I - 2P)W^H(I - 2P)W$ implements $\sum_{j=0}^{m-1} \alpha_j U_j$ with certainty [82]. However, in our application, the unitary operator $\widetilde{W}$ implements an approximation of $V_\infty$ in the sense that

$$P\widetilde{W}|0\rangle|0\rangle|\psi\rangle = \frac{1}{2}\,|0\rangle|0\rangle V_k|\psi\rangle, \tag{4.60}$$

with $\|V_k - V_\infty\| \le \varepsilon$. The following lemma shows that the error caused by the oblivious amplitude amplification is bounded by $O(\varepsilon)$.

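Lemma 12. Let the projector $P$ be defined as above. If a unitary operator $\widetilde{W}$ satisfies $P\widetilde{W}|0\rangle|0\rangle|\psi\rangle = \frac{1}{2}|0\rangle|0\rangle\widetilde{V}|\psi\rangle$, where $\|\widetilde{V} - V\| \le \varepsilon$, then

$$\left\| -\widetilde{W}(I - 2P)\widetilde{W}^H(I - 2P)\widetilde{W}|0\rangle|0\rangle|\psi\rangle - |0\rangle|0\rangle V|\psi\rangle \right\| = O(\varepsilon).$$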

Proof. We have

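$$\begin{aligned}
-\widetilde{W}(I - 2P)\widetilde{W}^H(I - 2P)\widetilde{W}|0\rangle|0\rangle|\psi\rangle
&= (\widetilde{W} + 2P\widetilde{W} - 4\widetilde{W}P\widetilde{W}^H P\widetilde{W})|0\rangle|0\rangle|\psi\rangle \\
&= (\widetilde{W} + 2P\widetilde{W} - 4\widetilde{W}P\widetilde{W}^H P P\widetilde{W}P)|0\rangle|0\rangle|\psi\rangle \\
&= \widetilde{W}|0\rangle|0\rangle|\psi\rangle + |0\rangle|0\rangle\widetilde{V}|\psi\rangle - \widetilde{W}\left(|0\rangle|0\rangle\widetilde{V}^H\widetilde{V}|\psi\rangle\right).
\end{aligned} \tag{4.61}$$

Because $\|\widetilde{V} - V\| \le \varepsilon$ and $V$ is a unitary operator, we have $\|\widetilde{V}^H\widetilde{V} - I\| = O(\varepsilon)$. Therefore, we have

$$\left\| -\widetilde{W}(I - 2P)\widetilde{W}^H(I - 2P)\widetilde{W}|0\rangle|0\rangle|\psi\rangle - |0\rangle|0\rangle\widetilde{V}|\psi\rangle \right\| = O(\varepsilon). \tag{4.62}$$

Thus

$$\left\| -\widetilde{W}(I - 2P)\widetilde{W}^H(I - 2P)\widetilde{W}|0\rangle|0\rangle|\psi\rangle - |0\rangle|0\rangle V|\psi\rangle \right\| = O(\varepsilon). \tag{4.63}$$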
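The exact case of oblivious amplitude amplification (success amplitude exactly $1/2$) can be verified on a deliberately small toy instance. The sketch below is an illustration under simplifying assumptions, not the circuit of Lemma 11: the system register is trivial (so $V = 1$), the two ancilla registers are collapsed into a single two-level ancilla, and $W$ is a real rotation chosen so that $W|0\rangle = \frac{1}{2}|0\rangle + \frac{\sqrt{3}}{2}|1\rangle$. The names `matvec`, `transpose` and `amplify` are illustrative.

```python
import math

def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m[i][j] * v[j] for j in range(len(v))) for i in range(len(m))]

def transpose(m):
    """Transpose of a real matrix; equals the Hermitian conjugate here."""
    return [list(row) for row in zip(*m)]

S3 = math.sqrt(3) / 2
W = [[0.5, -S3], [S3, 0.5]]      # toy W with W|0> = (1/2)|0> + (sqrt(3)/2)|1>
R = [[-1.0, 0.0], [0.0, 1.0]]    # reflection I - 2P with P = |0><0|

def amplify(state):
    """Apply -W (I - 2P) W^H (I - 2P) W to the given state."""
    out = matvec(W, state)
    out = matvec(R, out)
    out = matvec(transpose(W), out)
    out = matvec(R, out)
    out = matvec(W, out)
    return [-x for x in out]

print(matvec(W, [1.0, 0.0]))   # amplitude 1/2 on the flagged ancilla state
print(amplify([1.0, 0.0]))     # after amplification: all weight on |0>
```

In this toy model the amplified state is $|0\rangle$ up to rounding, matching the statement that a single amplification step succeeds with certainty when the initial amplitude is exactly $1/2$.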

Now we are ready to prove Theorem 6.

Proof of Theorem 6. The proof we outline here follows closely the proof given in [90]. The intuition of this algorithm is to divide the simulation into $O(t\|H\|_1)$ segments, with each segment simulating $e^{-iH/(2\|H\|_1)}$. To implement each segment, we use the LCU technique to implement $V_k$ defined in Eq. (4.37), with coefficients $\alpha_m = J_m(z)/\sum_{j=-k}^{k} J_j(z)$. When $z = -1/2$, we have $\sum_{j=-k}^{k} |\alpha_j| < 2$. Actually, this holds for all $|z| \le 1/2$ because

$$\begin{aligned}
\sum_{j=-k}^{k} |\alpha_j| &\le \sum_{j=-k}^{k} \frac{|J_j(z)|}{1 - \frac{4|z/2|^{k+1}}{(k+1)!}} \le \left(1 - \frac{4|z/2|^{k+1}}{(k+1)!}\right)^{-1} \sum_{j=-k}^{k} \frac{|z/2|^{|j|}}{|j|!} \\
&< \frac{8}{7} + \frac{16}{7}\sum_{j=1}^{\infty} \frac{1}{4^j\,j!} = \frac{8}{7} + \frac{16}{7}\left(\sqrt[4]{e} - 1\right) < 2,
\end{aligned} \tag{4.64}$$

where the first inequality follows from the fact that $\sum_{j=-k}^{k} J_j(z) \ge 1 - 4|z/2|^{k+1}/(k+1)!$ (see [90]), the second inequality follows from the fact that $|J_m(z)| \le |z/2|^{|m|}/|m|!$ (see [124, (9.1.5)]), and the third inequality uses the assumption that $|z| \le 1/2$. Now, Lemmas 11 and 12 can be applied. By Eq. (4.35) and using Lemma 10, set

$$k = O\!\left(\frac{\log(\|H\|/(\|H\|_1\,\varepsilon'))}{\log\log(\|H\|/(\|H\|_1\,\varepsilon'))}\right), \tag{4.65}$$

and we obtain a segment that simulates $e^{-iH/(2\|H\|_1)}$ with error bounded by $O(\varepsilon')$. Repeat the segment $O(t\|H\|_1)$ times with error $\varepsilon' = \varepsilon/(t\|H\|_1)$, and we obtain a simulation of $e^{-iHt}$ with error bounded by $\varepsilon$. It suffices to take

$$k = O\!\left(\frac{\log(t\|H\|/\varepsilon)}{\log\log(t\|H\|/\varepsilon)}\right). \tag{4.66}$$

By Lemma 11, each segment can be implemented by $O(1)$ applications of multi-$U$ and $O(k)$ 1- and 2-qubit gates, as well as the cost for computing the coefficients $\alpha_m$ for $m \in \{-k, \ldots, k\}$. The cost for each multi-$U$ is $k$ times the cost for implementing the quantum walk $U$. By Lemma 7, the state in Eq. (4.25) can be prepared with time complexity (circuit depth) $O(n^2 b^{5/2})$, where $b$ is the number of bits of precision. To achieve the overall error bound $\varepsilon$, we choose $b = O(\log(t\|H\|_1/\varepsilon))$. Hence the time complexity for the state preparation is $O(n^2 \log^{5/2}(t\|H\|_1/\varepsilon))$, which is also the time complexity for applying the quantum walk $U$. Therefore, the time complexity for one segment is

$$O\!\left(n^2 \log^{5/2}(t\|H\|_1/\varepsilon)\,\frac{\log(t\|H\|/\varepsilon)}{\log\log(t\|H\|/\varepsilon)}\right). \tag{4.67}$$

Considering $O(t\|H\|_1)$ segments, the time complexity is as claimed.

Note that the coefficients $\alpha_{-k}, \ldots, \alpha_k$ (for $k$ defined in Eq. (4.65)) in Lemma 11 can be classically computed using the methods in [126, 127], and the cost is $O(k)$ times the number of bits of precision, which is $O(\log(t\|H\|/\varepsilon))$. This is no larger than the quantum time complexity.
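As a rough illustration of this classical pre-computation, the sketch below evaluates the coefficients $\alpha_m = J_m(z)/C_k$ for $z = -1/2$ and checks the condition $\sum_j |\alpha_j| < 2$ required by Lemma 11. It is not the method of [126, 127]; it simply sums the Bessel power series in double precision, and the names `bessel_j` and `lcu_coefficients` are illustrative.

```python
import math

def bessel_j(m: int, z: float, terms: int = 40) -> float:
    """Bessel function of the first kind J_m(z) for integer m, from its power series."""
    if m < 0:
        # Reflection formula J_{-m}(z) = (-1)^m J_m(z).
        return (-1) ** (-m) * bessel_j(-m, z, terms)
    return sum(
        (-1) ** l / (math.factorial(l) * math.factorial(m + l)) * (z / 2) ** (2 * l + m)
        for l in range(terms)
    )

def lcu_coefficients(k: int, z: float) -> dict:
    """Normalised weights alpha_m = J_m(z) / C_k for m = -k, ..., k."""
    values = {m: bessel_j(m, z) for m in range(-k, k + 1)}
    c_k = sum(values.values())
    return {m: v / c_k for m, v in values.items()}

alphas = lcu_coefficients(k=4, z=-0.5)
l1_norm = sum(abs(a) for a in alphas.values())
print(l1_norm)   # must stay below 2 for Lemma 11 to apply
```

The weights sum to one by construction, while their absolute values sum to a constant below 2, which is the quantity bounded in Eq. (4.64).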

Discussion. We have hence seen that we can design a quantum Hamiltonian simulation algorithm which has time complexity $\widetilde{O}(\sqrt{N})$ even for non-sparse Hamiltonians. The algorithm relies heavily on access to a seemingly powerful input model. It is questionable whether such a data structure can be implemented physically at all, owing to the exponential amount of quantum resources it requires [43, 72, 5], and this requirement might be increased further by a potentially strong requirement on the error rate per gate of $O(1/\mathrm{poly}(N))$ needed to retain a feasible error rate for applications [75]. For us, however, there is an even more important question: How fast can classical algorithms be if they are given a similarly powerful data structure? We investigate this question in the next section, where we use the so-called Nyström approach to simulate a Hamiltonian on a classical computer.

4.4.5 Hamiltonian Simulation with the Nyström method

In this section, we derive a classical, randomised algorithm for the strong simulation of quantum Hamiltonian dynamics, based on the Nyström method that we introduced in Section 4.4.2 (Quantum-inspired classical algorithms for Hamiltonian simulation). In particular, we prove the results in Theorem 8 and Corollary 9. For this, we develop an algorithm for so-called strong quantum simulation, where our objective is to compute the amplitude of a particular outcome, and hence the entries of the final state after the evolution for time $t$. This is in contrast to weak simulation, where we require the algorithm only to be able to sample from the output distribution of a quantum circuit. Informally, in the first case we can query the algorithm with an index $i$, and the algorithm returns $\psi[i]$ (or more precisely the projection of the state vector onto the $i$-th element of the computational basis), while in the second case we can only obtain the basis state $i$ with probability $|\psi[i]|^2$.

Strong simulation is known to have unconditional and exponential lower bounds [128], and is therefore in general hard for both classical and quantum computers. On the other hand, weak simulation can be performed efficiently by a quantum computer (for circuits of polynomial size). From the perspective of computational complexity, we are therefore attempting to solve a stronger problem than the Hamiltonian simulation that the quantum algorithm performs, since the latter can only sample from the probability distribution induced by measurements on the output state. Surprisingly, we are still able to find cases in which the time evolution can be simulated efficiently. Our algorithm is able to efficiently perform the Hamiltonian simulation and output the requested amplitude (cf. Theorem 8) if we grant it access to a memory structure which fulfills the following requirements.
Note that these are in principle more general than the assumption of the classical memory structure from Definition 3, but that memory structure immediately fulfills these requirements. On the other hand, we also require input and row sparsity in order to perform the simulation efficiently (i.e., in $O(\log(N))$ time for an $n$-qubit system with dimension $N = 2^n$). We next discuss these requirements in more detail.
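The difference between the two access models discussed in this section can be made concrete with a small sketch. It assumes the state vector is stored explicitly, which is only possible for tiny systems and is not how the algorithm of this section operates; the function names `strong_query` and `weak_sample` are illustrative.

```python
import random

def strong_query(psi, i):
    """Strong simulation access: return the amplitude psi[i] itself."""
    return psi[i]

def weak_sample(psi, rng):
    """Weak simulation access: return a basis index i with probability |psi[i]|^2."""
    probabilities = [abs(a) ** 2 for a in psi]
    return rng.choices(range(len(psi)), weights=probabilities, k=1)[0]

# Normalised two-qubit example state: |0.6|^2 + |0.8i|^2 = 1.
psi = [0.6, 0.0, 0.8j, 0.0]
rng = random.Random(0)

print(strong_query(psi, 2))                      # the complex amplitude, including its phase
samples = [weak_sample(psi, rng) for _ in range(1000)]
print(samples.count(0), samples.count(2))        # only outcome frequencies are visible
```

A strong query exposes the phase of the amplitude, whereas weak sampling only reveals outcome frequencies, which is why strong simulation is the harder task.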

Input Requirements We assume throughout the chapter that the input matrix and state are sparse, i.e., we assume that every row of $H$ has at most $s$ non-zero entries, and that the input state $\psi$ has at most $q$ non-zero entries. In order for the algorithm to work, we then require the following assumptions:

1. We require $H$ to be row-computable, i.e., there exists an efficient classical algorithm that, given a row index $i$, outputs a list of the non-zero entries (and their indices) of the row. If we do not have access to a memory structure such as the one described in Def. 3, or to a structured Hamiltonian, then this condition is in general only fulfilled if every row of $H$ has at most $s = O(\mathrm{polylog}(N))$ non-zero entries.

2. We require that the entries of the initial state $\psi$ are row-computable, i.e., there exists an efficient classical algorithm that outputs a list of the non-zero entries in the same way as above. In order for this to be generally valid, we require the state to have at most $q = O(\mathrm{polylog}(N))$ non-zero entries. However, we can also use the data structure given in Def. 3 or a structured input.

3. We require $H$ to be efficiently row-searchable. This condition informally states that we can efficiently sample indices of the rows of $H$ at random, with probability proportional to the norm of the row (general case) or the diagonal element of the row (positive semidefinite case). This is fundamentally equivalent to the sample-and-query access discussed earlier, and indeed, the data structure allows us to immediately perform this operation as well. We will show this relationship later on.

While the notion of row-searchability is commonly assumed to hold in the randomised numerical linear algebra literature (e.g., [129, Section 4]), in the context of quantum systems this is not given in general, since we are dealing with exponentially sized matrices. It is of course reasonable to assume that for a polynomially sized matrix we are indeed able to evaluate all the row norms efficiently in a time (number of steps) proportional to the number of non-zero entries.

In order to obtain an efficient algorithm (i.e., one which only depends logarithmically on the dimension $N = 2^n$ for an $n$-qubit system), we require sparsity of the Hamiltonian $H$ and the input state $\psi$. In general we therefore require $s = q = O(\mathrm{polylog}(N))$ for the non-zero entries in order for the algorithm to be efficient.

High-level description of the algorithm The algorithm proceeds by performing a two-step approximation. First, we approximate the Hamiltonian $H$ in terms of a low-rank operator $\widehat{H}$ which is small, and therefore more amenable to the computations which we perform next. We obtain $\widehat{H}$ by sampling the rows with a probability proportional to the row norm, and hence rely on a Nyström scheme. Second, we approximate the time evolution of the input state $e^{i\widehat{H}t}\psi$ via a truncated Taylor expansion of the matrix exponential. This can be performed efficiently since we can use the small operator $\widehat{H}$ and the spectral properties of the truncated exponential. In addition, we separately consider the case of a generic Hamiltonian $H$ and the restricted case of positive semidefinite Hamiltonians. As is the case in general, for the more restricted case of the PSD Hamiltonian we are able to derive tighter bounds. We will briefly discuss the sampling scheme to obtain $\widehat{H}$ to give the reader an idea of the overall procedure before going into the detailed proofs.

In both cases, our algorithm leverages a low-rank approximation of the Hamiltonian $H$ to efficiently approximate the matrix exponential $e^{iHt}$. The approximation is obtained by randomly sampling $M = O(\mathrm{polylog}(N))$ rows according to the magnitude of the row norms, and then collating them as the columns of a matrix $A \in \mathbb{C}^{N \times M}$ (for Hermitian $H$, the sampled rows and the corresponding columns carry the same information). Let $B \in \mathbb{C}^{M \times M}$ be the matrix obtained by selecting the rows of $A$ whose indices correspond to those $M$ sampled indices.

For SPD $H$, we then use an approximation of the form $\widehat{H} = AB^+A^H$, where $B^+$ is the pseudoinverse of the SPD matrix $B$.
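To build some intuition for the approximation $\widehat{H} = AB^+A^H$, consider the simplest possible case: a rank-one positive semidefinite $H$ and a single sampled index ($M = 1$). The sketch below, in plain Python, takes $A$ to be the $N \times 1$ matrix holding the sampled column of $H$ and $B$ the corresponding $1 \times 1$ diagonal block; in this special case the Nyström approximation reproduces $H$ exactly. The function name `nystrom_rank_one` and the example matrix are illustrative assumptions.

```python
def nystrom_rank_one(H, t):
    """Nystrom approximation A B^+ A^H of H built from the single sampled index t."""
    n = len(H)
    a = [H[i][t] for i in range(n)]          # A: column t of H, an n x 1 matrix
    b = H[t][t]                              # B: the 1 x 1 block H[t][t]
    b_pinv = 1.0 / b if b != 0 else 0.0      # pseudoinverse of a scalar
    return [[a[i] * b_pinv * a[j].conjugate() for j in range(n)] for i in range(n)]

# Rank-one PSD example H = s s^T with s = (1, 2, 3).
s = [1.0, 2.0, 3.0]
H = [[si * sj for sj in s] for si in s]

H_hat = nystrom_rank_one(H, t=1)
print(H_hat)   # coincides with H because H has rank one
```

For matrices of higher rank the approximation is no longer exact, and its error is what Lemmas 14 and 15 control in terms of the number of samples $M$.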

Next, we perform the time evolution $e^{i\widehat{H}t}\psi$ by truncating the Taylor series expansion of the matrix exponential function after the $K$-th order. In order to apply the individual terms in the truncated Taylor series, we then make use of the structure of $\widehat{H}$, and formulate the operator only in terms of linear operations involving the matrices $A^HA$, $B^+$ and $B$ and the vector $A^H\psi$. Under the above assumptions (Paragraph 4.4.5, Input Requirements), and for $s = q = O(\mathrm{polylog}(N))$, all these operations can be performed efficiently.

In the general Hermitian setting, we form $A$ by first sampling the rows of $H$ and then rescaling the sampled rows according to their sampling probability. Unlike in the case of SPD $H$, we use the approximation $\widehat{H}^2 := AA^H$ to approximate $H^2$. This approximation is useful here, since we decompose the matrix exponential into two auxiliary functions, and then for each of these evaluate its truncated Taylor series expansion. By doing so, we can formulate the final approximation solely in terms of linear operations involving $A^HA$ and $A^H\psi$. These operations can then again be performed efficiently under the initial assumptions on $H$ and $\psi$.
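The rescaling by the sampling probability is what makes $AA^H$ an unbiased estimator of $H^2$. The sketch below checks this for a single sample ($M = 1$) by summing over all possible outcomes instead of drawing random indices. It uses the columns of $H$ as the sampled vectors, which for Hermitian $H$ are the conjugate transposes of the rows and have the same norms; the function names `row_probabilities`, `expected_sketch` and `matmul` are illustrative.

```python
import math

def row_probabilities(H):
    """Sampling probabilities p(i) = ||h_i||^2 / ||H||_F^2."""
    norms = [sum(abs(x) ** 2 for x in row) for row in H]
    total = sum(norms)
    return [v / total for v in norms]

def expected_sketch(H):
    """Exact expectation of A A^H for M = 1, summed over every possible sampled index."""
    n = len(H)
    p = row_probabilities(H)
    acc = [[0j] * n for _ in range(n)]
    for t in range(n):
        if p[t] == 0.0:
            continue  # indices with zero weight are never sampled
        a = [H[i][t] / math.sqrt(p[t]) for i in range(n)]   # rescaled column t
        for i in range(n):
            for j in range(n):
                acc[i][j] += p[t] * a[i] * a[j].conjugate()
    return acc

def matmul(X, Y):
    """Product of two square matrices given as lists of rows."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)] for i in range(n)]

H = [[1.0, 1j], [-1j, 2.0]]      # small Hermitian example
print(expected_sketch(H))
print(matmul(H, H))              # the two outputs agree entry by entry
```

With $M > 1$ independent samples the expectation is unchanged and only the variance of the estimator decreases.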

Row-searchability implies efficient row-sampling As we see in the assumptions, all our algorithms require the Hamiltonian to be row-searchable. In this section we describe an efficient algorithm for sampling rows of a row-searchable Hamiltonian according to some probability distribution. Let $n \in \mathbb{N}$. We first introduce a binary tree of subsets spanning $\{0,1\}^n$. In the following, with abuse of notation, we identify binary tuples with the associated binary number. Let $L$ be a binary string with $|L| \le n$, where $|L|$ denotes the length of the string. We denote with $S(L)$ the set

$$S(L) = \{L\} \times \{0,1\}^{n-|L|} = \{(L_1, \ldots, L_{|L|}, v_1, \ldots, v_{n-|L|}) \mid v_1, \ldots, v_{n-|L|} \in \{0,1\}\}. \tag{4.68}$$

We are now ready to state the row-searchability property for a matrix H.

Definition 4 (Row-searchability). Let $H$ be a Hermitian matrix of dimension $2^n$, for $n \in \mathbb{N}$. $H$ is row-searchable if, for any binary string $L$ with $|L| \le n$, it is possible to compute the following quantity in time $O(\mathrm{poly}(n))$:

$$w(S(L)) = \sum_{i \in S(L)} h(i, H_{:,i}), \tag{4.69}$$

where $h$ is the function computing the weight associated to the $i$-th column $H_{:,i}$. For positive semidefinite $H$ we use $h(i, H_{:,i}) = H_{i,i}$, i.e., the diagonal element $i$, while for general Hermitian $H$ we use $h(i, H_{:,i}) = \|H_{:,i}\|^2$.

Row-searchability intuitively works as follows: We work with a binary tree corresponding to the Hamiltonian $H$. The leaves of the tree contain the individual probabilities according to which we want to sample from $H$. The parents at each level are then given (and computed) by the marginals over their children nodes, i.e., the sum over the probabilities of the children (assuming discrete probabilities). Using this tree, we can, for a randomly sampled number in $[0,1]$, traverse the levels of the tree in $\log(N)$ time to find the leaf node that is sampled, i.e., the index of the column of $H$. More specifically, row-searchability then requires the evaluation of $w(S(L))$ as defined in Eq. (4.69), which computes marginals of the diagonal of $H$, or of the norms $\|H_{:,i}\|^2$ in the general case, where the co-elements, i.e., the elements over which we are not summing, are defined by the tuple $L$. Hence, for empty $L$, $w(S(L)) = \mathrm{Tr}(H)$ in the positive semidefinite case. Note that this assumption is closely related to the data structure given in Definition 3, and indeed we can use this data structure to immediately perform these operations.

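Algorithm 3 MATLAB code for the sampling algorithm

Input: wS(L) corresponds to the function w(S(L)) defined in Eq. (4.69).
Output: L is the sampled row index (as a binary string).

    L = [];
    q = rand()*wS(L);        % uniform in [0, T], with T = wS([]) the total weight
    for i=1:n
        w0 = wS([L 0]);      % weight of the subtree whose next bit is 0
        if q < w0
            L = [L 0];       % q falls into the 0-subtree
        else
            L = [L 1];       % q falls into the 1-subtree
            q = q - w0;      % discard the weight of the skipped 0-subtree
        end
    end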

Note that the function $h$ that we described above is indeed related to leverage score sampling, which is a widely used sampling process in randomised numerical linear algebra [?, 69]. Leverage scores allow us to efficiently obtain sampling probabilities which have a sufficiently low variance to obtain fast algorithms with low error.

Alg. 3 describes an algorithm that, given a row-searchable $H$, is able to sample an index with probability $p(j) = h(j, H_{:,j})/w(\{0,1\}^n)$. Let $q$ be a random number uniformly sampled in $[0,T]$, where $T = w(\{0,1\}^n)$ is the sum of the weights associated to all the rows. The algorithm uses logarithmic search, starting with $L$ empty and iteratively appending 1 or 0, to find the index $L$ such that $w(\{0, \ldots, L-1\}) \le q \le w(\{0, \ldots, L\})$. The total time required to compute one index is $O(nQ(n))$, where $Q(n)$ is the maximum time required to compute $w(S(L))$ for $L \in \{0,1\}^n$. Note that if $w(S(L))$ can be computed efficiently for any $L \in \{0,1\}^n$, then $Q(n)$ is polynomial and the cost of the sampling procedure will be polynomial.
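The tree descent described here can also be sketched in Python. The version below is an illustration under simplifying assumptions: the weights are stored in an explicit list, so the prefix oracle costs $O(N)$ rather than the $\mathrm{poly}(n)$ that row-searchability demands, indices are zero-based, and the random number `q` is passed in explicitly so that the behaviour is deterministic. The names `make_prefix_weight` and `sample_index` are illustrative.

```python
def make_prefix_weight(weights, n):
    """Return w(prefix) = total weight of all n-bit indices that start with `prefix`."""
    def w(prefix):
        free_bits = n - len(prefix)
        start = 0
        for bit in prefix:
            start = 2 * start + bit
        start <<= free_bits
        return sum(weights[start:start + (1 << free_bits)])
    return w

def sample_index(w, n, q):
    """Descend the binary tree: go left if q lies in the 0-subtree, else go right."""
    prefix = ()
    for _ in range(n):
        w0 = w(prefix + (0,))
        if q < w0:
            prefix += (0,)
        else:
            prefix += (1,)
            q -= w0          # remove the weight of the skipped 0-subtree
    index = 0
    for bit in prefix:
        index = 2 * index + bit
    return index

weights = [1.0, 2.0, 3.0, 4.0]          # e.g. the diagonal of a PSD matrix, n = 2
w = make_prefix_weight(weights, n=2)
total = w(())                            # total weight T = w({0,1}^n)
print([sample_index(w, 2, q) for q in (0.5, 1.5, 3.5, 6.5)])
```

Drawing `q` uniformly from $[0, T)$ then returns index $j$ with probability proportional to its weight, using $n$ oracle calls per sample.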

Remark 1 (Row-searchability more general than sparsity). Note that if $H$ has a polynomial number of non-zero elements, then $w(S(L))$ can always be computed in polynomial time. Indeed, given $L$, we go through the list of elements describing $H$, select only the ones whose row index starts with $L$, and then compute $w(S(L))$, with both steps requiring polynomial time. However, $w(S(L))$ can be computed efficiently even for Hamiltonians that are not polynomially sparse. For example, take the diagonal Hamiltonian defined by $H_{ii} = 1/i$ for $i \in [2^n]$. This $H$ is not polynomially sparse, and in particular it has an exponential number of non-zero elements, but still $w(S(L))$ can be computed in polynomial time, here in particular in $O(1)$. As it turns out, the data structure from Definition 3, which was first proposed by Tang [17] to do a similar sampling process efficiently, also allows us to perform this operation efficiently, with the difference that this then holds true for arbitrary Hamiltonians.

Algorithm for PSD row-searchable Hermitian matrices Given a $2^n \times 2^n$ (i.e., $N \times N$) matrix $H \succeq 0$, the algorithm should output an approximation for the state given by the time evolution

$$\psi(t) = \exp(iHt)\,\psi. \tag{4.70}$$

The algorithm will do this through an expression of the form $\widehat{\exp}(i\widehat{H}t)\psi$, where $\widehat{\exp}$ and $\widehat{H}$ are an approximation of the exponential function and the low-rank approximation of $H$, respectively.

The first algorithm we describe here applies to $H \succeq 0$. We will then generalise this result in the following section to arbitrary Hermitian $H$. All our results hold under the row-searchability condition, i.e., if the condition of Definition 4 is fulfilled.

Let $h$ be the diagonal of the positive semidefinite $H$ and let $t_1, \ldots, t_M$, with $M \in \mathbb{N}$, be indices i.i.d. sampled with repetition from $\{1, \ldots, 2^n\}$ according to the probabilities

$$p(q) = h_q \Big/ \sum_i h_i, \tag{4.71}$$

e.g. via Alg. 3. Then, let $B \in \mathbb{C}^{M \times M}$ where $B_{i,j} = H_{t_i,t_j}$, for $1 \le i,j \le M$. Let furthermore $A \in \mathbb{C}^{2^n \times M}$ be the matrix with $A_{i,j} = H_{i,t_j}$ for $1 \le i \le 2^n$ and $1 \le j \le M$. We then define the approximation for $H$ by $\widehat{H} = AB^+A^H$, where $(\cdot)^+$ denotes again the pseudoinverse. We now define a function $g(x) = (e^{itx} - 1)/x$. Rearranging, we immediately have $e^{itx} = 1 + g(x)x$, and note that $g$ is an analytic function, i.e., it has a series expansion

$$g(x) = \sum_{k \ge 1} \frac{(it)^k}{k!}\,x^{k-1}.$$

Then

$$e^{i\widehat{H}t} = I + g(\widehat{H})\widehat{H} = I + g(AB^+A^H)\,AB^+A^H = I + A\,g(B^+A^HA)\,B^+A^H, \tag{4.72}$$

where the last step is due to the fact that for any analytic function $q(x) = \sum_{k \ge 0} \alpha_k x^k$, it holds that

$$q(AB^+A^H)\,AB^+A^H = \sum_{k \ge 0} \alpha_k (AB^+A^H)^k AB^+A^H = A \sum_{k \ge 0} \alpha_k (B^+A^HA)^k B^+A^H = A\,q(B^+A^HA)\,B^+A^H.$$

By writing $D = B^+A^HA$, the algorithm then performs the operation

$$\widehat{\psi}_M(t) = \psi + A\,g(D)\,B^+A^H\psi.$$

Next, in order to make the computation feasible, we truncate the series expansion after a finite number of terms. To do this, we approximate $g$ with $g_K(x)$, which limits the series defining $g$ to the first $K$ terms, for $K \in \mathbb{N}$. Moreover, we can evaluate $g_K(D)B^+(A^H\psi)$ in an iterative fashion, and for this choose

$$b_j = \frac{(it)^{K-j}}{(K-j)!}\,v + D\,b_{j-1}, \qquad v = B^+(A^H\psi),$$

where $b_0 = \frac{(it)^K}{K!}\,v$ and so $b_{K-1} = g_K(D)B^+A^H\psi$. Then, the new approximate state is given by

$$\widehat{\psi}_{K,M}(t) = \psi + A\,b_{K-1}. \tag{4.73}$$
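The recursion for $b_j$ is a Horner scheme for the truncated series $g_K$. The sketch below checks it in the scalar case, where $D$ and $v$ are plain numbers, so that the identity $e^{itx} = 1 + g(x)x$ can be tested directly; the matrix case replaces the scalar products by matrix-vector products. The function name `g_k_horner` and the chosen values of $x$, $t$ and $K$ are illustrative.

```python
import cmath
import math

def g_k_horner(d: complex, v: complex, t: float, K: int) -> complex:
    """Evaluate g_K(d) * v = sum_{k=1}^{K} (it)^k / k! * d^(k-1) * v by the recursion for b_j."""
    b = (1j * t) ** K / math.factorial(K) * v           # b_0
    for j in range(1, K):
        # b_j = (it)^(K-j) / (K-j)! * v + d * b_{j-1}
        b = (1j * t) ** (K - j) / math.factorial(K - j) * v + d * b
    return b                                             # b_{K-1}

x, t, K = 0.7, 1.3, 25
approx = 1 + x * g_k_horner(x, 1.0, t, K)   # 1 + x * g_K(x)
exact = cmath.exp(1j * t * x)               # e^{itx}
print(abs(approx - exact))                  # truncation error, tiny for this K
```

Each step of the recursion costs one application of $D$, which is why the matrix version contributes the $KM^2$ term to the running time.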

We summarise the algorithm in the form of a MATLAB implementation in Alg. 4.

Next, we analyse the cost of this algorithm. Let the row sparsity $s$ of $H$ be of order $\mathrm{poly}(n)$. Then the total cost of applying this operator is given by $O(M^2\,\mathrm{poly}(n) + KM^2 + M^3)$ time complexity, where the terms $M^3$ and $M^2\,\mathrm{poly}(n)$ result from the calculation of $D$ and the inverse. To compute the total cost in terms of space/memory, note that we do not have to save $H$ or $A$ in memory, but only $B$, $D$ and the vectors $v$, $b_j$, which requires a total cost of $O(M^2)$. Indeed $D$ can be computed in the following way: Assuming, without loss of generality, that $2^n/M \in \mathbb{N}$, then

$$D = B^{-1} \sum_{i=1}^{2^n/M} A^H_{M(i-1)+1:Mi}\,A_{M(i-1)+1:Mi},$$

where $A_{a:b}$ is the submatrix of $A$ containing the rows from $a$ to $b$. A similar reasoning holds for the computation of the vector $v$. In the above computation we have assumed that we can efficiently sample from the matrix $H$ according to the probabilities in Eq. (4.71). In order to do this, we rely on the sampling algorithm which is summarised in Algorithm 3.
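The memory argument rests on the fact that $A^HA$ can be accumulated block by block without ever holding all of $A$. A minimal sketch of this accumulation in plain Python follows; here $A$ is given as an explicit list of rows purely so that the result can be inspected, whereas in the algorithm each block would be generated on demand. The function name `gram_blockwise` is illustrative.

```python
def gram_blockwise(A, block):
    """Accumulate the M x M Gram matrix A^H A from consecutive blocks of `block` rows."""
    m = len(A[0])
    G = [[0j] * m for _ in range(m)]
    for start in range(0, len(A), block):
        E = A[start:start + block]          # only this block needs to be in memory
        for row in E:
            for i in range(m):
                for j in range(m):
                    G[i][j] += row[i].conjugate() * row[j]
    return G

A = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]   # 4 x 2 example, M = 2
print(gram_blockwise(A, block=2))
```

The accumulator has size $M \times M$ regardless of the number of rows, in line with the $O(M^2)$ memory cost stated in the text.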

Algorithm 4 MATLAB code for approximating Hamiltonian dynamics when H is PSD

Input: M, T = t1,...,tM list of indices computed via Alg. 3. The function compute_H_subMatrix, given two lists of indices, computes the associated submatrix of H. compute_psi_subVector, given a list of indices, computes the associated subvector of ψ.
Output: vector b, s.t. ψ + A*b approximates e^{iHt}ψ, cf. Eq. (4.73).

    B = compute_H_subMatrix(T, T);
    D = zeros(M,M);
    v = zeros(M,1);
    for i=1:(2^n/M)
        E = compute_H_subMatrix((i-1)*M+1:M*i, T);          % block of rows of A
        D = D + E'*E;                                        % accumulate A^H A
        v = v + E'*compute_psi_subVector((i-1)*M+1:M*i);     % accumulate A^H psi
    end
    D = B\D;    % D = B^+ A^H A (B is assumed to be invertible here)
    v = B\v;    % v = B^+ A^H psi, as in the definition of v
    b = zeros(M,1);
    for j=1:K
        % Horner step for g_K: the coefficient of D^(k-1) is (it)^k/k!, with k = K-j+1
        b = (1i*t)^(K-j+1)/factorial(K-j+1) * v + D*b;
    end

We next analyse the errors and complexity of the algorithm in more detail, i.e., we derive bounds on $K$ and $M$ for a concrete approximation error $\varepsilon$. The results are summarised in the following theorem, which holds for the SPD case of $H$.

Theorem 10 (Algorithm for simulating PSD row-searchable Hermitian matrices).

Let $\varepsilon, \delta \in (0,1]$, let $K, M \in \mathbb{N}$ and $t > 0$. Let $H$ be positive semidefinite, where $K$ is the number of terms in the truncated series expansion of $g(\widehat{H})$ and $M$ the number of samples we take for the approximation. Let $\psi(t)$ be the true evolution (Eq. 4.70) and let $\widehat{\psi}_{K,M}(t)$ be the output of our Alg. 4 (Eq. 4.73). When

$$K \ge e\,t\,\|H\| + \log\frac{2}{\varepsilon}, \qquad M \ge \max\left(405\,\mathrm{Tr}(H),\ \frac{72\,\mathrm{Tr}(H)\,t}{\varepsilon}\log\frac{36\,\mathrm{Tr}(H)\,t}{\varepsilon\delta}\right), \tag{4.74}$$

then the following holds with probability $1 - \delta$,

$$\|\psi(t) - \widehat{\psi}_{K,M}(t)\| \le \varepsilon.$$

Note that with the result above, we have that $\widehat{\psi}_{K,M}(t)$ in Eq. (4.73) (Alg. 4) approximates $\psi(t)$, with error at most $\varepsilon$ and with probability at least $1 - \delta$, requiring a computational cost that is $O\!\left(\frac{s\,t^2\,\mathrm{Tr}(H)^2}{\varepsilon^2}\log^2\frac{1}{\delta}\right)$ in time and $O\!\left(\frac{t^2\,\mathrm{Tr}(H)^2}{\varepsilon^2}\log^2\frac{1}{\delta}\right)$ in memory.
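To get a feeling for the magnitudes involved, the conditions of Eq. (4.74) can be evaluated for concrete parameters. The sketch below rounds both bounds up to integers and then checks the two error terms that appear in the proof of Theorem 10; the function name `theorem10_parameters` and the example values are assumptions made for illustration.

```python
import math

def theorem10_parameters(t, norm_H, trace_H, eps, delta):
    """Smallest integers K and M satisfying the conditions of Eq. (4.74)."""
    K = math.ceil(math.e * t * norm_H + math.log(2 / eps))
    x = 36 * trace_H * t / (eps * delta)
    M = math.ceil(max(405 * trace_H, 72 * trace_H * t / eps * math.log(x)))
    return K, M

t, norm_H, trace_H, eps, delta = 1.0, 1.0, 2.0, 0.1, 0.1
K, M = theorem10_parameters(t, norm_H, trace_H, eps, delta)

# Truncation error term: (t ||H||)^(K+1) / (K+1)!
truncation_term = (t * norm_H) ** (K + 1) / math.factorial(K + 1)
# Sampling error term: 18 Tr(H) t / M * log(M / (2 delta))
sampling_term = 18 * trace_H * t / M * math.log(M / (2 * delta))

print(K, M, truncation_term, sampling_term)   # both terms should be at most eps / 2
```

Even for this small example the number of samples $M$ is in the tens of thousands, which reflects the $1/\varepsilon$ dependence of the bound on $M$.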

In the following we now prove the first main result of this work. To prove Theorem 10 we decompose the error into multiple contributions. Lemma 13 performs a basic decomposition of the error in terms of the distance between $H$ and the approximation $\widehat{H}$ as well as in terms of the approximation $g_K$ with respect to $g$. Lemma 14 then provides an analytic bound on the distance between $H$ and $\widehat{H}$, expressed in terms of the expectation of eigenvalues or related matrices which are then concentrated in Lemma 15.

Lemma 13. Let $K, M \in \mathbb{N}$ and $t > 0$, then

$$\|\psi(t) - \widehat{\psi}_{K,M}(t)\| \le t\,\|H - \widehat{H}\| + \frac{(t\,\|\widehat{H}\|)^{K+1}}{(K+1)!}.$$

Proof. By definition we have that $e^{ixt} = 1 + g(x)x$ with $g(x) = \sum_{k \ge 1} x^{k-1}(it)^k/k!$, and $g_K$ is the truncated version of $g$. By adding and subtracting $e^{i\widehat{H}t}$, we have

$$\left\|e^{iHt}\psi - (I + g_K(\widehat{H})\widehat{H})\psi\right\| \le \|\psi\|\left(\left\|e^{iHt} - e^{i\widehat{H}t}\right\| + \left\|e^{i\widehat{H}t} - (I + g_K(\widehat{H})\widehat{H})\right\|\right). \tag{4.75}$$

By [130],

$$\left\|e^{iHt} - e^{i\widehat{H}t}\right\| \le t\,\|H - \widehat{H}\|, \tag{4.76}$$

moreover, by [131], and since $\widehat{H}$ is Hermitian and hence all the eigenvalues are real, we have

$$\left\|e^{i\widehat{H}t} - (I + g_K(\widehat{H})\widehat{H})\right\| \le \frac{(t\,\|\widehat{H}\|)^{K+1}}{(K+1)!}\sup_{l \in [0,1]}\left\|i^{K+1}e^{il\widehat{H}t}\right\| \le \frac{(t\,\|\widehat{H}\|)^{K+1}}{(K+1)!}. \tag{4.77}$$

Finally note that $\|\psi\| = 1$.

To study the norm $\|H - \widehat{H}\|$ note that, since $H$ is positive semidefinite, there exists an operator $S$ such that $H = SS^H$, so $H_{i,j} = s_i^H s_j$ with $s_i, s_j$ the $i$-th and $j$-th row of $S$. Denote with $C$ and $\widetilde{C}$ the operators

$$C = S^HS, \qquad \widetilde{C} = \frac{1}{M}\sum_{j=1}^{M} \frac{\mathrm{Tr}(H)}{h_{t_j}}\,s_{t_j}s_{t_j}^H.$$

We then obtain the following result.

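Lemma 14. The following holds with probability 1. For any $\tau > 0$,

$$\|H - \widehat{H}\| \le \frac{\tau}{1 - \beta(\tau)}, \qquad \beta(\tau) = \lambda_{\max}\!\left((C + \tau I)^{-1/2}(C - \widetilde{C})(C + \tau I)^{-1/2}\right), \tag{4.78}$$

moreover $\|\widehat{H}\| \le \|H\|$.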

Proof. Define the selection matrix $V \in \mathbb{C}^{M \times 2^n}$, which is zero everywhere except for one element in each row, namely $V_{j,t_j} = 1$ for $1 \le j \le M$. Then we have that

$$A = HV^H, \qquad B = VHV^H,$$

i.e., $A$ is again given by the rows according to the sampled indices $t_1, \ldots, t_M$ and $B$ is the submatrix obtained from taking the rows and columns according to the same indices. In particular, by denoting with $\widehat{P}$ the operator $\widehat{P} = S^HV^H(VSS^HV^H)^+VS$, and recalling that $H = SS^H$ and $C = S^HS$, we have

$$\widehat{H} = AB^+A^H = SS^HV^H(VSS^HV^H)^+VSS^H = S\widehat{P}S^H.$$

By definition $\widehat{P}$ is an orthogonal projection operator: indeed it is symmetric and, since $Q^+QQ^+ = Q^+$ for any matrix $Q$, we have

$$\widehat{P}^2 = S^HV^H\left[(VSS^HV^H)^+(VSS^HV^H)(VSS^HV^H)^+\right]VS = S^HV^H(VSS^HV^H)^+VS = \widehat{P}. \tag{4.79}$$

Indeed this is a projection in the row space of the matrix $R := S^HV^H$, since with the singular value decomposition $R := U_R\Sigma_RV_R^H$ we have

$$\widehat{P} = R^H(RR^H)^+R = V_R\Sigma_RU_R^H\,U_R\Sigma_R^{-2}U_R^H\,U_R\Sigma_RV_R^H = V_RV_R^H, \tag{4.80}$$

which spans the same space as $R$. Finally, since $(I - \widehat{P}) = (I - \widehat{P})^2$, and $\|Z^HZ\| = \|Z\|^2$, we have

$$\|H - \widehat{H}\| = \|S(I - \widehat{P})S^H\| = \|S(I - \widehat{P})^2S^H\| = \|(I - \widehat{P})S^H\|^2. \tag{4.81}$$

Note that $\widetilde{C}$ can be rewritten as $\widetilde{C} = S^HV^HLVS$, with $L$ a diagonal matrix with $L_{jj} = \frac{\mathrm{Tr}(H)}{M h_{t_j}}$. Moreover $t_j$ is sampled from the probability $p(q) = h_q/\mathrm{Tr}(H)$, so $h_{t_j} > 0$ with probability 1; then $L$ has a finite and strictly positive diagonal, so $\widetilde{C}$ has the same range as $\widehat{P}$. Now, with $C = S^HS$, we are able to apply Proposition 3 and Proposition 7 of [55], and obtain

$$\|(I - \widehat{P})S^H\|^2 \le \frac{\tau}{1 - \beta(\tau)}, \qquad \beta(\tau) = \lambda_{\max}\!\left((C + \tau I)^{-1/2}(C - \widetilde{C})(C + \tau I)^{-1/2}\right). \tag{4.82}$$

Finally, note that, since $\widehat{P}$ is a projection operator we have that $\|\widehat{P}\| = 1$, so

$$\|\widehat{H}\| = \|S\widehat{P}S^H\| \le \|\widehat{P}\|\,\|S\|^2 \le \|S\|^2 = \|H\|,$$

where the last step is due to the fact that $H = SS^H$.

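Lemma 15. Let $\delta \in (0,1]$ and $\tau > 0$. When

$$M \ge \max\left(405\,\mathrm{Tr}(H),\ 67\,\mathrm{Tr}(H)\log\frac{\mathrm{Tr}(H)}{2\delta}\right), \qquad \tau = \frac{9\,\mathrm{Tr}(H)}{M}\log\frac{M}{2\delta}, \tag{4.83}$$

then with probability $1 - \delta$ it holds that

$$\lambda_{\max}\!\left((C + \tau I)^{-1/2}(C - \widetilde{C})(C + \tau I)^{-1/2}\right) \le \frac{1}{2}.$$

Proof. Define the random variable $\zeta_j = \sqrt{\frac{\mathrm{Tr}(H)}{h_{t_j}}}\,s_{t_j}$, for $1 \le j \le M$. Note that

$$\|\zeta_j\| \le \sqrt{\frac{\mathrm{Tr}(H)}{h_{t_j}}}\,\|s_{t_j}\| \le \sqrt{\mathrm{Tr}(H)},$$

almost surely. Moreover,

$$\mathbb{E}\,\zeta_j\zeta_j^H = \sum_{q=1}^{2^n} p(q)\,\frac{\mathrm{Tr}(H)}{h_q}\,s_qs_q^H = \sum_{q=1}^{2^n} s_qs_q^H = S^HS = C. \tag{4.84}$$

By definition of $\zeta_j$, we have

$$\widetilde{C} = \frac{1}{M}\sum_{j=1}^{M} \zeta_j\zeta_j^H.$$

Since the $\zeta_j$ are independent for $1 \le j \le M$, uniformly bounded, with expectation equal to $C$, and with $\zeta_j^H(C + \tau I)^{-1}\zeta_j \le \|\zeta_j\|^2\tau^{-1} \le \mathrm{Tr}(H)\,\tau^{-1}$, we can apply Proposition 8 of [55], which uses the non-commutative Bernstein inequality for linear operators [132], and obtain

$$\lambda_{\max}\!\left((C + \tau I)^{-1/2}(C - \widetilde{C})(C + \tau I)^{-1/2}\right) \le \frac{2\alpha}{3M} + \sqrt{\frac{2\alpha}{Mt}}, \tag{4.85}$$

with probability at least $1 - \delta$, with $\alpha = \log\frac{4\,\mathrm{Tr}(C)}{\tau\delta}$. Since

$$\mathrm{Tr}(C) = \mathrm{Tr}(S^HS) = \mathrm{Tr}(SS^H) = \mathrm{Tr}(H),$$

by Remark 1 of [55], we have that

$$\lambda_{\max}\!\left((C + \tau I)^{-1/2}(C - \widetilde{C})(C + \tau I)^{-1/2}\right) \le \frac{1}{2},$$

with probability $1 - \delta$, when $M \ge \max\left(405\kappa^2,\ 67\kappa^2\log\frac{\kappa^2}{2\delta}\right)$ and $\tau$ satisfies $\frac{9\kappa^2}{M}\log\frac{M}{2\delta} \le \tau \le \|C\|$ (note that $\|C\| = \|H\|$), where $\kappa^2$ is a bound for the following quantity

$$\inf_{\tau > 0}\left[(\|C\| + \tau)\left(\operatorname{ess\,sup}\,\zeta_j^H(C + \tau I)^{-1}\zeta_j\right)\right] \le \mathrm{Tr}(H)\inf_{\tau > 0}\frac{\|H\| + \tau}{\tau} \le \mathrm{Tr}(H) := \kappa^2, \tag{4.86}$$

where $\operatorname{ess\,sup}$ here denotes the essential supremum.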

Now we are ready to prove Theorem 10.

Proof of Theorem 10. By Lemma 13, we have

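\[ \big\| e^{iHt}\psi - (I + g_K(\hat{H})\hat{H})\psi \big\| \le t\,\|H-\hat{H}\| + \frac{(t\|\hat{H}\|)^{K+1}}{(K+1)!}. \]

Let $\tau > 0$. By Lemma 14, we know that $\|\hat{H}\| \le \|H\|$ and that

\[ \|H-\hat{H}\| \le \frac{\tau}{1-\beta(\tau)}, \qquad \beta(\tau) = \lambda_{\max}\big((C+\tau I)^{-1/2}(C-\tilde{C})(C+\tau I)^{-1/2}\big), \]

with probability 1. Finally by Lemma 15, we have that the following holds with probability $1-\delta$,

\[ \lambda_{\max}\big((C+\tau I)^{-1/2}(C-\tilde{C})(C+\tau I)^{-1/2}\big) \le \frac{1}{2}, \]

when

\[ M \ge \max\Big(405\,\mathrm{Tr}(H),\; 67\,\mathrm{Tr}(H)\log\frac{\mathrm{Tr}(H)}{2\delta}\Big) \quad\text{and}\quad \tau = \frac{9\,\mathrm{Tr}(H)}{M}\log\frac{M}{2\delta}. \]

So we have

\[ \big\| e^{iHt}\psi - (I + g_K(\hat{H})\hat{H})\psi \big\| \le \frac{18\,\mathrm{Tr}(H)\,t}{M}\log\frac{M}{2\delta} + \frac{(t\|H\|)^{K+1}}{(K+1)!}, \]

with probability $1-\delta$. Now we select $K$ such that $\frac{(t\|H\|)^{K+1}}{(K+1)!} \le \frac{\varepsilon}{2}$. By the Stirling approximation, we have

\[ (K+1)! \ge \sqrt{2\pi}\,(K+1)^{K+3/2}e^{-K-1} \ge (K+1)^{K+1}e^{-K-1}. \]

Since $(1+x)\log\big(1/(1+x)\big) \le -x$ for $x > 0$, we can select $K = e\,t\|H\| + \log\frac{2}{\varepsilon} - 1$, such that we have

\[
\begin{aligned}
\log\frac{(t\|H\|)^{K+1}}{(K+1)!} &\le (K+1)\log\frac{e\,t\|H\|}{K+1} \\
&\le e\,t\|H\|\left(1 + \frac{\log\frac{2}{\varepsilon}}{e\,t\|H\|}\right)\log\frac{1}{1 + \frac{\log\frac{2}{\varepsilon}}{e\,t\|H\|}} \\
&\le \log\frac{\varepsilon}{2}.
\end{aligned}
\]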

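Finally we require $M$ such that

\[ \frac{18\,\mathrm{Tr}(H)\,t}{M}\log\frac{M}{2\delta} \le \frac{\varepsilon}{2}, \]

and select

\[ M = \frac{72\,\mathrm{Tr}(H)\,t}{\varepsilon}\log\frac{36\,\mathrm{Tr}(H)\,t}{\varepsilon\delta}. \]

Then we have that

\[ \frac{18\,\mathrm{Tr}(H)\,t}{M}\log\frac{M}{2\delta} \le \frac{\varepsilon}{2}\cdot\frac{\log\frac{36\,\mathrm{Tr}(H)\,t}{\varepsilon\delta} + \log\log\frac{36\,\mathrm{Tr}(H)\,t}{\varepsilon\delta}}{2\log\frac{36\,\mathrm{Tr}(H)\,t}{\varepsilon\delta}} \le \frac{\varepsilon}{2}. \]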
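As a numerical sanity check of the two parameter choices made in this proof, the following Python sketch evaluates $K$ and $M$ for given values of $\mathrm{Tr}(H)$, $\|H\|$, $t$, $\varepsilon$ and $\delta$, and returns the truncation and sampling error terms, each of which should stay below $\varepsilon/2$. The function name, the argument names and the rounding of $K$ and $M$ up to integers are illustrative assumptions and not part of the analysis above.

```python
import math


def spd_simulation_parameters(trace_h, norm_h, t, eps, delta):
    """Evaluate the choices of K and M from the proof (illustrative helper).

    Assumes t * norm_h > 0 and 36 * trace_h * t / (eps * delta) > 1.
    """
    # K = e t ||H|| + log(2/eps) - 1, rounded up to an integer (our assumption).
    K = math.ceil(math.e * t * norm_h + math.log(2.0 / eps) - 1.0)
    # M = (72 Tr(H) t / eps) * log(36 Tr(H) t / (eps delta)), rounded up.
    a = 36.0 * trace_h * t / (eps * delta)
    M = math.ceil(72.0 * trace_h * t / eps * math.log(a))
    # Truncation term (t ||H||)^(K+1) / (K+1)!, evaluated in log-space.
    trunc = math.exp((K + 1) * math.log(t * norm_h) - math.lgamma(K + 2))
    # Sampling term (18 Tr(H) t / M) * log(M / (2 delta)).
    sampling = 18.0 * trace_h * t / M * math.log(M / (2.0 * delta))
    return K, M, trunc, sampling
```

For instance, `spd_simulation_parameters(4.0, 2.0, 1.0, 0.1, 0.1)` can be used to confirm that both returned error terms are at most `0.05`.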

Next, we generalise this result to arbitrary Hermitian matrices under the assumption that these are row-searchable, i.e. assuming the ability to sample according to some leverage of the rows. This will lead to our second result for classical Hamiltonian simulation with the Nyström method.

Algorithm for row-searchable Hermitian matrices. As mentioned, in this section we generalise our previous result and derive an algorithm for simulating arbitrary Hermitian matrices. We again provide guarantees on the runtime and errors of the algorithm, under the assumption that H is row-searchable. Our algorithm for the general case has slightly worse guarantees compared to the SPD case, which is to be expected. In the following, let s again be the maximum number of non-zero elements in any of the rows of H, ε the error in the approximation of the output states of the algorithm w.r.t. the ideal ψ(t), and t the evolution time of the simulation. Let further K be the order of the truncated series expansions and M the number of samples we take for the approximation.

In the following we again start by describing the algorithm, and then derive bounds on the runtime and error.

For arbitrary matrices $H$ we will use the following algorithm. Sample $M \in \mathbb{N}$ independent indices $t_1,\dots,t_M$, with probability $p(i) = \frac{\|h_i\|^2}{\|H\|_F^2}$, $1 \le i \le 2^n$, where $h_i$ is the $i$-th row of $H$ (sample via Alg. 3). Let $A \in \mathbb{C}^{2^n \times M}$ be the matrix defined by

\[ A = \left[ \frac{1}{\sqrt{M p(t_1)}}\, h_{t_1},\; \dots,\; \frac{1}{\sqrt{M p(t_M)}}\, h_{t_M} \right]. \]

We then approximate $H$ via the matrix $\hat{H}^2 = AA^H$. Next, we again define two functions that we will use to approximate the exponential $e^{ix}$,

\[ f(x) = \frac{\cos(\sqrt{x}) - 1}{x}, \qquad g(x) = \frac{\sin(\sqrt{x}) - \sqrt{x}}{x\sqrt{x}}, \]

and denote with $f_K$ and $g_K$ the $K$-truncated Taylor expansions of $f$ and $g$, for $K \in \mathbb{N}$, i.e.,

\[ f_K(x) = \sum_{j=0}^{K} \frac{(-1)^{j+1}x^j}{(2j+2)!}, \qquad g_K(x) = \sum_{j=0}^{K} \frac{(-1)^{j+1}x^j}{(2j+3)!}. \]

In particular note that

\[ e^{ix} = 1 + ix + f(x^2)\,x^2 + i\,g(x^2)\,x^3. \]

Similar in spirit to the previous approach for SPD $H$, we hence approximate $e^{ix}$ via the functions $f_K$ and $g_K$. The final approximation is then given by

\[ \hat{\psi}_{K,M}(t) = \psi + itu + t^2 A f_K(t^2 A^H A)\,v + it^3 A g_K(t^2 A^H A)\,z, \tag{4.87} \]

where $u = H\psi$, $v = A^H\psi$, $z = A^H u$. The products $f_K(t^2 A^H A)v$ and $A g_K(t^2 A^H A)z$ are computed by again exploiting the Taylor series form of the two functions and performing only matrix-vector products, similar to Alg. 4. Recall that $s$ is the maximum number of non-zero elements in the rows of $H$, and $q$ the number of non-zero elements in $\psi$. The algorithm then requires $O(sq)$ in space and time to compute $u$, $O(M\min(s,q))$ in time and $O(M)$ in space to compute $v$, and $O(Ms)$ in time and space to compute $z$. We therefore obtain a total computational complexity of

time : O(sq + M min(s,q) + sMK), (4.88)

space : O(s(q + M)). (4.89)
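The steps above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions rather than the implementation analysed here: it uses dense arrays instead of sparse, row-searchable access, it draws indices with `numpy.random.Generator.choice` in place of the binary-tree sampler of Alg. 3, and it stacks columns of $H$ (which for Hermitian $H$ are the conjugated rows) so that $\mathbb{E}[AA^H] = H^2$ also holds for complex entries. All function names are hypothetical.

```python
import math
import numpy as np


def sample_sketch(H, M, rng):
    """Draw M column indices with p(i) = ||h_i||^2 / ||H||_F^2 and build A."""
    norms2 = np.sum(np.abs(H) ** 2, axis=0)      # squared column norms
    p = norms2 / norms2.sum()
    idx = rng.choice(H.shape[0], size=M, p=p)
    return H[:, idx] / np.sqrt(M * p[idx])       # rescale each sampled column


def apply_truncated_series(coeffs, A, w, t):
    """Return sum_j c_j (t^2 A^H A)^j w using only matrix-vector products."""
    out = np.zeros_like(w)
    term = w.copy()
    for c in coeffs:
        out = out + c * term
        term = t ** 2 * (A.conj().T @ (A @ term))
    return out


def evolve_with_sketch(H, A, psi, t, K):
    """Evaluate the approximation of Eq. (4.87) for a given sketch A."""
    psi = np.asarray(psi, dtype=complex)
    f_coeffs = [(-1) ** (j + 1) / math.factorial(2 * j + 2) for j in range(K + 1)]
    g_coeffs = [(-1) ** (j + 1) / math.factorial(2 * j + 3) for j in range(K + 1)]
    u = H @ psi                # u = H psi
    v = A.conj().T @ psi       # v = A^H psi
    z = A.conj().T @ u         # z = A^H u
    return (psi + 1j * t * u
            + t ** 2 * (A @ apply_truncated_series(f_coeffs, A, v, t))
            + 1j * t ** 3 * (A @ apply_truncated_series(g_coeffs, A, z, t)))
```

Passing `A = H` gives $AA^H = H^2$ exactly, so `evolve_with_sketch(H, H, psi, t, K)` should reproduce $e^{iHt}\psi$ up to the truncation error alone, which is a convenient way to separate truncation error from sampling error.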

Note that if $s > M$ it is possible to further reduce the memory requirements at the cost of more computational time, by computing $B = A^H A$, which can be done in blocks and requires $O(sM^2)$ in time and $O(M^2)$ in memory, and then computing

\[ \hat{\psi}_{K,M}(t) = \psi + itu + t^2 A f_K(t^2 B)\,v + it^3 A g_K(t^2 B)\,z. \]

In this case the computational cost would be

\[ \text{time}: O\big(sq + M\min(s,q) + M^2(s+K)\big), \tag{4.90} \]

\[ \text{space}: O\big(sq + M^2\big). \tag{4.91} \]

The properties of this algorithm are summarised in the following theorem, which is a formal statement of Theorem 8:

Theorem 11 (Algorithm for simulating row-samplable Hermitian matrices). Let $\delta,\varepsilon \in (0,1]$. Let $t > 0$ and $K, M \in \mathbb{N}$, where $K$ is the number of terms in the truncated series expansions of $g(\hat{H})$ and $M$ the number of samples we take for the approximation. Let $\psi(t)$ be the true evolution (Eq. 4.70) and let $\hat{\psi}_{K,M}(t)$ be computed as in Eq. 4.87. When

\[ M \ge \frac{256\,t^4\,(1+t^2\|H\|^2)\,\|H\|_F^2\,\|H\|^2}{\varepsilon^2}\,\log\frac{4\|H\|_F^2}{\delta\,\|H\|^2}, \tag{4.92} \]

\[ K \ge 4t\sqrt{\|H\|^2+\varepsilon} + \log\frac{4(1+t\|H\|)}{\varepsilon}, \tag{4.93} \]

then

\[ \|\hat{\psi}_{K,M}(t) - \psi(t)\| \le \varepsilon, \]

with probability at least $1-\delta$.

Note that with the result above, we have that $\hat{\psi}_{K,M}(t)$ in Eq. (4.87) approximates $\psi(t)$, with error at most $\varepsilon$ and with probability at least $1-\delta$, requiring a computational cost that is $O\big(sq + M\min(s,q) + M^2(s+K)\big)$ in time and $O\big(sq + M^2\big)$ in memory. Combining Eq. 4.90 and 4.91 with Eq. 4.92 and 4.93, the whole computational complexity of the algorithm described in this section is

\[ \text{time}: O\!\left(sq + \frac{t^9\,\|H\|_F^4\,\|H\|^7}{\varepsilon^4}\Big(n + \log\frac{1}{\delta}\Big)^2\right), \tag{4.94} \]

\[ \text{space}: O\!\left(sq + \frac{t^8\,\|H\|_F^4\,\|H\|^6}{\varepsilon^4}\Big(n + \log\frac{1}{\delta}\Big)^2\right), \tag{4.95} \]

where the quantity $\log\frac{4\|H\|_F^2}{\delta\|H\|^2}$ in Eq. 4.92 was bounded using the following inequality

\[ \log\frac{\|H\|_F^2}{\|H\|^2} \le \log\frac{2^n\lambda_{\mathrm{MAX}}^2}{\lambda_{\mathrm{MAX}}^2} = n, \]

where $\lambda_{\mathrm{MAX}}$ is the biggest eigenvalue of $H$.

Observe now that simulating the time evolution of $\alpha I$ only changes the phase of the time evolution, where $I \in \mathbb{C}^{N\times N}$ is the identity matrix and $\alpha$ some real parameter. We can hence perform the time evolution of $\tilde{H} := H - \alpha I$, since for any efficient classical description of the input state we can apply the time evolution of the diagonal matrix $e^{-i\alpha I t}$. We can then optimise the parameter $\alpha$ such that the Frobenius norm of the operator $\tilde{H}$ is minimised, i.e.

\[ \alpha = \operatorname*{argmin}_{\alpha}\|\tilde{H}\|_F^2 = \operatorname*{argmin}_{\alpha}\|H - \alpha I\|_F^2, \tag{4.96} \]

from which we obtain the condition $\alpha = \frac{\mathrm{Tr}(H)}{2^n}$. In order for the algorithm to be efficient, we require that $\|\tilde{H}\|_F$ is bounded by $\mathrm{polylog}\,N$. Using the spectral theorem, and the fact that the Frobenius norm is unitarily invariant, this in turn gives us after a bit of algebra the condition

\[ \|H\|_F^2 - \frac{1}{N}\mathrm{Tr}(H)^2 \le O(\mathrm{polylog}(N)), \tag{4.97} \]

for which we can simulate the Hamiltonian $H$ efficiently.

We now prove the second main result of this work and establish the correctness of the above results.
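For reference, the algebra behind Eqs. (4.96) and (4.97) can be sketched as follows. The derivation only uses that $H$ is Hermitian, that $\alpha$ is real and that $\mathrm{Tr}(I) = N = 2^n$; reading the efficiency requirement as a bound on the squared Frobenius norm is an assumption made here, which only affects the degree of the polylogarithm.

```latex
\begin{align*}
\|H - \alpha I\|_F^2
  &= \operatorname{Tr}\big((H - \alpha I)^2\big)
   = \|H\|_F^2 - 2\alpha \operatorname{Tr}(H) + \alpha^2 N, \\
\frac{d}{d\alpha}\|H - \alpha I\|_F^2
  &= -2\operatorname{Tr}(H) + 2\alpha N = 0
   \;\Longrightarrow\; \alpha = \frac{\operatorname{Tr}(H)}{N}, \\
\min_{\alpha}\|H - \alpha I\|_F^2
  &= \|H\|_F^2 - \frac{\operatorname{Tr}(H)^2}{N}.
\end{align*}
```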

Proof of Theorem 11. Denote with

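\[
\begin{aligned}
\hat{Z}_K(Ht, At) &= I + itH + t^2 A f_K(t^2 A^H A) A^H + it^3 A g_K(t^2 A^H A) A^H H, \\
\hat{Z}(Ht, At) &= I + itH + t^2 A f(t^2 A^H A) A^H + it^3 A g(t^2 A^H A) A^H H.
\end{aligned}
\]

By definition of $\hat{\psi}_{K,M}(t)$ and the fact that $\|\psi\| = 1$, we have

\[
\begin{aligned}
\|\hat{\psi}_{K,M}(t) - \psi(t)\| &\le \big\|\hat{Z}_K(Ht, At) - e^{iHt}\big\|\,\|\psi\| \\
&\le \big\|\hat{Z}_K(Ht, At) - \hat{Z}(Ht, At)\big\| + \big\|\hat{Z}(Ht, At) - e^{iHt}\big\|.
\end{aligned}
\]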

We first study $\|\hat{Z}(Ht, At) - e^{iHt}\|$. Define $l(x) = f(x)x$ and $m(x) = g(x)x$. Note that, by the spectral theorem, we have

\[
\begin{aligned}
\hat{Z}(Ht, At) &= I + itH + t^2 A f(t^2 A^H A) A^H + it^3 A g(t^2 A^H A) A^H H \\
&= I + itH + t^2 f(t^2 AA^H)\,AA^H + it^3 g(t^2 AA^H)\,AA^H H \\
&= I + itH + l(t^2 AA^H) + it\,m(t^2 AA^H)\,H.
\end{aligned}
\]

Since $e^{ixt} = 1 + ixt + l(t^2x^2) + it\,m(t^2x^2)\,x$, we have

\[
\begin{aligned}
\big\|\hat{Z}(Ht, At) - e^{iHt}\big\| &= \big\|l(t^2 AA^H) - l(t^2 H^2) + it\,m(t^2 AA^H)H - it\,m(t^2 H^2)H\big\| \\
&\le \big\|l(t^2 AA^H) - l(t^2 H^2)\big\| + t\,\big\|m(t^2 AA^H) - m(t^2 H^2)\big\|\,\|H\|.
\end{aligned}
\]

To bound the norms in $l, m$ we will apply Thm. 1.4.1 of [133]. The theorem states that if a function $f \in L^\infty(\mathbb{R})$, i.e. $f$ is in the function space whose elements are the essentially bounded measurable functions, is entire on $\mathbb{C}$ and satisfies $|f(z)| \le e^{\sigma|z|}$ for any $z \in \mathbb{C}$, then $\|f(A) - f(B)\| \le \sigma\,\|f\|_{L^\infty(\mathbb{R})}\,\|A - B\|$. Note that

\[ |l(z)| = \left|\sum_{j=1}^{\infty}\frac{(-1)^j z^j}{(2j)!}\right| \le \sum_{j=1}^{\infty}\frac{|z|^j}{(2j)!} \le \sum_{j=1}^{\infty}\frac{|z|^j}{j!} \le e^{|z|}, \]

\[ |m(z)| = \left|\sum_{j=1}^{\infty}\frac{(-1)^j z^j}{(2j+1)!}\right| \le \sum_{j=1}^{\infty}\frac{|z|^j}{(2j+1)!} \le \sum_{j=1}^{\infty}\frac{|z|^j}{j!} \le e^{|z|}. \]

Moreover it is easy to see that $\|l\|_{L^\infty(\mathbb{R})}, \|m\|_{L^\infty(\mathbb{R})} \le 2$. So

\[
\begin{aligned}
\big\|\hat{Z}(Ht, At) - e^{iHt}\big\| &\le 2(1 + t\|H\|)\,\big\|t^2 AA^H - t^2 H^2\big\| \\
&= 2t^2(1 + t\|H\|)\,\big\|AA^H - H^2\big\|.
\end{aligned}
\]

Now note that, by defining the random variable $\zeta_i = \frac{1}{p(t_i)}\,h_{t_i}h_{t_i}^H$, we have that

\[ AA^H = \frac{1}{M}\sum_{i=1}^{M}\zeta_i, \qquad \mathbb{E}[\zeta_i] = \sum_{q=1}^{2^n} p(q)\,\frac{1}{p(q)}\,h_q h_q^H = H^2, \quad \forall i. \]

Let $\tau > 0$. By applying Thm. 1 of [134] (or Prop. 9 in [55]), we have

\[ \big\|AA^H - H^2\big\| \le \sqrt{\frac{\|H\|_F^2\,\|H\|^2\,\tau}{M}} + \frac{2\,\|H\|_F^2\,\|H\|^2\,\tau}{M}, \]

with probability at least $1 - 4\frac{\|H\|_F^2}{\|H\|^2}\,\tau/(e^\tau - \tau - 1)$. Now since

\[ 1 - 4\frac{\|H\|_F^2}{\|H\|^2}\,\tau/(e^\tau - \tau - 1) \ge 1 - e^{\tau}, \]

when $\tau \ge e$, by selecting

\[ \tau = 2\log\frac{4\|H\|_F^2}{\|H\|^2\,\delta}, \]

we have that the equation above holds with probability at least $1-\delta$. Let $\eta > 0$; by selecting $M = 4\eta^{-2}\,\|H\|_F^2\,\|H\|^2\,\tau$, we then obtain

\[ \big\|AA^H - H^2\big\| \le \eta, \]

with probability at least $1-\delta$.

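Now we study $\|\hat{Z}_K(Ht, At) - \hat{Z}(Ht, At)\|$. Denote with $a$, $b$ the functions $a(x) = l(x^2)$, $b(x) = m(x^2)$ and with $a_K, b_K$ the associated $K$-truncated Taylor expansions. Note that $a(x) = \cos(x) - 1$, while $b(x) = (\sin(x) - x)/x$. Now by definition of $\hat{Z}_K$ and $\hat{Z}$, we have

\[
\begin{aligned}
\big\|\hat{Z}_K(Ht, At) - \hat{Z}(Ht, At)\big\| \le{}& \big\|a_K\big(t\sqrt{AA^H}\big) - a\big(t\sqrt{AA^H}\big)\big\| \\
&+ t\,\|H\|\,\big\|b_K\big(t\sqrt{AA^H}\big) - b\big(t\sqrt{AA^H}\big)\big\|.
\end{aligned}
\]

Note that $\sum_j x^{2j}/(2j)! = \cosh(x) \le 2e^{|x|}$, and hence

\[
\begin{aligned}
|(a_K - a)(x)| &= \left|\sum_{j=K+2}^{\infty}\frac{(-1)^j x^{2j}}{(2j)!}\right| \\
&\le \frac{|x|^{2K+4}}{(2K+4)!}\sum_{j=0}^{\infty}\frac{|x|^{2j}}{(2j)!}\,\frac{(2K+4)!\,(2j)!}{(2j+2K+4)!} \\
&\le \frac{2\,|x|^{2K+4}\,e^{|x|}}{(2K+4)!},
\end{aligned}
\]

\[
\begin{aligned}
|(b_K - b)(x)| &= \left|\sum_{j=K+2}^{\infty}\frac{(-1)^j x^{2j}}{(2j+1)!}\right| \\
&\le \frac{|x|^{2K+4}}{(2K+4)!}\sum_{j=0}^{\infty}\frac{|x|^{2j}}{(2j)!}\,\frac{(2K+4)!\,(2j)!}{(2j+2K+5)!} \\
&\le \frac{2\,|x|^{2K+4}\,e^{|x|}}{(2K+4)!}.
\end{aligned}
\]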

Let $R > 0$, $\beta \in (0,1]$. Now note that, by the Stirling approximation, $c! \ge e^{c\log\frac{c}{e}}$, so by selecting $K = \frac{e^2}{2}R + \log\frac{1}{\beta}$, we have for any $|x| \le R$,

\[
\begin{aligned}
\log\frac{2\,|x|^{2K+4}\,e^{|x|}}{(2K+4)!} &\le |x| + (2K+4)\log\frac{e|x|}{2K+4} \\
&\le R + (2K+4)\log\frac{eR}{2K+4} \\
&\le R - \Big(e^2R + 4 + \log\frac{1}{\beta}\Big)\log\Big(e + \frac{4 + \log\frac{1}{\beta}}{eR}\Big) \\
&\le R - \Big(e^2R + 4 + \log\frac{1}{\beta}\Big) \\
&\le -(e^2-1)R - 4 - \log\frac{1}{\beta} \\
&\le -\log\frac{1}{\beta}.
\end{aligned}
\]

So, by choosing $K \ge \log\frac{1}{\beta} + e^2R/2$, we have $|a_K(x) - a(x)|, |b_K(x) - b(x)| \le \beta$. With this we finally obtain

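\[ \big\|\hat{Z}_K(Ht, At) - \hat{Z}(Ht, At)\big\| \le 2(1 + t\|H\|)\,\beta, \]

when $K \ge e^2 t\|A\|/2 + \log\frac{1}{\beta}$, and therefore we have

\[ \|\hat{\psi}_{K,M}(t) - \psi(t)\| \le 2t^2(1 + t\|H\|)\,\eta + 2(1 + t\|H\|)\,\beta, \]

with probability at least $1-\delta$, when

\[ M \ge 8\eta^{-2}\,\|H\|_F^2\,\|H\|^2\,\log\frac{4\|H\|_F^2}{\|H\|^2\,\delta}, \qquad K \ge 4t\|A\| + \log\frac{1}{\beta}. \]

In particular, by choosing $\eta = \varepsilon/\big(4t^2(1 + t\|H\|)\big)$ and $\beta = \varepsilon/\big(4(1 + t\|H\|)\big)$, we have

\[ \|\hat{\psi}_{K,M}(t) - \psi(t)\| \le \varepsilon, \]

with probability at least $1-\delta$.

With this result in mind note that, in the event where $\|AA^H - H^2\| \le \varepsilon$, we have that

\[ \big|\,\|AA^H\| - \|H\|^2\,\big| \le \big\|AA^H - H^2\big\| \le \varepsilon, \]

and therefore $\|A\| \le \sqrt{\|H\|^2 + \varepsilon}$.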

4.4.6 Beyond the Nyström method

While the Nyström method has been applied very successfully over the past decade or so, in recent years it has also been improved upon using better approximations for symmetric matrices. The results of [135] indicated that the Nyström method has advantages over the random feature method [136], the closest competitor, both theoretically and empirically. However, as has recently been established, even the Nyström method does not attain high accuracy in general. A model which improves the accuracy of the Nyström method is the so-called prototype model [137, 138]. The prototype model first performs a random sketch on the input matrix $H$, i.e., $C = HS$, where $S$ is a sketching or sampling matrix which samples $M$ rows of $H$, and then computes the intersection matrix $U^*$ as

\[ U^* := \operatorname*{argmin}_{U}\big\|H - CUC^H\big\|_F^2 = C^{+}H(C^{+})^H \in \mathbb{R}^{M\times M}. \]

The model then approximates $H$ by $CU^*C^H$. While the initial versions of the prototype model were not efficient due to the cost of calculating $U^*$, [139] improved these results by approximately calculating the optimal $U^*$, which led to a higher accuracy compared to the Nyström method, but also to an improved runtime. We leave open, as future work, whether these methods can be used to obtain improved classical algorithms for Hamiltonian simulation and quantum machine learning in general.
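To make the difference between the two approximations concrete, the following NumPy sketch builds both from the same column sketch $C$. It is an illustration under our own assumptions: $S$ is taken to be a plain column-selection matrix, the Nyström approximation is written in its standard form $C W^{+} C^H$ with $W = S^H H S$, and the pseudo-inverses are computed exactly with `numpy.linalg.pinv`, i.e. without the approximate computation of $U^*$ from [139]. The function name is hypothetical.

```python
import numpy as np


def nystrom_and_prototype(H, idx):
    """Return the Nystrom and prototype approximations of H for columns idx."""
    C = H[:, idx]                          # sketch C = H S (column selection)
    W = H[np.ix_(idx, idx)]                # W = S^H H S
    H_nystrom = C @ np.linalg.pinv(W) @ C.conj().T
    C_pinv = np.linalg.pinv(C)
    U_star = C_pinv @ H @ C_pinv.conj().T  # U* = C^+ H (C^+)^H
    H_prototype = C @ U_star @ C.conj().T
    return H_nystrom, H_prototype
```

Since $U^*$ minimises $\|H - CUC^H\|_F$ over all $U$, and the Nyström approximation is of the form $CUC^H$ with $U = W^{+}$, the Frobenius error of the prototype approximation is never larger than that of the Nyström approximation for the same $C$.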

4.4.7 Conclusion

As we mentioned in the introduction of the chapter, our results are closely related to the so-called ‘quantum-inspired’ or ‘dequantisation’ results, which followed the work by Tang [17]. It is obvious that our row-searchable condition is fundamentally equivalent to the sample and query access requirement, and both our and other results indicate that the state preparation condition requires a careful assessment in order to determine whether a quantum algorithm provides an actual advantage over its classical equivalent. In particular, our sampling scheme for Hermitian matrices is equivalent to the sampling scheme based on the classical memory structure which we introduced earlier, although for PSD matrices we use a more computationally efficient variant. While we also provide a method based on binary trees to compute the sampling probabilities, the tree structure of the memory immediately allows us to execute such a sampling process in practice. The main difference is therefore that we use a traditional memory structure and provide a fast way to calculate the marginals using a binary tree (see Sec. 4.4.5), while [17] assumes a memory structure which allows one to sample efficiently according to this distribution.

Notably, we believe our results can be further improved by using rejection sampling methods, as is done in all quantum-inspired approaches. Rejection sampling is required in order to achieve faster algorithms, as it allows us to sample from the output distribution rather than writing out the entirety of it, as is the case with our algorithm.

Based on our results for Hamiltonian simulation, and supported by the large amount of recent quantum-inspired results for QML applications, we believe it is less likely that QML algorithms will admit any exponential speedups compared to classical algorithms. The main reason is that, unlike their classical counterparts, most QML algorithms require strong input assumptions. As we have seen, these caveats arise for two reasons. First, the fast loading of large input data into a quantum computer is a very powerful assumption, which indeed leads to similarly powerful classical algorithms. Second, extracting the results from an output quantum state is hard in general. Tang’s [17] breakthrough result in 2018 for the quantum recommendation systems algorithm, which was previously believed to be one of the strongest candidates for a practical exponential speedup in QML, indeed implied that the quantum algorithm does not give an exponential speedup. Tang’s algorithm solves the same problem as the quantum algorithm and just incurs a polynomial slow-down. To do so, it relies on similar methods to the ones we described for Hamiltonian simulation with the Nyström method, and combines these with classical rejection sampling.

While exponential speedups therefore appear not to be possible in general, in specific cases, such as problems with sparsity assumptions, these dequantisation techniques cannot be applied and QML algorithms still might offer the long-sought exponential advantage.

The most well-known quantum algorithm of this resistant type is the HHL algorithm for sparse matrix inversion [36], which indeed is BQP-complete. Although HHL has a range of caveats, see e.g. [43], perhaps it can be useful in certain instances. Works that build on top of HHL, such as Zhao et al. on Gaussian process regression [140] and Lloyd et al. on topological data analysis [141], might indeed overcome these drawbacks and achieve a super-polynomial quantum speedup.

The question whether such quantum or quantum-inspired algorithms will be useful in practice is, however, even harder to answer. Indeed, for algorithms with a theoretical polynomial speedup an actual advantage can only be established through proper benchmarks and performance analysis such as a scaling analysis. The recent work by Arrazola et al. [142], for example, implemented and tested the quantum-inspired algorithms for regression and recommendation systems and compared these against existing state-of-the-art implementations. While this work gives various insights, it is unclear whether the analysis in question is valid. In order to ensure that the algorithms are tested properly, more modern sketching algorithms would need to be used, since the ones that we and other authors rely on do not necessarily provide performance close to the current state-of-the-art. The reason for this is that they were chosen to obtain complexity-theoretic results, i.e., to demonstrate that from a theoretical perspective the quantum algorithms do not have an exponential advantage. As was also pointed out in [21], Dahiya, Konomis, and Woodruff [143] already conducted an empirical study of sketching algorithms for low-rank approximation on both synthetic datasets and the movielens dataset, and reported that their implementation “finds a solution with cost at most 10 times the optimal one [...] but does so 10 times faster”, which is in contrast to the results of Arrazola et al. [142]. Additionally, using self-implemented algorithms and comparing these in a non-optimised way to highly optimised numerical packages such as LAPACK, BLAS, or others, is usually not helpful when evaluating the performance of algorithms. We therefore conclude that quantum machine learning algorithms still might have a polynomial advantage over classical ones, but in practice many challenges exist and need to be overcome. It is hence unclear as of today whether they will be able to provide any meaningful advantage over classical machine learning algorithms. In the future we hope to see results that implement the quantum algorithms and analyse the resulting overhead, and ultimately benchmark these against their quantum-inspired counterparts. However, we believe that such benchmarks will not be possible in the near future, as the requirements on the quantum hardware are far beyond what is possible today.

Chapter 5

Promising avenues for QML

In the previous chapters, we have tried to answer the main questions of this thesis, i.e., whether (a) quantum machine learning algorithms can offer an advantage over their classical counterparts from a statistical perspective, and (b) whether randomised approaches can be used to achieve similarly powerful classical algorithms. We answered both of these questions in the context of supervised quantum machine learning algorithms for classical data, i.e., data that is stored in some form of a classical memory. However, data could also be obtained as the immediate output of a quantum process, and the QML algorithm could therefore also directly use the resulting quantum states as input, which could in principle resolve qRAM-related challenges. Indeed, in this regime (QQ in Fig. 1.1), it is very likely that classical machine learning algorithms will have difficulties, since they are fundamentally not able to use the input except through prior sampling (measurement) of the state, which implies a potential loss of information. The ability to prepare, i.e., generate arbitrary input states with a polynomially sized circuit could therefore enable useful quantum machine learning algorithms.

In this chapter, we briefly discuss a possible future avenue for QML, namely the generative learning of quantum distributions and quantum states. In particular, we provide a method for fully quantum generative training of quantum Boltzmann machines with both visible and hidden units. To do this, we rely on the quantum relative entropy as an objective function, which is significant since prior methods cannot do so due to mathematical challenges. The mathematical challenges are a result of the gradient evaluation which is required in the training process. Our method is highly relevant, as it allows us to efficiently estimate gradients even for nearly parallel states, which has been impossible for all previous methods. We can therefore use our algorithm even to approximate cloning and state preparation for arbitrary input states. In the following we present two novel methods for solving this problem. The first method gives an efficient algorithm for a class of restricted quantum Boltzmann machines with mutually commuting Hamiltonians on the hidden units. In order to train it with the quantum relative entropy as an objective function, we use a variational upper bound. The second one generalises the first result to generic quantum Boltzmann machines by using high-order divided difference methods and linear combinations of unitaries to approximate the exact gradient of the relative entropy. Both methods are efficient under the assumption that Gibbs state preparation is efficient and assuming that the Hamiltonian is a sparse, row-computable matrix.

5.1 Generative quantum machine learning

One objective of QML is to design models that can learn in quantum mechanical settings [144, 5, 145, 146, 147, 148], i.e., models which are able to quickly identify patterns in data given as quantum state vectors, which inhabit an exponentially large vector space [149, 150]. The sheer size of these vectors therefore restricts the computations that can be performed efficiently. As we have seen above, one of the main bottlenecks for many QML applications (e.g. supervised learning) is the input data preparation. While qRAM appears not to be a solution to the common data read-in problem, other methods for the fast preparation of quantum states might enable useful QML applications.

The clearest case where quantum machine learning can therefore provide an advantage is when the input is already in the form of quantum data, i.e., quantum states. Generative models are an alternative to qRAM for efficiently generating quantum states according to a desired distribution. Generative models can derive concise descriptions of large quantum states [151, 152, 153, 28], and are able to prepare states with, in principle, polynomially sized circuits.

| Work | Data | Algorithm | Objective | Hidden Units |
|---|---|---|---|---|
| Wiebe et al. [159] | Classical | Quantum (Classical model) | Maximum likelihood | Yes |
| Amin et al. [156] | Classical | Quantum (Quantum model) | Maximum likelihood | Yes (But cannot be trained) |
| Kieferova et al. [151] | Classical (Tomography) | Quantum | Maximum likelihood (KL-divergence) | Yes |
| Kieferova et al. [151] | Quantum | Quantum | Relative entropy | No |
| Our work | Quantum | Quantum (Restricted H) | Relative entropy (Variational bound) | Yes |
| Our work | Quantum | Quantum | Relative entropy | Yes |

Figure 5.1: Comparison of previous training algorithms for quantum Boltzmann machines. The models have a varying cost function (objective), contain (are able to be trained with) hidden units, and have different input data (classical or quantum).

The most general representation for the input, which is the natural analog of a quantum training set, would be a density operator $\rho$, which is a positive semi-definite trace-1 Hermitian matrix that, roughly speaking, describes a probability distribution over the input quantum state vectors. The goal in quantum generative training is to learn a unitary operator $V : |0\rangle \mapsto \sigma$, which takes as input a quantum state $|0\rangle$ and prepares the density matrix $\sigma$, by taking a small (polynomial) number of samples from $\rho$, with the condition that under some chosen distance measure $D$, $D(\rho,\sigma)$ is small. For example, for the $L_1$ norm this could be $D(\rho,\sigma) = \|\rho - \sigma\|_1$. In the quantum information community this task is known as partial tomography [151] or approximate cloning [154]. A similar task is to replicate a conditional probability distribution over a label subspace. Such approaches play a crucial role in QAOA-based quantum neural networks.

5.1.1 Related work

Various approaches have been put forward to solve the challenge of generative learning in the quantum domain [151, 153, 155, 156, 157], but to date, all proposed solutions suffer from drawbacks due to data input requirements, vanishing gradients, or an inability to learn with hidden units. In this chapter, we present two novel approaches for training quantum Boltzmann machines [151, 158, 156]. Our approaches resolve all restrictions of prior art, and therefore address a major open problem in generative quantum machine learning. Figure 5.1 summarises the key results and prior art.

5.2 Boltzmann and quantum Boltzmann machines

Boltzmann machines are a physics-inspired class of neural network [160, 161, 162, 163, 164, 165, 166, 167], which have gained increasing popularity and found numerous applications over the last decade [168, 166, 169, 170]. More recently, they have been used in the generative description of complex quantum systems [171, 172, 173]. Boltzmann machines are an immediate approach when we want to perform generative learning, since they are a physically inspired model which resembles many common quantum systems. Through this similarity, they are also highly suitable to be implemented on quantum computers. To be more precise, a Boltzmann machine is defined by the energy which arises from the interactions in the physical system. As such, it prescribes an energy to every configuration of this system, and then generates samples from a distribution which assigns probabilities to the states according to the exponential of the state's energy. This distribution is known in classical statistical physics as the canonical ensemble. The explicit model is given by

\[ \sigma_v(H) = \mathrm{Tr}_h\!\left(\frac{e^{-H}}{Z}\right) = \frac{\mathrm{Tr}_h\, e^{-H}}{\mathrm{Tr}\, e^{-H}}, \tag{5.1} \]

where $\mathrm{Tr}_h(\cdot)$ is the partial trace over the so-called hidden subsystem, an auxiliary subsystem which allows the model to build correlations between different nodes in the visible subsystem. In the case of classical Boltzmann machines, the Hamiltonian $H$ is an energy function that assigns to each state an energy, resulting in a diagonal matrix. In the quantum mechanical case, however, we can obtain superpositions (i.e., linear combinations) of different states, and the energy function therefore turns into the Hamiltonian of the quantum Boltzmann machine, i.e., a general Hermitian matrix which has off-diagonal entries.

As mentioned earlier, for a quantum system the dimension of the Hamiltonian grows exponentially with the number of units of the system, such as qubits or orbitals, denoted $n$, i.e., $H \in \mathbb{C}^{2^n \times 2^n}$.

Generative quantum Boltzmann training is then defined as the task of finding the parameters or so-called weights $\theta$, which parameterise the Hamiltonian $H = H(\theta)$, such that

\[ H = \operatorname*{argmin}_{H(\theta)} D\big(\rho, \sigma_v(H)\big), \]

where $D$ is again an appropriately chosen distance, or divergence, function. For example, the quantum analogue of an all-visible Boltzmann machine with $n_v$ units would then take the form

\[ H(\theta) = \sum_{n=1}^{n_v}\Big(\theta_{2n-1}\,\sigma_x^{(n)} + \theta_{2n}\,\sigma_z^{(n)}\Big) + \sum_{n>n'}\theta_{(n,n')}\,\sigma_z^{(n)}\sigma_z^{(n')}. \tag{5.2} \]

Here $\sigma_z^{(n)}$ and $\sigma_x^{(n)}$ are Pauli matrices acting on qubit (unit) $n$. We now provide a formal definition of the quantum Boltzmann machine.

Definition 5. A quantum Boltzmann machine is a quantum mechanical system that acts on a tensor product of Hilbert spaces $\mathcal{H}_v \otimes \mathcal{H}_h \in \mathbb{C}^{2^n}$ that correspond to the visible and hidden subsystems of the Boltzmann machine. It further has a Hamiltonian of the form $H \in \mathbb{C}^{2^n \times 2^n}$ such that $\|H - \mathrm{diag}(H)\| > 0$. The quantum Boltzmann machine takes these parameters and then outputs a state of the form $\mathrm{Tr}_h\!\left(\frac{e^{-H}}{\mathrm{Tr}(e^{-H})}\right)$.

For classical data we hence want to minimise the distance between two different distributions, namely the input and output distribution. The natural way to measure such a distance is the Kullback-Leibler (KL) divergence. In the case of the quantum Boltzmann machine, in contrast, we want to minimise the distance between two quantum states (density matrices), and the natural notion of distance changes to the quantum relative entropy:

S(ρ|σv) = Tr(ρ logρ) − Tr(ρ logσv). (5.3)

For diagonal $\rho$ and $\sigma_v$ this indeed reduces to the (classical) KL divergence, and it is zero if and only if $\rho = \sigma_v$.

For all-visible Boltzmann machines, the gradient of the relative entropy can be computed in a straightforward manner because here $\sigma_v = e^{-H}/Z$, and since $\log(e^{-H}/Z) = -H - \log(Z)$, we can easily compute the matrix derivatives. This is however not the case in general, and no methods are known for the generative training of Boltzmann machines with the quantum relative entropy loss function if hidden units are present. The main challenge which prevents an easy solution in this case is the evaluation of the partial trace in $\log(\mathrm{Tr}_h\, e^{-H}/Z)$, which prevents us from simplifying the logarithm term which is required for the gradient.
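For small systems, the model state of Eq. (5.1) and the relative entropy of Eq. (5.3) can be evaluated directly by dense linear algebra, which is useful for checking intuition such as the reduction to the KL divergence. The following NumPy sketch is a classical brute-force illustration only; the function names, the ordering convention visible ⊗ hidden, and the eigenvalue floor used inside the matrix logarithm are our own assumptions.

```python
import numpy as np


def gibbs_state(H):
    """Return e^{-H} / Tr(e^{-H}) for a Hermitian matrix H."""
    w, V = np.linalg.eigh(H)
    e = np.exp(-(w - w.min()))           # shift cancels in the normalisation
    return (V * (e / e.sum())) @ V.conj().T


def partial_trace_hidden(rho, dim_v, dim_h):
    """Trace out the hidden factor of a state on C^{dim_v} (x) C^{dim_h}."""
    return np.einsum('ajbj->ab', rho.reshape(dim_v, dim_h, dim_v, dim_h))


def qbm_visible_state(H, dim_v, dim_h):
    """sigma_v = Tr_h( e^{-H} / Tr e^{-H} ) as in Eq. (5.1)."""
    return partial_trace_hidden(gibbs_state(H), dim_v, dim_h)


def log_psd(rho, floor=1e-300):
    """Matrix logarithm of a positive semi-definite matrix (floored spectrum)."""
    w, V = np.linalg.eigh(rho)
    return (V * np.log(np.maximum(w, floor))) @ V.conj().T


def relative_entropy(rho, sigma):
    """S(rho | sigma) = Tr(rho log rho) - Tr(rho log sigma), Eq. (5.3)."""
    return float(np.real(np.trace(rho @ (log_psd(rho) - log_psd(sigma)))))
```

For diagonal inputs such as `np.diag([0.5, 0.5])` and `np.diag([0.25, 0.75])`, `relative_entropy` agrees with the classical KL divergence of the two diagonals.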

In the following, we will provide two practical methods for training quantum Boltzmann machines with both hidden and visible units. The first method only applies to a restricted case; however, it allows us to propose a more efficient scheme that relies on a variational upper bound on the quantum relative entropy in order to calculate the derivatives. The second approach is a more general method, but requires more resources. For the second approach we rely on recent techniques from quantum simulation to approximate the exact expression for the gradient. For this we rely on a Fourier series approximation and high-order divided difference formulas in place of the analytic derivative. Under the assumption that Gibbs state preparation is efficient, which we expect to hold in most practical cases, both methods are efficient.

We note, however, that this assumption is not valid in general, since the possibility of efficient Gibbs state preparation in general would imply QMA ⊆ BQP, which is unlikely to hold.

While the results presented here might be of interest to several readers, we include the proofs of the technical lemmas in the appendix of the thesis in order to preserve the flow and keep the presentation simple.

5.3 Training quantum Boltzmann machines

We now present methods for training quantum Boltzmann machines. We begin by defining the quantum relative entropy, which we use as cost function for our quantum Boltzmann machine (QBM) with hidden units. Recall that $\rho$ is our input distribution, and $\sigma = \mathrm{Tr}_h\big(e^{-H}/\mathrm{Tr}\, e^{-H}\big)$ is the output distribution, where we perform the partial trace over the hidden units. With this, the cost function is hence given by

\[ O_\rho(H) = S\Big(\rho \,\Big|\, \mathrm{Tr}_h\big[e^{-H}/\mathrm{Tr}\big(e^{-H}\big)\big]\Big), \tag{5.4} \]

where $S(\rho|\sigma_v)$ is the quantum relative entropy as defined in Eq. 5.3. A few remarks are in order. First, note that it is possible to add a regularisation term in order to penalise undesired quantum correlations in the model [151]. Second, since we will only derive gradient-based algorithms for the training, we need to evaluate the gradient of the cost function, which requires the gradient of the quantum relative entropy.

As previously mentioned, this is possible to do in a closed-form expression for the case of an all-visible Boltzmann machine, which corresponds to dim(Hh) = 1. The gradient in this case takes the form

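\[ \frac{\partial O_\rho(H)}{\partial\theta} = -\mathrm{Tr}\left(\rho\,\frac{\partial}{\partial\theta}\log\sigma\right), \tag{5.5} \]

which can be simplified using $\log(\exp(-H)) = -H$ and Duhamel's formula to obtain the following equation for the gradient, denoting with $\partial_\theta := \partial/\partial\theta$,

\[ \mathrm{Tr}(\rho\,\partial_\theta H) - \mathrm{Tr}\big(e^{-H}\partial_\theta H\big)/\mathrm{Tr}\big(e^{-H}\big). \tag{5.6} \]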
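The closed-form gradient of Eq. (5.6) is easy to check numerically on a toy all-visible model. The sketch below uses two qubits with the term structure of Eq. (5.2) and compares against finite differences of the objective $\mathrm{Tr}(\rho H) + \log\mathrm{Tr}\,e^{-H}$, which equals the relative entropy up to the $\theta$-independent constant $\mathrm{Tr}(\rho\log\rho)$. The choice of terms, the function names and the dense classical evaluation are illustrative assumptions, not the quantum training procedure discussed in this chapter.

```python
import numpy as np

I2 = np.eye(2)
X = np.array([[0.0, 1.0], [1.0, 0.0]])
Z = np.array([[1.0, 0.0], [0.0, -1.0]])
# Hamiltonian terms of a two-qubit all-visible model in the style of Eq. (5.2).
TERMS = [np.kron(X, I2), np.kron(Z, I2), np.kron(I2, X), np.kron(I2, Z), np.kron(Z, Z)]


def hamiltonian(theta):
    """H(theta) = sum_k theta_k P_k."""
    return sum(th * P for th, P in zip(theta, TERMS))


def gibbs(H):
    """e^{-H} / Tr(e^{-H})."""
    w, V = np.linalg.eigh(H)
    e = np.exp(-(w - w.min()))
    return (V * (e / e.sum())) @ V.conj().T


def objective(theta, rho):
    """Tr(rho H) + log Tr e^{-H}, i.e. S(rho|sigma) without Tr(rho log rho)."""
    H = hamiltonian(theta)
    w = np.linalg.eigvalsh(H)
    log_Z = -w.min() + np.log(np.sum(np.exp(-(w - w.min()))))
    return float(np.real(np.trace(rho @ H))) + log_Z


def gradient(theta, rho):
    """Eq. (5.6) for each weight: Tr(rho P_k) - Tr(e^{-H} P_k) / Tr(e^{-H})."""
    sigma = gibbs(hamiltonian(theta))
    return np.array([np.real(np.trace(rho @ P) - np.trace(sigma @ P)) for P in TERMS])
```

When $\rho$ is itself the Gibbs state of `hamiltonian(theta)`, the two traces cancel and `gradient` returns zero, reflecting that the relative entropy is minimised at $\rho = \sigma$.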

However, this gradient formula no longer holds if we include hidden units. If we want to include hidden units, then we need to additionally trace out the hidden subsystem. Doing so results in the marginalised distribution from Eq. 5.1, which also changes the cost function into the form described in Eq. 5.4. Note that $H = H(\theta)$ depends on the training weights. We will adjust these in the training process with the goal of matching $\sigma$ to $\rho$, the target density matrix. Omitting the term $\mathrm{Tr}(\rho\log\rho)$ since it is a constant, we therefore obtain

\[ \frac{\partial O_\rho(H)}{\partial\theta} = -\mathrm{Tr}\left(\rho\,\frac{\partial}{\partial\theta}\log\sigma_v\right), \tag{5.7} \]

for the gradient of the objective function.

In the following sections, we present our two approaches for evaluating the gradient in Eq. 5.7. The first one is less general, but gives us an easily implementable algorithm with strong bounds. The second one, on the other hand, can be applied to arbitrary problem instances, and therefore presents a general-purpose gradient-based algorithm for training quantum Boltzmann machines with the relative entropy objective. The results are to be expected, since the no-free-lunch theorem suggests that no good bounds can be obtained without assumptions on the problem instance. Indeed, our general algorithm can in principle result in an exponentially worse complexity compared to the specialised one. However, we expect that for most practical applications this will not be the case, and our algorithm will hence be applicable for most cases of practical interest. To our knowledge, we hence present the first algorithm of this kind which is able to train arbitrary quantum Boltzmann machines efficiently.

5.3.1 Variational training for restricted Hamiltonians

Our first approach optimises a variational upper bound of the objective function, for a restricted but still practical setting. This approach results in a fast and easily implementable quantum algorithm which, however, is less general due to the input assumptions we are making. These assumptions are required in order to overcome challenges related to the evaluation of matrix functions, for which classical calculus fails. Concretely, we express the Hamiltonian in this case as

\[
H = H_v + H_h + H_{\mathrm{int}}, \tag{5.8}
\]
i.e., a decomposition of the Hamiltonian into a part acting on the visible layers, the hidden layers and a third interaction Hamiltonian that creates correlations between the two. In particular, we further assume for simplicity that there are two sets of operators $\{v_k\}$ and $\{h_k\}$ composed of $D = W_v + W_h + W_{\mathrm{int}}$ terms such that

\[
H_v = \sum_{k=1}^{W_v} \theta_k\, v_k \otimes I, \qquad
H_h = \sum_{k=W_v+1}^{W_v+W_h} \theta_k\, I \otimes h_k, \qquad
H_{\mathrm{int}} = \sum_{k=W_v+W_h+1}^{W_v+W_h+W_{\mathrm{int}}} \theta_k\, v_k \otimes h_k, \qquad
[h_k, h_j] = 0 \;\; \forall\, j,k, \tag{5.9}
\]
which implies that the Hamiltonian can in general be expressed as

\[
H = \sum_{k=1}^{D} \theta_k\, v_k \otimes h_k. \tag{5.10}
\]

We break up the Hamiltonian into this form to emphasise the qualitative difference between the types of terms that can appear in this model. Note that we generally assume throughout this article that $v_k, h_k$ are unitary operators, which is indeed given in most common instances.

By choosing this particular form of the Hamiltonian in (5.9), we want to force the non-commuting terms, i.e., terms for which it holds that the commutator $[v_k, h_k] \neq 0$, to act only on the visible units of the model. On the other hand, only commuting Hamiltonian terms act on the hidden register, and we can therefore express the eigenvalues and eigenvectors for the Hamiltonian as

\[
H\, |v_h\rangle \otimes |h\rangle = \lambda_{v_h,h}\, |v_h\rangle \otimes |h\rangle. \tag{5.11}
\]

Note that here both the conditional eigenvectors and eigenvalues for the visible subsystem are functions of the eigenvector $|h\rangle$ in the hidden register, and we hence denote these as $v_h$ and $\lambda_{v_h,h}$ respectively. This allows the hidden units to select between eigenbases to interpret the input data while also penalising portions of the accessible Hilbert space that are not supported by the training data. However, since the hidden units commute they cannot be used to construct a non-diagonal eigenbasis. This division between the visible and hidden layers helps us to build an intuition about the model and, more importantly, allows us to derive more efficient training algorithms that exploit this fact.

As previously mentioned, for the first result we rely on a variational bound of the objective function in order to train the quantum Boltzmann machine weights for a Hamiltonian $H$ of the form given in (5.10). We can express this variational bound compactly in terms of a thermal expectation against a fictitious thermal probability distribution. We define this expectation below.

Definition 6. Let $\tilde H_h = \sum_k \theta_k \mathrm{Tr}(\rho v_k)\, h_k$ be the Hamiltonian acting conditioned on the visible subspace only on the hidden subsystem of the Hamiltonian $H := \sum_k \theta_k\, v_k \otimes h_k$. Then we define the expectation value over the marginal distribution over the hidden variables $h$ as

\[
\mathbb{E}_h(\cdot) = \sum_h \frac{(\cdot)\, e^{-\mathrm{Tr}(\rho \tilde H_h)}}{\sum_h e^{-\mathrm{Tr}(\rho \tilde H_h)}}. \tag{5.12}
\]

Using this we derive an upper bound on S in section B.1 of the Appendix, which leads to the following lemma.

Lemma 16. Assume that the Hamiltonian $H$ of the quantum Boltzmann machine takes the form described in eq. 5.10, where $\theta_k$ are the parameters which determine the interaction strength and $v_k, h_k$ are unitary operators. Furthermore, let $h_k |h\rangle = E_{h,k} |h\rangle$ be the eigenvalues of the hidden subsystem, and $\mathbb{E}_h(\cdot)$ as given by Definition 6, i.e., the expectation value over the effective Boltzmann distribution of the visible layer with $\tilde H_h := \sum_k E_{h,k}\theta_k v_k$. Then, a variational upper bound $\tilde S$ of the objective function, meaning that $\tilde S(\rho|H) \geq S(\rho|e^{-H}/Z)$, is given by

\[
\tilde S(\rho|H) := \mathrm{Tr}(\rho\log\rho) + \mathrm{Tr}\left(\rho \sum_k \mathbb{E}_h\left[E_{h,k}\theta_k v_k\right]\right) + \mathbb{E}_h\left[\log\alpha_h\right] + \log Z, \tag{5.13}
\]
where
\[
\alpha_h = \frac{e^{-\mathrm{Tr}(\rho \tilde H_h)}}{\sum_h e^{-\mathrm{Tr}(\rho \tilde H_h)}}
\]
is the corresponding Gibbs distribution for the visible units.
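The statement of Lemma 16 can be checked numerically on a very small instance. The sketch below is only an illustration under stated assumptions: a hypothetical toy model with one visible and one hidden qubit, the terms $X\otimes I$, $Z\otimes I$, $I\otimes Z$, $Z\otimes Z$, and a random full-rank target state. It computes the exact relative entropy by tracing out the hidden qubit and compares it with the bound of Eq. (5.13).

```python
import numpy as np

def herm_fn(A, f):
    """Apply the scalar function f to the Hermitian matrix A."""
    w, V = np.linalg.eigh(A)
    return (V * f(w)) @ V.conj().T

X = np.array([[0, 1], [1, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)
I2 = np.eye(2, dtype=complex)

# Hypothetical toy model: terms X(x)I, Z(x)I, I(x)Z, Z(x)Z.
# E[k, h] is the eigenvalue E_{h,k} of the diagonal operator h_k on |h>.
V_OPS = [X, Z, I2, Z]
E = np.array([[1, 1], [1, 1], [1, -1], [1, -1]], dtype=float)

def exact_and_bound(rho, theta):
    """Return (S, S_tilde): exact relative entropy and the bound of Eq. (5.13)."""
    n_terms, n_hidden = E.shape
    # Conditional visible Hamiltonians sum_k E_{h,k} theta_k v_k, one per hidden state.
    h_cond = [sum(theta[k] * E[k, h] * V_OPS[k] for k in range(n_terms))
              for h in range(n_hidden)]
    boltz = [herm_fn(-Hh, np.exp) for Hh in h_cond]
    z = sum(np.trace(B).real for B in boltz)         # partition function of H
    sigma_v = sum(boltz) / z                          # hidden qubit traced out
    entropy_term = np.trace(rho @ herm_fn(rho, np.log)).real
    s_exact = entropy_term - np.trace(rho @ herm_fn(sigma_v, np.log)).real
    c = np.array([np.trace(rho @ Hh).real for Hh in h_cond])
    alpha = np.exp(-c) / np.exp(-c).sum()             # distribution alpha_h
    s_bound = entropy_term + alpha @ c + alpha @ np.log(alpha) + np.log(z)
    return s_exact, s_bound

rng = np.random.default_rng(0)
M = rng.normal(size=(2, 2)) + 1j * rng.normal(size=(2, 2))
rho = M @ M.conj().T + 0.1 * np.eye(2)
rho /= np.trace(rho).real                             # random full-rank target state
s_exact, s_bound = exact_and_bound(rho, np.array([0.3, -0.7, 0.5, 0.9]))
```

In this toy model the bound is expected to be tight when the terms involving the hidden qubit are switched off, since all conditional Hamiltonians then coincide.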

The proof that (5.13) is a variational bound proceeds in two steps. First, we note that for any probability distribution αh

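\[
\mathrm{Tr}\left(\rho \log\left(\sum_{h=1}^{N} e^{-\sum_k E_{h,k}\theta_k v_k}\right)\right)
= \mathrm{Tr}\left(\rho \log\left(\sum_{h=1}^{N} \alpha_h\, \frac{e^{-\sum_k E_{h,k}\theta_k v_k}/\alpha_h}{\sum_{h'} \alpha_{h'}}\right)\right). \tag{5.14}
\]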

We then apply Jensen's inequality and minimise the result over all $\alpha_h$. This not only verifies that $\tilde S(\rho|H) \geq S(\rho|H)$ but also yields a variational bound. The details of the proof can be found in Equation B.5 in section B.1 of the appendix.

Using the above assumptions we derive the gradient of the variational upper bound of the relative entropy in Section B.2 of the Appendix. We summarise the result in Lemma 17.

Lemma 17. Assume that the Hamiltonian $H$ of the quantum Boltzmann machine takes the form described in eq. 5.10, where $\theta_k$ are the parameters which determine the interaction strength and $v_k, h_k$ are unitary operators. Furthermore, let $h_k |h\rangle = E_{h,k} |h\rangle$ be the eigenvalues of the hidden subsystem, and $\mathbb{E}_h(\cdot)$ as given by Definition 6, i.e., the expectation value over the effective Boltzmann distribution of the visible layer with $\tilde H_h := \sum_k E_{h,k}\theta_k v_k$. Then, the derivatives of $\tilde S$ with respect to the parameters of the Boltzmann machine are given by

\[
\frac{\partial \tilde S(\rho|H)}{\partial\theta_p} = \mathbb{E}_h\left[\mathrm{Tr}\left(\rho\, E_{h,p} v_p\right)\right] - \mathrm{Tr}\left(\frac{\partial H}{\partial\theta_p}\,\frac{e^{-H}}{Z}\right). \tag{5.15}
\]
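To illustrate Eq. (5.15), the following sketch evaluates the two terms of the gradient for a hypothetical toy model (one visible and one hidden qubit, terms $X\otimes I$, $Z\otimes I$, $I\otimes Z$, $Z\otimes Z$) and compares them with a central finite difference of the bound $\tilde S$. The model, the random target state and the function names are assumptions made for this example.

```python
import numpy as np

def expm_herm(A):
    """Matrix exponential of a Hermitian matrix."""
    w, V = np.linalg.eigh(A)
    return (V * np.exp(w)) @ V.conj().T

X = np.array([[0, 1], [1, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)
I2 = np.eye(2, dtype=complex)

# Hypothetical toy model; E[k, h] is the eigenvalue E_{h,k} of h_k on |h>.
V_OPS = [X, Z, I2, Z]
E = np.array([[1, 1], [1, 1], [1, -1], [1, -1]], dtype=float)

def model_quantities(rho, theta):
    """Return alpha_h, e^{-H_h}, Z and c_h = Tr(rho H_h) for each hidden state h."""
    n_terms, n_hidden = E.shape
    h_cond = [sum(theta[k] * E[k, h] * V_OPS[k] for k in range(n_terms))
              for h in range(n_hidden)]
    boltz = [expm_herm(-Hh) for Hh in h_cond]
    z = sum(np.trace(B).real for B in boltz)
    c = np.array([np.trace(rho @ Hh).real for Hh in h_cond])
    alpha = np.exp(-c) / np.exp(-c).sum()
    return alpha, boltz, z, c

def s_tilde(rho, theta):
    """Theta-dependent part of Eq. (5.13); the constant Tr(rho log rho) is dropped."""
    alpha, _, z, c = model_quantities(rho, theta)
    return alpha @ c + alpha @ np.log(alpha) + np.log(z)

def grad_s_tilde(rho, theta):
    """Gradient of the bound according to Eq. (5.15)."""
    alpha, boltz, z, _ = model_quantities(rho, theta)
    grad = np.zeros(len(theta))
    for p, v_p in enumerate(V_OPS):
        # First term: E_h[ Tr(rho E_{h,p} v_p) ].
        first = (alpha @ E[p]) * np.trace(rho @ v_p).real
        # Second term: Tr( (v_p (x) h_p) e^{-H} ) / Z, using the block structure of H.
        second = sum(E[p, h] * np.trace(v_p @ boltz[h]).real
                     for h in range(E.shape[1])) / z
        grad[p] = first - second
    return grad

rng = np.random.default_rng(1)
M = rng.normal(size=(2, 2)) + 1j * rng.normal(size=(2, 2))
rho = M @ M.conj().T + 0.1 * np.eye(2)
rho /= np.trace(rho).real
theta = np.array([0.3, -0.7, 0.5, 0.9])

analytic = grad_s_tilde(rho, theta)
eps = 1e-6
numeric = np.array([
    (s_tilde(rho, theta + eps * np.eye(4)[p]) - s_tilde(rho, theta - eps * np.eye(4)[p]))
    / (2 * eps) for p in range(4)
])
```

The dependence of $\alpha_h$ on $\theta$ does not contribute to the derivative because $\alpha_h$ minimises the bound, which is why the finite difference is expected to match the two explicit terms.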

As a sanity check of our results, we can consider the case of no interactions between the visible and the hidden layer. Doing so, we observe that the gradient above reduces to the case of the visible Boltzmann machine, which was treated in [151], resulting in the gradient

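\[
\mathrm{Tr}\left(\rho\,\partial_{\theta_p} H\right) - \mathrm{Tr}\left(\frac{e^{-H}}{Z}\,\partial_{\theta_p} H\right), \tag{5.16}
\]
under our assumption on the form of $H$, $\partial_{\theta_p} H = v_p$.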

From Lemma 17, we know the form of the derivatives of the relative entropy w.r.t. any parameter $\theta_p$ via Eq. 5.15. To understand the complexity of evaluating this gradient, we approach each term separately. The second term can easily be evaluated by preparing the Gibbs state $\sigma_{\mathrm{Gibbs}} := e^{-H}/Z$, and then evaluating the expectation value of the operator $\partial_{\theta_j} H$ w.r.t. this Gibbs state. In practice we can do so using amplitude estimation for the Hadamard test [174], which is a standard procedure that we describe in algorithm 8 in section A.2 of the appendix. The computational complexity of this procedure is easy to evaluate. If $T_{\mathrm{Gibbs}}$ is the query complexity for the Gibbs state preparation, the query complexity of the whole algorithm including the phase estimation step is then given by $O(T_{\mathrm{Gibbs}}/\varepsilon)$ for an $\varepsilon$-accurate estimate.

Next, we derive an algorithm to evaluate the first term, which requires a more involved process. For this, note first that we can evaluate each term $\mathrm{Tr}(\rho v_k)$ independently from $\mathbb{E}_h[E_{h,p}]$, and individually for all $k \in [D]$, i.e., all $D$ dimensions of the gradient. This can be done via the Hadamard test for $v_k$ which we recapitulate in section A.2 of the appendix, assuming $v_k$ is unitary. More generally, for non-unitary $v_k$ we could evaluate this term using a linear combination of unitary operations. Therefore, the remaining task is to evaluate the terms $\mathbb{E}_h[E_{h,p}]$ in (5.15), which reduces to sampling elements according to the distribution $\{\alpha_h\}$, recalling that $h_p$ applied to the subsystem has eigenvalues $E_{h,p}$. For this we need to be able to create a Gibbs distribution for the effective Hamiltonian $\tilde H_h = \sum_k \theta_k \mathrm{Tr}(\rho v_k)\, h_k$ which contains only $D$ terms and can hence be evaluated efficiently as long as $D$ is small, which we can generally assume to be true. In order to sample according to the distribution $\{\alpha_h\}$, we first evaluate the factors $\theta_k \mathrm{Tr}(\rho v_k)$ in the sum over $k$ via the Hadamard test, and then use these in order to implement the Gibbs distribution $\exp(-\tilde H_h)/\tilde Z$ for the Hamiltonian

\[
\tilde H_h = \sum_k \theta_k \mathrm{Tr}(\rho v_k)\, h_k.
\]

The algorithm is summarised in Algorithm 5. It is built on three main subroutines. The first one is Gibbs state preparation, a known routine which we recapitulate in Theorem 15 in the appendix. The two remaining routines are the Hadamard test and amplitude estimation, both of which are well-established quantum algorithms. The Hadamard test allows us to estimate the probability of the outcome, which is concretely given by

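\[
\Pr(0) = \frac{1}{2}\left(1 + \mathrm{Re}\,\langle\psi|_{\mathrm{Gibbs}}\,(h_k \otimes I)\,|\psi\rangle_{\mathrm{Gibbs}}\right) = \frac{1}{2}\left(1 + \sum_h \frac{e^{-E_h} E_{h,k}}{Z}\right), \tag{5.18}
\]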

i.e., from $\Pr(0)$ we can easily infer the estimate of $\mathbb{E}_h[E_{h,k}]$ up to precision $\varepsilon$ for all the $k$ terms, since the last part is equivalent to $\frac{1}{2}\left(1 + \mathbb{E}_h[E_{h,k}]\right)$. To speed up the evaluation of the probability $\Pr(0)$, we use amplitude estimation. We recapitulate this procedure in detail in the supplemental material in section A.1. In this case, we let $P := -2|0\rangle\langle 0| + I$ be the reflector, where $I$ is the identity, so that $P$ is just the Pauli $z$ matrix up to a global phase, and let $G := (2|\phi\rangle\langle\phi| - I)(P \otimes I)$, for $|\phi\rangle$ being the state after the Hadamard test prior to the measurement. The operator $G$ then has the eigenvalues $\mu_\pm = \pm e^{\pm i 2\theta}$, where $2\theta = \arcsin\sqrt{\Pr(0)}$, and $\Pr(0)$ is the probability to measure the ancilla qubit in the $|0\rangle$ state. Let now $T_{\mathrm{Gibbs}}$ be the query complexity for preparing the purified Gibbs state (cf. eq. (B.17) in the appendix). We can then perform phase estimation with precision $\varepsilon$ for the operator $G$ requiring $O(T_{\mathrm{Gibbs}}/\tilde\varepsilon)$ queries to the oracle of $H$.

Algorithm 5 Variational gradient estimation - term 1
Input: An upper bound $\tilde S(\rho|H)$ on the quantum relative entropy, density matrix $\rho \in \mathbb{C}^{2^n \times 2^n}$, and Hamiltonian $H \in \mathbb{C}^{2^n \times 2^n}$.
Output: Estimate $\mathcal{S}$ of the gradient $\nabla\tilde S$ which fulfills Thm. 12.
1. Use Gibbs state preparation to create the Gibbs distribution for the effective Hamiltonian $\tilde H_h = \sum_k \theta_k \mathrm{Tr}(\rho v_k)\, h_k$ with sparsity $d$.
2. Prepare a Hadamard test state, i.e., prepare an ancilla qubit in the $|+\rangle$-state and apply a controlled-$h_k$ conditioned on the ancilla register, followed by a Hadamard gate, i.e.,
\[
|\phi\rangle := \frac{1}{2}\Big(|0\rangle\big(|\psi\rangle_{\mathrm{Gibbs}} + (h_k \otimes I)|\psi\rangle_{\mathrm{Gibbs}}\big) + |1\rangle\big(|\psi\rangle_{\mathrm{Gibbs}} - (h_k \otimes I)|\psi\rangle_{\mathrm{Gibbs}}\big)\Big), \tag{5.17}
\]
where $|\psi\rangle_{\mathrm{Gibbs}} := \sum_h \frac{e^{-E_h/2}}{\sqrt{Z}}\, |h\rangle_A |\phi_h\rangle_B$ is the purified Gibbs state.
3. Perform amplitude estimation on the $|0\rangle$ state. For this we need to implement the amplitude estimation with reflector $P := -2|0\rangle\langle 0| + I$, and operator $G := (2|\phi\rangle\langle\phi| - I)(P \otimes I)$.
4. Measure the phase estimation register, which returns an $\tilde\varepsilon$-estimate of the probability $\frac{1}{2}\left(1 + \mathbb{E}_h[E_{h,k}]\right)$ of the Hadamard test to return 0.
5. Repeat the procedure for all $D$ terms and output the first term of $\nabla\tilde S$.

In section B.2.1 of the appendix we analyse the runtime and error of the above algorithm. The result is summarised in Theorem 12.

Theorem 12. Assume that the Hamiltonian $H$ of the quantum Boltzmann machine takes the form described in eq. 5.10, where $\theta_k$ are the parameters which determine the interaction strength and $v_k, h_k$ are unitary operators. Furthermore, let $h_k |h\rangle = E_{h,k} |h\rangle$ be the eigenvalues of the hidden subsystem, and $\mathbb{E}_h(\cdot)$ as given by Definition 6, i.e., the expectation value over the effective Boltzmann distribution of the visible layer with $\tilde H_h := \sum_k E_{h,k}\theta_k v_k$, and suppose that $I \preceq \tilde H_h$ with bounded spectral norm $\|\tilde H_h(\theta)\| \leq \|\theta\|_1$, and let $\tilde H_h$ be $d$-sparse. Then $\mathcal{S} \in \mathbb{R}^D$ can be computed for any $\varepsilon \in (0, \max\{1/3,\, 4\max_{h,p} |E_{h,p}|\})$ such that

\[
\left\|\mathcal{S} - \nabla\tilde S\right\|_{\max} \leq \varepsilon, \tag{5.19}
\]
with
\[
\tilde O\left(\sqrt{\xi}\,\frac{D \|\theta\|_1\, d\, n^2}{\varepsilon}\right), \tag{5.20}
\]
queries to the oracle $O_H$ and $O_\rho$ with probability at least $2/3$, where $\|\theta\|_1$ is the sum of absolute values of the parameters of the Hamiltonian, $\xi := \max[N/z, N_h/z_h]$, $N = 2^n$, $N_h = 2^{n_h}$, and $z, z_h$ are known lower bounds on the partition functions for the Gibbs state of $H$ and $\tilde H_h$ respectively.

Theorem 12 shows that the computational complexity of estimating the gradient grows the closer we get to a pure state, since for a pure state the inverse temperature $\beta \to \infty$, and therefore the norm $\|H(\theta)\| \to \infty$, as the Hamiltonian depends on the parameters, and hence on the type of state we describe. In such cases we typically would rely on alternative techniques. However, this cannot be generically improved because otherwise we would be able to find minimum energy configurations using a number of queries in $o(\sqrt{N})$, which would violate lower bounds for Grover's search. Therefore more precise statements of the complexity will require further restrictions on the classes of problem Hamiltonians to avoid lower bounds imposed by Grover's search and similar algorithms.
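The Hadamard-test probability of Eq. (5.18) can be reproduced with a small state-vector simulation. The sketch below is an illustration only: the energies $E_h$, the choice $|\phi_h\rangle_B = |h\rangle_B$ for the purification, and the operator $h_k$ (a Pauli $Z$ on the first of two hidden qubits) are hypothetical choices, and amplitude estimation is replaced by a direct evaluation of the outcome probability.

```python
import numpy as np

def hadamard_test_prob0(psi, U):
    """Probability of measuring the ancilla in |0> after a Hadamard test of U on |psi>."""
    hadamard = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
    dim = len(psi)
    # Ancilla (first tensor factor) in |+>, system register in |psi>.
    state = np.kron(hadamard @ np.array([1.0, 0.0]), psi)
    # Controlled-U: |0><0| (x) I + |1><1| (x) U.
    controlled_u = np.block([[np.eye(dim), np.zeros((dim, dim))],
                             [np.zeros((dim, dim)), U]])
    state = np.kron(hadamard, np.eye(dim)) @ (controlled_u @ state)
    # The first `dim` amplitudes belong to ancilla outcome 0.
    return float(np.sum(np.abs(state[:dim]) ** 2))

# Hypothetical hidden-layer energies E_h for two hidden qubits (four basis states).
energies = np.array([0.2, -0.4, 1.1, 0.5])
probs = np.exp(-energies) / np.exp(-energies).sum()   # Gibbs weights e^{-E_h} / Z
n_hidden = len(energies)

# Purified Gibbs state sum_h sqrt(p_h) |h>_A |h>_B (assumed form of |phi_h>_B).
psi_gibbs = np.zeros(n_hidden * n_hidden)
for h in range(n_hidden):
    psi_gibbs[h * n_hidden + h] = np.sqrt(probs[h])

# h_k = Z on the first hidden qubit, i.e. eigenvalues E_{h,k} = (+1, +1, -1, -1).
e_hk = np.array([1.0, 1.0, -1.0, -1.0])
U = np.kron(np.diag(e_hk), np.eye(n_hidden))          # h_k (x) I on registers A, B

prob0 = hadamard_test_prob0(psi_gibbs, U)
expectation = 2 * prob0 - 1                           # estimate of E_h[E_{h,k}]
```

Here `expectation` recovers the thermal average $\sum_h e^{-E_h} E_{h,k}/Z$ exactly, whereas on a quantum device it would only be available up to the precision of the amplitude estimation step.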

5.3.2 Gradient based training for general Hamiltonians

While our first scheme for training quantum Boltzmann machines is only applicable in case of a restricted Hamiltonian, the second scheme, which we will present next, holds for arbitrary Hamiltonians. In order to calculate the gradient, we use higher order divided difference estimates for the relative entropy objective based on function approximation schemes. For this we generate differentiation formulas by differentiating an interpolant. The main ideas are simple: First we construct an interpolating polynomial from the data. Second, an approximation of the derivative at any point is obtained via a direct differentiation of the interpolant.

Concretely we perform the following steps.

We first approximate the logarithm via a Fourier-like approximation, i.e., $\log\sigma_v \to \log_{K,M}\sigma_v$, where the subscripts $K, M$ indicate the level of truncation similar to [175]. This will yield a Fourier-like series in terms of $\sigma_v$, i.e., $\sum_m c_m \exp(i m \pi \sigma_v)$. Next, we need to evaluate the gradient of the function

\[
\mathrm{Tr}\left(\rho\,\frac{\partial}{\partial\theta}\log_{K,M}(\sigma_v)\right).
\]

Taking the derivative yields many terms of the form

\[
\int_0^1 ds\; e^{i s m \pi \sigma_v}\,\frac{\partial\sigma_v}{\partial\theta}\, e^{i(1-s) m \pi \sigma_v}, \tag{5.21}
\]
as a result of Duhamel's formula for the derivative of exponentials of operators (cf. Sec. 2 in the mathematical preliminaries). Each term in this expansion can furthermore be evaluated separately via a sampling procedure, since the terms in Eq. 5.21 can be approximated by

\[
\mathbb{E}_s\left[e^{i s m \pi \sigma_v}\,\frac{\partial\sigma_v}{\partial\theta}\, e^{i(1-s) m \pi \sigma_v}\right].
\]

Furthermore, since we only have a logarithmic number of terms, we can combine the results of the individual terms via classical postprocessing once we have evaluated the trace.
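Duhamel's formula behind Eq. (5.21) can be verified numerically on a small example. The sketch below assumes a hypothetical linear family $\sigma(\theta) = \sigma_0 + \theta\,\partial\sigma$ of Hermitian matrices and checks that $\partial_\theta e^{i m \pi \sigma(\theta)}$ equals $i m \pi$ times the integral in Eq. (5.21). For the check, the average over $s$ is computed with Gauss-Legendre quadrature rather than by random sampling of $s$, which is a substitution made for this example.

```python
import numpy as np

def exp_i_herm(A, c):
    """Return exp(i * c * A) for a Hermitian matrix A and a real scalar c."""
    w, V = np.linalg.eigh(A)
    return (V * np.exp(1j * c * w)) @ V.conj().T

rng = np.random.default_rng(1)

def random_hermitian(n, scale):
    M = rng.normal(size=(n, n)) + 1j * rng.normal(size=(n, n))
    return scale * (M + M.conj().T) / 2

# Hypothetical test family sigma(theta) = sigma0 + theta * dsigma.
sigma0 = random_hermitian(4, 0.1)
dsigma = random_hermitian(4, 0.1)      # plays the role of d sigma_v / d theta
m = 3                                  # Fourier index

def sigma(theta):
    return sigma0 + theta * dsigma

def duhamel_derivative(theta, n_nodes=20):
    """i*m*pi * int_0^1 ds e^{i s m pi sigma} dsigma e^{i (1-s) m pi sigma}."""
    x, w = np.polynomial.legendre.leggauss(n_nodes)
    s_nodes, weights = (x + 1) / 2, w / 2          # map [-1, 1] to [0, 1]
    S = sigma(theta)
    total = np.zeros_like(S, dtype=complex)
    for s, wt in zip(s_nodes, weights):
        total += wt * (exp_i_herm(S, s * m * np.pi) @ dsigma
                       @ exp_i_herm(S, (1 - s) * m * np.pi))
    return 1j * m * np.pi * total

def finite_difference(theta, eps=1e-6):
    """Central finite difference of exp(i m pi sigma(theta)), used as a cross-check."""
    plus = exp_i_herm(sigma(theta + eps), m * np.pi)
    minus = exp_i_herm(sigma(theta - eps), m * np.pi)
    return (plus - minus) / (2 * eps)

theta0 = 0.4
via_duhamel = duhamel_derivative(theta0)
via_fd = finite_difference(theta0)
```

Replacing the quadrature by an average over uniformly drawn $s \in [0,1]$ would give a Monte Carlo estimate of the same integral, at the price of a statistical error.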

Now, we apply a divided difference scheme to approximate the gradient term $\partial\sigma_v/\partial\theta$, which results in an interpolation polynomial $L_{\mu,j}$ of order $l$ (for $l$ being the number of points at which we evaluate the function) in $\sigma_v$ which we can efficiently evaluate.
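The idea of differentiating an interpolant can be illustrated on a scalar function. The following sketch computes the derivatives $L'_{\mu,j}(\theta)$ of the Lagrange basis polynomials and uses them as weights for function values at the interpolation nodes. The use of Chebyshev nodes, the interval width and the test function $\sin$ are assumptions chosen for this example; in the setting above the function values would be replaced by $\sigma_v(\theta_j)$.

```python
import numpy as np

def lagrange_derivative_weights(nodes, x):
    """Return L'_j(x) for every Lagrange basis polynomial L_j on the given nodes."""
    n = len(nodes)
    weights = np.zeros(n)
    for j in range(n):
        total = 0.0
        for i in range(n):
            if i == j:
                continue
            # Product rule: differentiate the factor belonging to node i.
            prod = 1.0 / (nodes[j] - nodes[i])
            for l in range(n):
                if l != i and l != j:
                    prod *= (x - nodes[l]) / (nodes[j] - nodes[l])
            total += prod
        weights[j] = total
    return weights

def chebyshev_nodes(center, half_width, mu):
    """mu + 1 Chebyshev nodes on [center - half_width, center + half_width]."""
    j = np.arange(mu + 1)
    return center + half_width * np.cos((2 * j + 1) * np.pi / (2 * (mu + 1)))

theta0 = 0.3
nodes = chebyshev_nodes(theta0, half_width=0.5, mu=6)
weights = lagrange_derivative_weights(nodes, theta0)

# f'(theta0) is approximated by sum_j f(theta_j) * L'_j(theta0).
approx_derivative = weights @ np.sin(nodes)
exact_derivative = np.cos(theta0)
```

Because the weights depend only on the nodes and the evaluation point, they can be precomputed classically and reused for every matrix-valued function evaluation.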

However, evaluating these terms is still not trivial. The final step hence consists of implementing a routine which allows us to evaluate these terms on a quantum device. In order to do so, we once again make use of the Fourier series approach. This time we take the simple idea of approximating the density operator $\sigma_v$ by the series of itself, i.e., $\sigma_v \approx F(\sigma_v) := \sum_{m'} c_{m'} \exp(i m' \pi \sigma_v)$, which we can implement conveniently via sample based Hamiltonian simulation [13, 57]. Following these steps we obtain the expression in Eq. B.57. The real part of

\[
\sum_{m=-M_1}^{M_1} \sum_{m'=-M_2}^{M_2} \sum_{j=0}^{\mu} \frac{i\, c_m \tilde c_{m'}\, m \pi}{2}\, L'_{\mu,j}(\theta)\; \mathbb{E}_{s\in[0,1]}\left[\mathrm{Tr}\left(\rho\, e^{\frac{i s \pi m}{2}\sigma_v}\, e^{\frac{i \pi m'}{2}\sigma_v(\theta_j)}\, e^{\frac{i (1-s) \pi m}{2}\sigma_v}\right)\right] \tag{5.22}
\]
then approximates $\partial_\theta \mathrm{Tr}(\rho\log\sigma_v)$ with at most $\varepsilon$ error, where $L'_{\mu,j}$ is the derivative of the interpolation polynomial which we obtain using divided differences, and $\{c_i\}_i, \{\tilde c_j\}_j$ are coefficients of the approximation polynomials, which can efficiently be evaluated classically. We can evaluate each term in the sum separately and then combine the results via classical post-processing, i.e., we use the quantum computer only to evaluate the terms containing the trace.

The main challenge for the algorithmic evaluation is hence to compute the terms

\[
\mathrm{Tr}\left(\rho\, e^{\frac{i s \pi m}{2}\sigma_v}\, e^{\frac{i \pi m'}{2}\sigma_v(\theta_j)}\, e^{\frac{i (1-s) \pi m}{2}\sigma_v}\right). \tag{5.23}
\]

Evaluating this expression is done through Algorithm 6, which relies on two established subroutines, namely sample based Hamiltonian simulation [13, 57], and the Hadamard test which we discussed earlier. Note that the sample based Hamiltonian simulation approach introduces an additional $\varepsilon_h$-error in trace norm, which we also need to take into account in the analysis.

Algorithm 6 Gradient estimation via series approximations
Input: Density matrices $\rho \in \mathbb{C}^{2^n \times 2^n}$ and $\sigma_v \in \mathbb{C}^{2^{n_v} \times 2^{n_v}}$, precalculated parameters $K, M$ and Fourier-like series for the gradient as described in eq. B.57.
Output: Estimate $\mathcal{G}$ of the gradient $\nabla_\theta \mathrm{Tr}(\rho\log\sigma_v)$ with guarantees in Thm. 13.
1. Prepare the $|+\rangle \otimes \rho$ state for the Hadamard test.
2. Conditionally on the first qubit apply sample based Hamiltonian simulation to $\rho$, i.e., for $U := e^{\frac{i s \pi m}{2}\sigma_v}\, e^{\frac{i \pi m'}{2}\sigma_v(\theta_j)}\, e^{\frac{i (1-s) \pi m}{2}\sigma_v}$, apply $|0\rangle\langle 0| \otimes I + |1\rangle\langle 1| \otimes U$.
3. Apply another Hadamard gate to the first qubit.
4. Repeat the above procedure, measure the final state each time, and return the averaged output.

In section B.3.1 of the appendix we derive the following guarantees for Algorithm 6.

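Theorem 13. Let $\rho, \sigma_v$ be two density matrices with $\|\sigma_v\| < 1/\pi$, and suppose we have access to an oracle $O_H$ that computes the locations of non-zero matrix elements in each row and their values for the $d$-sparse Hamiltonian $H(\theta)$ (as per [80]) and an oracle $O_\rho$ which returns copies of a purified density matrix of the data $\rho$, and let $\varepsilon \in (0, 1/6)$ be an error parameter. With probability at least $2/3$ we can obtain an estimate $\mathcal{G}$ of the gradient w.r.t. $\theta \in \mathbb{R}^D$ of the relative entropy $\nabla_\theta \mathrm{Tr}(\rho\log\sigma_v)$ such that

\[
\left\|\nabla_\theta \mathrm{Tr}(\rho\log\sigma_v) - \mathcal{G}\right\|_{\max} \leq \varepsilon, \tag{5.24}
\]
with

\[
\tilde O\left(\sqrt{\frac{N}{z}}\,\frac{D \|H(\theta)\|\, d\, \mu^5 \gamma}{\varepsilon^3}\right), \tag{5.25}
\]
queries to $O_H$ and $O_\rho$, where $\mu \in O(n_h + \log(1/\varepsilon))$, $\|\partial_\theta \sigma_v\| \leq e^{\gamma}$, $\|\sigma_v\| \geq 2^{-n_v}$ for $n_v$ being the number of visible units and $n_h$ being the number of hidden units, and $\tilde O(\mathrm{poly}(\gamma, n_v, n_h, \log(1/\varepsilon)))$ classical precomputation.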

In order to obtain the bounds in Theorem 13, we decompose the total error into the errors that we incur at each step of the approximation scheme,

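\[
\begin{aligned}
\left|\partial_\theta \mathrm{Tr}(\rho\log\sigma_v) - \partial_\theta \mathrm{Tr}\left(\rho \log^{s}_{K_1,M_1}\tilde\sigma_v\right)\right|
&\leq \sum_i \sigma_i(\rho) \cdot \left\|\partial_\theta\left[\log\sigma_v - \log^{s}_{K_1,M_1}\tilde\sigma_v\right]\right\| \\
&\leq \sum_i \sigma_i(\rho) \cdot \Big( \left\|\partial_\theta\left[\log\sigma_v - \log_{K_1,M_1}\sigma_v\right]\right\| \\
&\qquad + \left\|\partial_\theta\left[\log_{K_1,M_1}\sigma_v - \log_{K_1,M_1}\tilde\sigma_v\right]\right\|
+ \left\|\partial_\theta\left[\log_{K_1,M_1}\tilde\sigma_v - \log^{s}_{K_1,M_1}\tilde\sigma_v\right]\right\| \Big).
\end{aligned} \tag{5.26}
\]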

Bounding each term separately and adjusting the parameters to obtain an overall error of $\varepsilon$ then allows us to obtain the above result. We are hence able to use this procedure to efficiently obtain gradient estimates for a QBM with hidden units, while making minimal assumptions on the input data.
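The quantity in Eq. (5.23) that the Hadamard test of Algorithm 6 estimates can be illustrated with a small density-matrix simulation. The sketch below is a classical emulation under simplifying assumptions: $\rho$, $\sigma_v$ and $\sigma_v(\theta_j)$ are random density matrices of equal dimension, the values of $m$, $m'$ and $s$ are arbitrary, and the exponentials are computed exactly instead of by sample based Hamiltonian simulation.

```python
import numpy as np

def exp_i_herm(A, c):
    """Return exp(i * c * A) for a Hermitian matrix A and a real scalar c."""
    w, V = np.linalg.eigh(A)
    return (V * np.exp(1j * c * w)) @ V.conj().T

def random_density_matrix(dim, rng):
    M = rng.normal(size=(dim, dim)) + 1j * rng.normal(size=(dim, dim))
    rho = M @ M.conj().T
    return rho / np.trace(rho).real

def hadamard_test_prob0(rho, U):
    """Probability of ancilla outcome 0 for the Hadamard test of U on the state rho."""
    dim = rho.shape[0]
    hadamard = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
    plus = np.full((2, 2), 0.5)                       # |+><+| on the ancilla
    controlled_u = np.block([[np.eye(dim), np.zeros((dim, dim))],
                             [np.zeros((dim, dim)), U]])
    circuit = np.kron(hadamard, np.eye(dim)) @ controlled_u
    final = circuit @ np.kron(plus, rho) @ circuit.conj().T
    # The top-left block is the unnormalised state for ancilla outcome 0.
    return np.trace(final[:dim, :dim]).real

rng = np.random.default_rng(3)
rho = random_density_matrix(4, rng)
sigma_v = random_density_matrix(4, rng)
sigma_v_shifted = random_density_matrix(4, rng)       # stands in for sigma_v(theta_j)
m, m_prime, s = 2, 1, 0.37                            # hypothetical indices and sample of s

U = (exp_i_herm(sigma_v, s * np.pi * m / 2)
     @ exp_i_herm(sigma_v_shifted, np.pi * m_prime / 2)
     @ exp_i_herm(sigma_v, (1 - s) * np.pi * m / 2))

prob0 = hadamard_test_prob0(rho, U)
real_part = 2 * prob0 - 1                             # equals Re Tr(rho U)
```

On hardware, `prob0` would be estimated from repeated measurements of the ancilla, and averaging over several samples of $s$ would approximate the expectation over $s$ in Eq. (5.22).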

5.4 Conclusion

Generative models may play a crucial role in making quantum machine learning algorithms practical, as they yield concise models for complex quantum states that have no known a priori structure. In this chapter, we solved an outstanding problem in the field of generative quantum modelling: We overcame the previous inability to train quantum generative models with a quantum relative entropy objective for a QBM with hidden units. In particular, the inability to train models with hidden units was a substantial drawback.

Our results show that, given an efficient subroutine for preparing Gibbs states and an efficient algorithm for computing the matrix elements of the Hamiltonian, one can efficiently train a quantum Boltzmann machine with the quantum relative entropy.

While a number of problems remain with our algorithms, namely (1) that we have not given a lower bound for the query complexity, and hence cannot answer the question whether a linear scaling in $\|\theta\|_1$ is optimal, and (2) that our results are only efficient if the complexity of performing the Gibbs state preparation is low, we still show a first efficient algorithm for training such generative quantum models. This is important, as such models might in the future open up avenues for fast state preparation, and could therefore overcome some of the drawbacks related to the data access, since the model presented here could also be trained in an online manner, i.e., whenever a data point arrives it could be used to evaluate and update the gradient. Once the model is trained, we could then use it to generate states from the input distribution and use these in a larger QML routine. With further optimisation, it is our hope that quantum Boltzmann machines may not be just a theoretical tool that can one day be used to model quantum states, but that they can be used to show an early use of quantum computers for modelling quantum or classical data sets on near term quantum hardware.

The results discussed here for learning with quantum data also lead to a more philosophical question, which was raised to us by Iordanis Kerenidis: Is the learning of quantum distributions from polynomially-sized circuits truly a quantum problem, or is it rather classical machine learning?

In the following we now briefly attempt to answer this question as well.

Much of the attention of the QML literature for learning of quantum data has been focused on learning a circuit that can generate a given input quantum distribution. However, such distributions may be assumed to be derived from a polynomially-sized quantum circuit, which implies that the data has an efficient (polynomially-sized) classical description, namely the quantum circuit that generated it (or the circuits).

Under the assumption that there exist polynomially-sized quantum circuits, which generate the quantum data that we aim to learn, the problem can, however, be seen as a classical machine learning problem. This is because the learning can happen entirely with classical means by learning the gates of the circuit with the slight addition that one can run the quantum circuits as well. Note that running the quantum circuit can be seen as the query of a cost function or a sample generation, depending on the context. In this context, we indeed do not anticipate that a differentiation can be made between classical or quantum data and hence the learning.

The difference might still occur when we are dealing with quantum states that are derived from inaccessible (black-box) quantum processes. In this case, we cannot guarantee that there exists a polynomially sized quantum circuit which is able to generate the data. However, we are still able to learn an approximation to these states, see e.g. [176]. In this scenario, we can describe the problem again through a two-step process as a classical machine learning problem. First, we can for each quantum state from the data set learn a circuit that approximates the input state. Then, we take the set of circuits as an input model to the classical machine learning routine, and learn a classification of these. In this sense, the problem is entirely classical, with the difference that we rely on a quantum circuit for the evaluation of the cost function (e.g., the distance of the quantum states).

Given this, we anticipate that for certain tasks, such as classification tasks, no exponential advantage should be possible for quantum data.

Chapter 6

Conclusions

In this thesis, we have investigated the area of quantum machine learning, the combination of machine learning and quantum mechanics (in general), and in particular here the use of quantum computers to perform machine learning. We have particularly investigated supervised quantum machine learning methods, and answered two important questions:

The first research question asked what the performance of supervised quantum machine learning algorithms is under the common assumptions of statistical learning theory. Our analysis indicated that most of the established and existing quantum machine learning algorithms, which are all of a theoretical nature, do not offer any advantage over their classical counterparts under common assumptions. In particular, we gave the first results that rigorously study the performance of quantum machine learning algorithms in light of the target generalisation accuracy of the learning algorithm, which is of a statistical nature. We showed that any quantum machine learning algorithm will exhibit a polynomial scaling in the number of training samples, immediately excluding any exponential speedup. A more in-depth analysis of the algorithms demonstrated that indeed, taking this into account, many of the quantum machine learning algorithms do not offer any advantage over their classical counterparts.

This insight also allows us to immediately understand that noise in current generations of quantum computers significantly limits our ability to obtain good solutions, since it would immediately introduce an additional linear term in the error decomposition of Eq. 3.22.

A range of questions crucially remain open. First, while we better understand the limitations of quantum algorithms due to the generalisation ability, we are unable to investigate these bounds with respect to the complexity of the hypothesis space. A core open question and possible avenue forward to better understand the discrepancy between classical and quantum methods is the study of the complexity $c(\mathcal{H})$, i.e., the measure of the complexity of $\mathcal{H}$ (such as the VC dimension, covering numbers, or the Rademacher complexity [47, 44]). This might indeed clarify whether quantum kernels or other approaches could be used to improve the performance of quantum algorithms. First attempts in this direction exist in the works of [177] and [178]. In particular, the latter used the tools of statistical learning theory to identify under which circumstances a quantum kernel can give an advantage. While these results are very encouraging, [178] also demonstrates that for most problems a quantum advantage for learning problems is highly unlikely. In particular, the question remains for what type of practical problems such a quantum kernel would give an advantage.

Another open question of this thesis is the effect of the condition number. In practice, conditioning is a crucial aspect of many algorithms, and while we give here an indication about possible outcomes, a clear quantum advantage would need to take the condition number also into account. In particular, since existing algorithms, such as the one by Clader et al. [11], have been shown not to work [12], new approaches are needed to overcome the drawbacks of current quantum algorithms compared to their classical counterparts.

The second research question, on the other hand, investigated the challenge posed by the oracles that are used to establish exponentially faster (theoretical) computational complexities, i.e., run times. Most supervised quantum machine learning methods assume the existence of a logarithmic state preparation process, i.e., a quantum equivalent of a random access memory called qRAM [6]. We inspected this integral part of quantum machine learning under the lens of randomised numerical linear algebra, and asked what the comparative advantage of quantum algorithms over classical ones is if we can efficiently sample from the data in both cases.

We established on the example of Hamiltonian simulation, that given the ability to sample efficiently from the input data – according to some form of leverage – we are able to obtain computational complexities for Hamiltonian simulation that are independent of the dimension of the input matrix. To do this, we first derived a quantum algorithm for Hamiltonian simulation, which is based on a quantum random memory. Next, we developed a classical algorithm that performed a similar though harder task, but which also did not have any explicit dependency on the dimension. Our results were some of the first that introduced randomised methods into the classical simulation of and one of the first to study the comparative advantage of classical randomised algorithms over quantum ones.

While we did not manage to extend our algorithms immediately to any application in quantum machine learning, shortly after we published our results, Ewin Tang derived the now famous quantum-inspired algorithm for recommendation systems [17], which uses randomised methods similar to the ones we introduced for Hamiltonian simulation in the context of quantum machine learning. We hence also put our results in context by adding a discussion regarding the literature of quantum-inspired algorithms.

Notably, the larger question remains whether memory structures such as qRAM will ever be practically feasible. Older results [74] indicate that even small errors can diminish a quantum advantage for search, and therefore could also crucially affect quantum machine learning algorithms. While recent research [75] claims that such errors could be sustained by quantum machine learning algorithms, we believe this question is far from clear today.

A further constraint arises from the actual end-to-end costs of running a quantum algorithm using such a memory structure. In reality, the cost to build and run such devices still needs to be significantly reduced in order to make any advantage economic.

In the final chapter of the thesis, we then turned to future avenues for quantum machine learning. Since we established that state preparation is one of the biggest bottlenecks for successful applications and a possible advantage of quantum machine learning over classical methods, we investigated generative quantum models. We believe that these might hold the key to making quantum machine learning models useful in the intermediate future by replacing potentially infeasible quantum memory models such as qRAM. We resolved a major open question in the area of generative quantum modelling. Namely, we derived efficient algorithms for the gradient based training of Quantum Boltzmann Machines with hidden layers using the quantum relative entropy as an objective function. We presented in particular two algorithms: one that applies in a more restricted case and is more efficient, and a second one that applies to arbitrary instances but is less efficient. Our results hold under the assumption that we can efficiently prepare Gibbs distributions, which is not true in general, but is expected to hold for most practical applications.

A key question that is not treated in this thesis is how we can establish good benchmarks for new quantum algorithms such as those in quantum machine learning. This slightly extends the more general question of how to judge normal optimisation algorithms for exactly the statistical guarantees that we aim to achieve.

In practice, it might be required to perform the following steps to evaluate a new algorithm:

1. Obtain the theoretical time complexity of an algorithm (asymptotic).

2. Determine the dependencies (error, input dimensions, condition number, etc.) that are in common with the classical algorithm and identify how these compare.

3. Use known bounds on the target generalisation accuracy and see if the advantage remains.

4. Estimate overheads (pre-factors) for the algorithm using the best known results to go from the asymptotic scaling to a concrete cost.

5. Use estimates of condition number and other properties for some common machine learning datasets to see if a practical advantage is feasible (given the overhead).

6. Take other error sources into account and check if the advantage still remains (i.e., take into account error rates in the device and how they affect the bounds). This must also include overheads from error correction.

Benchmarks such as the one above, which are of a theoretical nature, are the only option to establish a quantum advantage unless fully error-corrected devices are available for testing. However, defining a common way to perform such benchmarks remains an open question in the quantum algorithms community.

More recently, the community started to investigate the area of so-called quantum kernel methods. These might enable us to find new ways to leverage quantum computing to better identify patterns in certain types of data.

From a practical point of view, however, the ultimate test for any method is the demonstrated performance on a benchmark data set, such as MNIST. While today's quantum computers appear to be far away from this, it will be the only true way to establish a performance advantage for quantum machine learning over classical methods.

One interesting aspect that we have not considered in this thesis is the reconciliation of research questions one and two, namely how classical randomised algorithms and quantum algorithms fit into the picture of statistical learning theory. It appears that a much fairer comparison is possible between quantum and classical randomised methods. Indeed, there exist classical algorithms for machine learning which leverage randomisation, optimal rates (generalisation accuracy), and other aspects such as preconditioning to obtain performance improvements. Such methods typically lead to a better time-scaling while also resulting in optimal learning rates; take for example FALCON [10].

Whether quantum machine learning will become useful in the future using generative models for the data access, or whether the polynomial speedup of quantum machine learning algorithms over their classical (quantum-inspired) versions will hold remains an open question.

Ultimately, we believe that these questions can only truly be resolved through implementation and benchmarking. Given the current state of quantum computing hardware, however, we believe that such benchmarks are unlikely to be possible in the next few years, and that researchers will need to continue to make theoretical progress long before we are able to resolve the question of a 'quantum advantage' through an actual benchmark.

Appendix A

Appendix 1: Quantum Subroutines

A.1 Amplitude estimation

In the following we describe the established amplitude estimation algorithm [179]:

Algorithm 7 Amplitude estimation
Input: Density matrix $\rho$, unitary operator $U : \mathbb{C}^{2^n} \to \mathbb{C}^{2^n}$, qubit registers $|0\rangle \otimes |0\rangle^{\otimes n}$.
Output: An $\tilde{\varepsilon}$-close estimate of $\mathrm{Tr}(U\rho)$.
1. Initialize two registers of appropriate sizes to the state $|0\rangle A|0\rangle$, where $A$ is a unitary transformation which prepares the input state, i.e., $|\psi\rangle = A|0\rangle$.
2. Apply the quantum Fourier transform $\mathrm{QFT}_N : |x\rangle \to \frac{1}{\sqrt{N}} \sum_{y=0}^{N-1} e^{2\pi i x y/N} |y\rangle$ for $0 \leq x < N$, to the first register.
3. Apply $\Lambda_N(Q)$ to the second register, i.e., let $\Lambda_N(U) : |j\rangle|y\rangle \to |j\rangle (U^j |y\rangle)$ for $0 \leq j < N$, then we apply $\Lambda_N(Q)$ where $Q := -A S_0 A^{\dagger} S_t$ is the Grover operator.
4. Apply $\mathrm{QFT}_N^{\dagger}$ to the first register.
5. Return $\tilde{a} = \sin^2\left(\pi \frac{\tilde{\theta}}{N}\right)$.

Algorithm 7 describes the amplitude estimation algorithm. The output is an $\varepsilon$-close estimate of the target amplitude. Note that in step (3), $S_0$ changes the sign of the amplitude if and only if the state is the zero state $|0\rangle$, and $S_t$ is the sign-flip operator for the target state, i.e., if $|x\rangle$ is the desired outcome, then $S_t := I - 2|x\rangle\langle x|$.

The algorithm can be summarized as the unitary transformation

\[ \left(\mathrm{QFT}_N^{\dagger} \otimes I\right) \Lambda_N(Q) \left(\mathrm{QFT}_N \otimes I\right) \]

applied to the state $|0\rangle A|0\rangle$, followed by a measurement of the first register and classical post-processing, which returns an estimate $\tilde{\theta}$ of the amplitude of the desired outcome such that $|\theta - \tilde{\theta}| \leq \varepsilon$ with probability at least $8/\pi^2$. The result is summarized in the following theorem, which states a slightly more general version.

Theorem 14 (Amplitude Estimation [179]). For any positive integer $k$, the Amplitude Estimation Algorithm returns an estimate $\tilde{a}$ ($0 \leq \tilde{a} \leq 1$) such that

\[ |\tilde{a} - a| \leq 2\pi k \frac{\sqrt{a(1-a)}}{N} + k^2 \frac{\pi^2}{N^2} \]

with probability at least $\frac{8}{\pi^2} \approx 0.81$ for $k = 1$ and with probability greater than $1 - \frac{1}{2(k-1)}$ for $k \geq 2$. If $a = 0$ then $\tilde{a} = 0$ with certainty, and if $a = 1$ and $N$ is even, then $\tilde{a} = 1$ with certainty.

Notice that the amplitude $\theta$ can hence be recovered via the relation $\theta = \arcsin\sqrt{\theta_a}$ as described above, which incurs an $\varepsilon$-error for $\theta$ (c.f., Lemma 7, [179]).

A.2 The Hadamard test

Here we present an easy subroutine to evaluate the trace of products of unitary operators $U$ with a density matrix $\rho$, which is known as the Hadamard test.

Algorithm 8 Variational gradient estimation - term 2
Input: Density matrix $\rho$, unitary operator $U : \mathbb{C}^{2^n} \to \mathbb{C}^{2^n}$, qubit registers $|0\rangle \otimes |0\rangle^{\otimes n}$.
Output: An $\tilde{\varepsilon}$-close estimate of $\mathrm{Tr}(U\rho)$.
1. Prepare the first qubit in the $|+\rangle$ state and initialize the second register to 0.
2. Use an appropriate subroutine to prepare the density matrix $\rho$ on the second register to obtain the state $|+\rangle\langle +| \otimes \rho$.
3. Apply a controlled operation $|0\rangle\langle 0| \otimes I_{2^n} + |1\rangle\langle 1| \otimes U$, followed by a Hadamard gate.
4. Perform amplitude estimation on the $|0\rangle$ state, via the reflector $P := -2|0\rangle\langle 0| + I$, and operator $G := (2\rho - I)(P \otimes I)$.
5. Measure the phase estimation register, which returns an $\tilde{\varepsilon}$-estimate of the probability $\frac{1}{2}\left(1 + \mathrm{Re}[\mathrm{Tr}(U\rho)]\right)$ of the Hadamard test to return 0.
6. Repeat the procedure with an additional controlled application of $\exp(i\pi/2)$ in step (3) to also recover the imaginary part of the result.
7. Return the real and imaginary part of the probability estimates.
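The outcome statistics of the Hadamard test in Algorithm 8 can be illustrated with a small classical simulation. The following is only a sketch under stated assumptions: it computes the ancilla probability $\frac{1}{2}(1 + \mathrm{Re}[\mathrm{Tr}(U\rho)])$ directly from small matrices and replaces the amplitude estimation step by plain repeated sampling. The function names (`hadamard_test_p0`, `estimate_trace`) are hypothetical and not taken from the thesis.

```python
import random


def matmul(a, b):
    """Product of two square matrices given as nested lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def trace(a):
    return sum(a[i][i] for i in range(len(a)))


def hadamard_test_p0(U, rho, extra_phase=False):
    """Probability that the ancilla reads 0: (1 + Re[Tr(U rho)]) / 2.

    With extra_phase=True the controlled unitary is multiplied by exp(i*pi/2),
    so the probability becomes (1 - Im[Tr(U rho)]) / 2.
    """
    t = complex(trace(matmul(U, rho)))
    if extra_phase:
        t = 1j * t
    return 0.5 * (1.0 + t.real)


def estimate_trace(U, rho, shots, rng):
    """Estimate Tr(U rho) from simulated ancilla measurements (assumption:
    plain sampling stands in for amplitude estimation)."""
    p_re = hadamard_test_p0(U, rho)
    p_im = hadamard_test_p0(U, rho, extra_phase=True)
    f_re = sum(rng.random() < p_re for _ in range(shots)) / shots
    f_im = sum(rng.random() < p_im for _ in range(shots)) / shots
    return complex(2.0 * f_re - 1.0, 1.0 - 2.0 * f_im)
```

With plain sampling the statistical error decays as the inverse square root of the number of shots, whereas the amplitude estimation step of Algorithm 8 targets an inverse-linear dependence on the precision.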

Note that this procedure can easily be adapted to the case where $\rho$ is some Gibbs distribution. We would then use a Gibbs state preparation routine in step (2). For example, for the evaluation of the gradient of the variational bound, we require this subroutine to evaluate $U = \partial_\theta H$ for $\rho$ being the Gibbs distribution corresponding to the Hamiltonian $H$.

Appendix B

Appendix 2: Deferred proofs

First for convenience, we formally define quantum Boltzmann machines below.

Definition 7. A quantum Boltzmann machine is defined to be a quantum mechanical system that acts on a tensor product of Hilbert spaces $\mathcal{H}_v \otimes \mathcal{H}_h \in \mathbb{C}^{2^n}$ that correspond to the visible and hidden subsystems of the Boltzmann machine. It further has a Hamiltonian of the form $H \in \mathbb{C}^{2^n \times 2^n}$ such that $\|H - \mathrm{diag}(H)\| > 0$. The quantum Boltzmann machine takes these parameters and then outputs a state of the form

\[ \mathrm{Tr}_h\left( \frac{e^{-H}}{\mathrm{Tr}(e^{-H})} \right). \]

Given this definition, we are then able to discuss the gradient of the relative entropy between the output of a quantum Boltzmann machine and the input data that it is trained with.

B.1 Derivation of the variational bound

Proof of Lemma 16. Recall that we assume that the Hamiltonian H takes the form

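\[ H := \sum_k \theta_k v_k \otimes h_k, \]

where $v_k$ and $h_k$ are operators acting on the visible and hidden units respectively, and we can assume $h_k = d_k$ to be diagonal in the chosen basis. Under the assumption that $[h_i, h_j] = 0$ for all $i, j$ (c.f. the assumptions in (5.9)), there exists a basis $\{|h\rangle\}$ for the hidden subspace such that $h_k |h\rangle = E_{h,k} |h\rangle$. With these assumptions we can hence reformulate the logarithm as

\begin{align}
\log \mathrm{Tr}_h\left[e^{-H}\right] &= \log\left( \sum_{v,v',h} \langle v,h|\, e^{-\sum_k \theta_k v_k \otimes h_k}\, |v',h\rangle\, |v\rangle\langle v'| \right) \tag{B.1} \\
&= \log\left( \sum_{v,v',h} \langle v|\, e^{-\sum_k E_{h,k}\theta_k v_k}\, |v'\rangle\, |v\rangle\langle v'| \right) \tag{B.2} \\
&= \log\left( \sum_h e^{-\sum_k E_{h,k}\theta_k v_k} \right), \tag{B.3}
\end{align}

where it is important to note that the $v_k$ are operators and we hence just used the matrix representation of these in the last step. In order to further simplify this expression, first note that each term in the sum is a positive semi-definite operator. In particular, note that the matrix logarithm is operator concave and operator monotone, and hence by Jensen's inequality, for any sequence of non-negative numbers $\{\alpha_i\}$ with $\sum_i \alpha_i = 1$ we have that

\[ \log\left( \frac{\sum_{i=1}^N \alpha_i U_i}{\sum_j \alpha_j} \right) \geq \frac{\sum_{i=1}^N \alpha_i \log(U_i)}{\sum_j \alpha_j}, \]

and since we are optimizing $\mathrm{Tr}(\rho \log\rho) - \mathrm{Tr}(\rho \log\sigma_v)$ we hence obtain, for an arbitrary choice of $\{\alpha_i\}_i$ under the above constraints,

\begin{align}
\mathrm{Tr}\left( \rho \log\left( \sum_{h=1}^N e^{-\sum_k E_{h,k}\theta_k v_k} \right) \right) &= \mathrm{Tr}\left( \rho \log\left( \frac{\sum_{h=1}^N \alpha_h\, e^{-\sum_k E_{h,k}\theta_k v_k}/\alpha_h}{\sum_{h'} \alpha_{h'}} \right) \right) \nonumber \\
&\geq -\mathrm{Tr}\left( \rho\, \frac{\sum_h \alpha_h \sum_k E_{h,k}\theta_k v_k + \sum_h \alpha_h \log\alpha_h}{\sum_{h'} \alpha_{h'}} \right). \tag{B.4}
\end{align}

Hence, the variational bound on the objective function for any $\{\alpha_i\}_i$ is

\begin{align}
O_\rho(H) &= \mathrm{Tr}(\rho \log\rho) - \mathrm{Tr}(\rho \log\sigma_v) \nonumber \\
&\leq \mathrm{Tr}(\rho \log\rho) + \mathrm{Tr}\left( \rho\, \frac{\sum_h \alpha_h \sum_k E_{h,k}\theta_k v_k + \sum_h \alpha_h \log\alpha_h}{\sum_{h'} \alpha_{h'}} \right) + \log Z =: \tilde{S}. \tag{B.5}
\end{align}

B.2 Gradient estimation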

For the following result we will rely on a variational bound in order to train the quantum Boltzmann machine weights for a Hamiltonian H of the form given in (5.10). We begin by proving Lemma 17 in the main work, which will give us an upper bound for the gradient of the relative entropy.

Proof of Lemma 17. We first derive the gradient of the normalization term ($Z$) in the relative entropy, which can be evaluated using Duhamel's formula to obtain

\[ \frac{\partial}{\partial\theta_p} \log \mathrm{Tr}\left[e^{-H}\right] = -\mathrm{Tr}\left[ \frac{\partial H}{\partial\theta_p} \frac{e^{-H}}{Z} \right] = -\mathrm{Tr}\left[ \sigma\, \partial_{\theta_p} H \right]. \]

Note that we can easily evaluate this term by first preparing the Gibbs state $\sigma_{\mathrm{Gibbs}} := e^{-H}/Z$ and then evaluating the expectation value of the operator $\partial_{\theta_p} H$ w.r.t. the Gibbs state, using amplitude estimation for the Hadamard test. If $T_{\mathrm{Gibbs}}$ is the query complexity for the Gibbs state preparation, the query complexity of the whole algorithm including the phase estimation step is then given by $O(T_{\mathrm{Gibbs}}/\tilde{\varepsilon})$ for an $\tilde{\varepsilon}$-accurate estimate of phase estimation. Taking into account the desired accuracy and the error propagation then directly gives the computational complexity of evaluating this part.

We now proceed with the gradient evaluations for the model term. Using the variational bound on the objective function for any $\{\alpha_i\}_i$, given in eq. (B.5), we obtain the gradient

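\begin{align}
\frac{\partial \tilde{S}}{\partial\theta_p} &= -\mathrm{Tr}\left[ \frac{\partial H}{\partial\theta_p} \frac{e^{-H}}{Z} \right] + \frac{\partial}{\partial\theta_p} \mathrm{Tr}\left[ \rho \sum_h \alpha_h \sum_k E_{h,k}\theta_k v_k \right] + \frac{\partial}{\partial\theta_p} \sum_h \alpha_h \log\alpha_h \tag{B.6} \\
&= -\mathrm{Tr}\left[ \frac{\partial H}{\partial\theta_p} \frac{e^{-H}}{Z} \right] + \frac{\partial}{\partial\theta_p} \left( \sum_h \alpha_h \mathrm{Tr}\left[ \rho \sum_k E_{h,k}\theta_k v_k \right] + \sum_h \alpha_h \log\alpha_h \right), \tag{B.7}
\end{align}

where the first term results from the partition sum. The second term can be seen as the energy of a new effective Hamiltonian, while the last term is the (negative) entropy. Together, the last two terms hence resemble the free energy $F(h) = E(h) - T S(h)$, where $E(h)$ is the mean energy of the effective system with energies $E(h) := \mathrm{Tr}\left[ \rho \sum_k E_{h,k}\theta_k v_k \right]$, $T$ the temperature and $S(h)$ the Shannon entropy of the $\alpha_h$ distribution. We now want to choose these $\alpha_h$ terms to minimize this variational upper bound. It is well-established in statistical physics, see for example [180], that the distribution which minimizes the free energy is the Boltzmann (or Gibbs) distribution, i.e.,

\[ \alpha_h = \frac{e^{-\mathrm{Tr}(\rho \tilde{H}_h)}}{\sum_h e^{-\mathrm{Tr}(\rho \tilde{H}_h)}}, \]

where $\tilde{H}_h := \sum_k E_{h,k}\theta_k v_k$ is a new effective Hamiltonian on the visible units, and the $\{\alpha_i\}$ are given by the corresponding Gibbs distribution for the visible units.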
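The statement that the Gibbs distribution is the optimal choice of weights can be checked numerically for a small example. The sketch below assumes that the mean energies $\mathrm{Tr}(\rho\tilde{H}_h)$ are already available as plain numbers (the list `energies` is a hypothetical stand-in) and sets the temperature to $T = 1$; it then compares the quantity $\sum_h \alpha_h E_h + \sum_h \alpha_h \log\alpha_h$ for the Gibbs weights against other distributions.

```python
import math


def gibbs(energies):
    """Gibbs weights alpha_h = exp(-E_h) / sum_h exp(-E_h), computed stably."""
    m = min(energies)
    w = [math.exp(-(e - m)) for e in energies]
    z = sum(w)
    return [x / z for x in w]


def free_energy(alpha, energies):
    """Mean energy minus Shannon entropy (temperature T = 1)."""
    mean_energy = sum(a * e for a, e in zip(alpha, energies))
    neg_entropy = sum(a * math.log(a) for a in alpha if a > 0.0)
    return mean_energy + neg_entropy


def log_partition(energies):
    """log sum_h exp(-E_h); the minimal free energy equals its negative."""
    m = min(energies)
    return -m + math.log(sum(math.exp(-(e - m)) for e in energies))
```

For the Gibbs weights the free energy equals `-log_partition(energies)`, and any other normalized distribution yields a value that is at least as large.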

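Therefore, our gradients can be taken with respect to this distribution and the bound above, where $\mathrm{Tr}(\rho \tilde{H}_h)$ is the mean energy of the effective visible system w.r.t. the data distribution. For the derivative of the energy term we obtain

\begin{align}
&\frac{\partial}{\partial\theta_p} \sum_h \alpha_h \mathrm{Tr}\left[ \rho \sum_k E_{h,k}\theta_k v_k \right] \tag{B.8} \\
&\quad = \sum_h \alpha_h \left( \mathbb{E}_{h'}\left[ \mathrm{Tr}(\rho E_{h',p} v_p) \right] - \mathrm{Tr}(\rho E_{h,p} v_p) \right) \mathrm{Tr}(\rho \tilde{H}_h) + \alpha_h \mathrm{Tr}(\rho E_{h,p} v_p) \tag{B.9} \\
&\quad = \mathbb{E}_h\left[ \left( \mathbb{E}_{h'}\left[ \mathrm{Tr}(\rho E_{h',p} v_p) \right] - \mathrm{Tr}(\rho E_{h,p} v_p) \right) \mathrm{Tr}(\rho \tilde{H}_h) + \mathrm{Tr}(\rho E_{h,p} v_p) \right], \tag{B.10}
\end{align}

while the entropy term yields

\begin{align}
\frac{\partial}{\partial\theta_p} \sum_h \alpha_h \log\alpha_h &= \sum_h \alpha_h \left[ \left( \mathrm{Tr}(\rho E_{h,p} v_p) - \mathbb{E}_{h'}\left[ \mathrm{Tr}(\rho E_{h',p} v_p) \right] \right) \mathrm{Tr}(\rho \tilde{H}_h) - \mathrm{Tr}(\rho E_{h,p} v_p) \right] \nonumber \\
&\quad + \sum_h \alpha_h \left( \mathrm{Tr}(\rho E_{h,p} v_p) - \mathbb{E}_{h'}\left[ \mathrm{Tr}(\rho E_{h',p} v_p) \right] \right) \log \mathrm{Tr}\left[ e^{-\tilde{H}_h} \right] \nonumber \\
&\quad + \mathbb{E}_{h'}\left[ \mathrm{Tr}(\rho E_{h',p} v_p) \right]. \tag{B.11}
\end{align}

This can be further simplified to

\begin{align}
&\sum_h \alpha_h \left( \mathrm{Tr}(\rho E_{h,p} v_p) - \mathbb{E}_{h'}\left[ \mathrm{Tr}(\rho E_{h',p} v_p) \right] \right) \mathrm{Tr}(\rho \tilde{H}_h) \tag{B.12} \\
&\quad = \mathbb{E}_h\left[ \left( \mathrm{Tr}(\rho E_{h,p} v_p) - \mathbb{E}_{h'}\left[ \mathrm{Tr}(\rho E_{h',p} v_p) \right] \right) \mathrm{Tr}(\rho \tilde{H}_h) \right]. \tag{B.13}
\end{align}

The resulting gradient for the variational bound for the visible terms is hence given by

\[ \frac{\partial \tilde{S}}{\partial\theta_p} = \mathbb{E}_h\left[ \mathrm{Tr}(\rho E_{h,p} v_p) \right] - \mathrm{Tr}\left[ \frac{\partial H}{\partial\theta_p} \frac{e^{-H}}{Z} \right]. \tag{B.14} \]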

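Notably, if we consider no interactions between the visible and the hidden layer, then the gradient above indeed reduces to the gradient for the visible Boltzmann machine, which was treated in [151], resulting in the gradient

\[ \mathrm{Tr}\left[ \rho\, \partial_{\theta_p} H \right] - \mathrm{Tr}\left[ \frac{e^{-H}}{Z} \partial_{\theta_p} H \right], \]

under our assumption on the form of $H$, $\partial_{\theta_p} H = v_p$.

B.2.1 Operationalizing the gradient based training

From Lemma 17, we know that the derivative of the relative entropy w.r.t. any parameter $\theta_p$ can be stated as

\[ \frac{\partial \tilde{S}}{\partial\theta_p} = \mathbb{E}_h\left[ E_{h,p} \right] \mathrm{Tr}(\rho v_p) - \mathrm{Tr}\left[ \frac{\partial H}{\partial\theta_p} \frac{e^{-H}}{Z} \right]. \tag{B.15} \]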
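The first term of (B.15) can be sanity-checked against finite differences in a purely classical toy setting. The sketch below makes several assumptions that are not part of the thesis: the traces $\mathrm{Tr}(\rho v_k)$ are replaced by fixed numbers `t[k]`, the eigenvalues $E_{h,k}$ are given as a small table `E[h][k]`, and the optimal Gibbs weights are inserted, so that the energy and entropy terms combine to $-\log\sum_h e^{-\sum_k E_{h,k}\theta_k t_k}$.

```python
import math


def effective_energies(theta, E, t):
    """Mean energies sum_k E[h][k] * theta[k] * t[k] for every hidden state h."""
    return [sum(E_h[k] * theta[k] * t[k] for k in range(len(theta))) for E_h in E]


def model_term(theta, E, t):
    """Energy plus entropy term at the optimal weights: -log sum_h exp(-E_h)."""
    energies = effective_energies(theta, E, t)
    m = min(energies)
    return m - math.log(sum(math.exp(-(e - m)) for e in energies))


def analytic_gradient(theta, E, t, p):
    """E_h[E_{h,p}] * t_p with Gibbs weights alpha_h, as in the first term of (B.15)."""
    energies = effective_energies(theta, E, t)
    m = min(energies)
    w = [math.exp(-(e - m)) for e in energies]
    z = sum(w)
    return sum((w_h / z) * E_h[p] for w_h, E_h in zip(w, E)) * t[p]


def numeric_gradient(theta, E, t, p, step=1e-6):
    """Central finite difference of model_term with respect to theta[p]."""
    up = list(theta)
    down = list(theta)
    up[p] += step
    down[p] -= step
    return (model_term(up, E, t) - model_term(down, E, t)) / (2.0 * step)
```

Agreement between the two functions reflects the cancellations in (B.8)-(B.13): the terms that arise from differentiating the weights $\alpha_h$ drop out, and only the expectation of $E_{h,p}$ remains.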

Since evaluating the latter part is, as mentioned above, straightforward, we give here an algorithm for evaluating the first part.

Now note that we can evaluate each term $\mathrm{Tr}(\rho v_k)$ individually for all $k \in [D]$, i.e., all $D$ dimensions of the gradient, via the Hadamard test for $v_k$, assuming $v_k$ is unitary. More generally, for non-unitary $v_k$ we could evaluate this term using a linear combination of unitary operations. Therefore, the remaining task is to evaluate the terms $\mathbb{E}_h[E_{h,p}]$ in (B.15), which reduces to sampling according to the distribution $\{\alpha_h\}$.

For this we need to be able to create a Gibbs distribution for the effective Hamiltonian $\tilde{H}_h = \sum_k \theta_k \mathrm{Tr}(\rho v_k) h_k$, which contains only $D$ terms and can hence be evaluated efficiently as long as $D$ is small, which we can generally assume to be true. In order to sample according to the distribution $\{\alpha_h\}$, we first evaluate the factors $\theta_k \mathrm{Tr}(\rho v_k)$ in the sum over $k$ via the Hadamard test, and then use these in order to implement the Gibbs distribution $\exp(-\tilde{H}_h)/\tilde{Z}$ for the Hamiltonian

\[ \tilde{H}_h = \sum_k \theta_k \mathrm{Tr}(\rho v_k) h_k. \]

In order to do so, we adapt the results of [175] in order to prepare the corresponding Gibbs state (although alternative methods can also be used [181, 182, 183]).

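Theorem 15 (Gibbs state preparation [175]). Suppose that $I \preceq H$ and we are given $K \in \mathbb{R}_+$ such that $\|H\| \leq 2K$, and let $H \in \mathbb{C}^{N \times N}$ be a $d$-sparse Hamiltonian, and we know a lower bound $z \leq Z = \mathrm{Tr}(e^{-H})$. If $\varepsilon \in (0, 1/3)$, then we can prepare a purified Gibbs state $|\gamma\rangle_{AB}$ such that

\[ \left\| \mathrm{Tr}_B\left[ |\gamma\rangle\langle\gamma|_{AB} \right] - \frac{e^{-H}}{Z} \right\| \leq \varepsilon \tag{B.16} \]

using

\[ \tilde{O}\left( \sqrt{\frac{N}{z}}\, K d \log\left(\frac{K}{\varepsilon}\right) \log\left(\frac{1}{\varepsilon}\right) \right) \tag{B.17} \]

queries, and

\[ \tilde{O}\left( \sqrt{\frac{N}{z}}\, K d \log\left(\frac{K}{\varepsilon}\right) \log\left(\frac{1}{\varepsilon}\right) \left[ \log(N) + \log^{5/2}\left(\frac{K}{\varepsilon}\right) \right] \right) \tag{B.18} \]

gates.

Note that by using the above algorithm with $\tilde{H}_{\mathrm{sim}}/2$, the preparation of the purified Gibbs state will leave us in the state

\[ |\psi\rangle_{\mathrm{Gibbs}} := \sum_h \frac{e^{-E_h/2}}{\sqrt{Z}} |h\rangle_A |\phi_h\rangle_B, \tag{B.19} \]

where $|\phi_j\rangle_B$ are mutually orthogonal trash states, which can typically be chosen to be $|h\rangle$, i.e., a copy of the first register, which is irrelevant for our computation, and $|h\rangle_A$ are the eigenstates of $\tilde{H}$. Tracing out the second register will hence leave us in the corresponding Gibbs state

\[ \sigma_h := \sum_h \frac{e^{-E_h}}{Z} |h\rangle\langle h|_A, \]

and we can hence now use the Hadamard test with input $h_k$ and $\sigma_h$, i.e., the operators on the hidden units and the Gibbs state, and estimate the expectation value $\mathbb{E}_h[E_{h,k}]$. We provide such a method below.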

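Proof of Theorem 12. Conceptually, we perform the following steps, starting with Gibbs state preparation followed by a Hadamard test coupled with amplitude estimation to obtain estimates of the probability of a 0 measurement. The proof follows directly from the algorithm described in 5. From this we see that the runtime consists of the query complexity of preparing the Gibbs state,

\[ T^V_{\mathrm{Gibbs}} = \tilde{O}\left( \sqrt{\frac{2^n}{Z}} \frac{\|H(\theta)\| d}{\varepsilon} \log\left(\frac{\|H(\theta)\|}{\tilde{\varepsilon}}\right) \log\left(\frac{1}{\tilde{\varepsilon}}\right) \right), \]

where $2^n$ is the dimension of the Hamiltonian, as given in Theorem 12, combined with the query complexity of the amplitude estimation procedure, i.e., $1/\varepsilon$. However, in order to obtain a final error of $\varepsilon$, we will also need to account for the error in the Gibbs state preparation. For this, note that we estimate terms of the form $\mathrm{Tr}_{AB}\left[ \langle\psi|^V_{\mathrm{Gibbs}} (h_k \otimes I) |\psi\rangle^V_{\mathrm{Gibbs}} \right] = \mathrm{Tr}_{AB}\left[ (h_k \otimes I) |\psi\rangle^V_{\mathrm{Gibbs}} \langle\psi|^V_{\mathrm{Gibbs}} \right]$. We can hence estimate the error w.r.t. the true Gibbs state $\sigma_{\mathrm{Gibbs}}$ as

\begin{align}
&\left| \mathrm{Tr}_{AB}\left[ (h_k \otimes I) |\psi\rangle^V_{\mathrm{Gibbs}} \langle\psi|^V_{\mathrm{Gibbs}} \right] - \mathrm{Tr}_A\left[ h_k \sigma_{\mathrm{Gibbs}} \right] \right| \nonumber \\
&\quad = \left| \mathrm{Tr}_A\left[ h_k \mathrm{Tr}_B\left[ |\psi\rangle^V_{\mathrm{Gibbs}} \langle\psi|^V_{\mathrm{Gibbs}} \right] - h_k \sigma_{\mathrm{Gibbs}} \right] \right| \nonumber \\
&\quad \leq \sum_i \sigma_i(h_k) \left\| \mathrm{Tr}_B\left[ |\psi\rangle^V_{\mathrm{Gibbs}} \langle\psi|^V_{\mathrm{Gibbs}} \right] - \sigma_{\mathrm{Gibbs}} \right\| \nonumber \\
&\quad \leq \tilde{\varepsilon} \sum_i \sigma_i(h_k). \tag{B.20}
\end{align}

For the final error to be less than $\varepsilon$, the precision we use in the phase estimation procedure, we hence need to set $\tilde{\varepsilon} = \varepsilon/(2\sum_i \sigma_i(h_k)) \leq 2^{-n-1}\varepsilon$, recalling that $h_k$ is unitary, and similarly precision $\varepsilon/2$ for the amplitude estimation, which yields the query complexity of

\begin{align}
&O\left( \sqrt{\frac{N_h}{z_h}} \frac{\|H(\theta)\| d}{\varepsilon} \left( n^2 + n \log\left(\frac{\|H(\theta)\|}{\varepsilon}\right) + n \log\left(\frac{1}{\varepsilon}\right) + \log\left(\frac{\|H(\theta)\|}{\varepsilon}\right) \log\left(\frac{1}{\varepsilon}\right) \right) \right) \nonumber \\
&\quad \in \tilde{O}\left( \sqrt{\frac{N_h}{z_h}} \frac{n^2 \|\theta\|_1 d}{\varepsilon} \right), \tag{B.21}
\end{align}

where we denote with $A$ the hidden subsystem with dimensionality $2^{n_h} \leq 2^{N}$, on which we want to prepare the Gibbs state, and with $B$ the subsystem for the trash state.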

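Similarly, the evaluation of the second part in (B.15) requires the Gibbs state preparation for $H$, the Hadamard test and phase estimation. As above, we need to take the error into account. Letting the purified version of the Gibbs state for $H$ be given by $|\psi\rangle_{\mathrm{Gibbs}}$, which we obtain using Theorem 15, and $\sigma_{\mathrm{Gibbs}}$ be the perfect state, the error is given by

\begin{align}
&\left| \mathrm{Tr}_{AB}\left[ (v_k \otimes h_k \otimes I) |\psi\rangle_{\mathrm{Gibbs}} \langle\psi|_{\mathrm{Gibbs}} \right] - \mathrm{Tr}_A\left[ (v_k \otimes h_k) \sigma_{\mathrm{Gibbs}} \right] \right| \nonumber \\
&\quad = \left| \mathrm{Tr}_A\left[ (v_k \otimes h_k) \mathrm{Tr}_B\left[ |\psi\rangle^V_{\mathrm{Gibbs}} \langle\psi|^V_{\mathrm{Gibbs}} \right] - (v_k \otimes h_k) \sigma_{\mathrm{Gibbs}} \right] \right| \nonumber \\
&\quad \leq \sum_i \sigma_i(v_k \otimes h_k) \left\| \mathrm{Tr}_B\left[ |\psi\rangle^V_{\mathrm{Gibbs}} \langle\psi|^V_{\mathrm{Gibbs}} \right] - \sigma_{\mathrm{Gibbs}} \right\| \nonumber \\
&\quad \leq \tilde{\varepsilon} \sum_i \sigma_i(v_k \otimes h_k), \tag{B.22}
\end{align}

where in this case $A$ is the subsystem of the visible and hidden subspace and $B$ the trash system. We hence upper bound the error as above, and introducing $\xi := \max[N/z, N_h/z_h]$ we find a uniform bound on the query complexity for evaluating a single entry of the $D$-dimensional gradient, which is in

\[ \tilde{O}\left( \sqrt{\xi}\, \frac{n^2 \|\theta\|_1 d}{\varepsilon} \right). \]

Thus we attain the claimed query complexity by repeating this procedure for each of the $D$ components of the estimated gradient vector $S$.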

Note that we also need to evaluate the terms $\mathrm{Tr}(\rho v_k)$ to precision $\hat{\varepsilon} \leq \varepsilon$, which however only incurs an additive cost of $D/\varepsilon$ to the total query complexity, since this step needs to be performed only once. Note that $|\mathbb{E}_h(h_p)| \leq 1$ because $h_p$ is assumed to be unitary. To complete the proof we only need to take the success probability of the amplitude estimation process into account. For completeness we state the algorithm in the appendix and here refer only to Theorem 14, from which we have that the procedure succeeds with probability at least $8/\pi^2$. In order to have a failure probability of the final algorithm of less than $1/3$, we need to repeat the procedure for all $D$ dimensions of the gradient and take the median. We can bound the number of repetitions in the following way.

Let $n_f$ be the number of instances of the gradient estimate such that the error is larger than $\varepsilon$ and $n_s$ be the number of instances with an error $\leq \varepsilon$ for one dimension of the gradient, and let the result that we take be the median of the estimates, where we take $n = n_s + n_f$ samples. The algorithm gives a wrong answer for each dimension if $n_s \leq \lfloor n/2 \rfloor$, since then the median is a sample such that the error is not bounded by $\varepsilon$. Let $p = 8/\pi^2$ be the success probability to draw a positive sample, as is the case for the amplitude estimation procedure. Since each instance of the phase estimation algorithm will independently return an estimate, the total failure probability is given by the union bound, i.e.,

\[ \Pr\nolimits_{\mathrm{fail}} \leq D \cdot \Pr\left[ n_s \leq \left\lfloor \frac{n}{2} \right\rfloor \right] \leq D \cdot e^{-\frac{n}{2p}\left(p - \frac{1}{2}\right)^2} \leq \frac{1}{3}, \tag{B.23} \]

which follows from the Chernoff inequality for a binomial variable with $p > 1/2$, which is given in our case. Therefore, by taking $n \geq \frac{2p}{(p - 1/2)^2} \log(3D) = \frac{16\pi^2}{(8 - \pi^2/2)^2} \log(3D) = O(\log(3D))$, we achieve a total failure probability of at most $1/3$.
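The number of repetitions implied by (B.23) is easy to evaluate explicitly. The following sketch computes the smallest $n$ for which the Chernoff-type bound $D \cdot e^{-\frac{n}{2p}(p - 1/2)^2}$ drops below $1/3$, with $p = 8/\pi^2$; the function names are hypothetical.

```python
import math

# Success probability of a single amplitude estimation run (Theorem 14, k = 1).
P_SUCCESS = 8.0 / math.pi ** 2


def failure_bound(n, num_components, p=P_SUCCESS):
    """Union plus Chernoff bound D * exp(-n (p - 1/2)^2 / (2p)) from (B.23)."""
    rate = (p - 0.5) ** 2 / (2.0 * p)
    return num_components * math.exp(-n * rate)


def repetitions(num_components, p=P_SUCCESS):
    """Smallest n such that failure_bound(n, D) <= 1/3."""
    rate = (p - 0.5) ** 2 / (2.0 * p)
    return math.ceil(math.log(3.0 * num_components) / rate)
```

The prefactor $2p/(p - 1/2)^2$ is roughly 16.8, so the number of repetitions grows only logarithmically with the number of gradient components $D$.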

This is sufficient to demonstrate the validity of the algorithm if

\[ \mathrm{Tr}(\rho \tilde{H}_h) \tag{B.24} \]

is known exactly. This is difficult to achieve because the probability distribution $\alpha_h$ is not usually known a priori. As a result, we assume that the distribution will be learned empirically, and to do so we will need to draw samples from the purified Gibbs states used as input. This sampling procedure will incur errors. To take such errors into account, assume that we can obtain estimates $T_h$ of (B.24) with precision $\delta_t$, i.e.,

\[ \left| T_h - \mathrm{Tr}(\rho \tilde{H}_h) \right| \leq \delta_t. \tag{B.25} \]

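Under this assumption we can now bound the distance $|\alpha_h - \tilde{\alpha}_h|$ in the following way. Observe that

\begin{align}
|\alpha_h - \tilde{\alpha}_h| &= \left| \frac{e^{-\mathrm{Tr}(\rho\tilde{H}_h)}}{\sum_h e^{-\mathrm{Tr}(\rho\tilde{H}_h)}} - \frac{T_h}{\sum_h T_h} \right| \nonumber \\
&\leq \left| \frac{e^{-\mathrm{Tr}(\rho\tilde{H}_h)}}{\sum_h e^{-\mathrm{Tr}(\rho\tilde{H}_h)}} - \frac{T_h}{\sum_h e^{-\mathrm{Tr}(\rho\tilde{H}_h)}} \right| + \left| \frac{T_h}{\sum_h e^{-\mathrm{Tr}(\rho\tilde{H}_h)}} - \frac{T_h}{\sum_h T_h} \right|, \tag{B.26}
\end{align}

and we hence need to bound the following two quantities in order to bound the error. First, we need a bound on

\[ \left| e^{-\mathrm{Tr}(\rho\tilde{H}_h)} - e^{-T_h} \right|. \tag{B.27} \]

For this, let $f(s) := T_h (1 - s) + \mathrm{Tr}(\rho\tilde{H}_h) s$, such that eq. (B.27) can be rewritten as

\begin{align}
\left| e^{-f(1)} - e^{-f(0)} \right| &= \left| \int_0^1 \frac{d}{ds} e^{-f(s)}\, ds \right| \nonumber \\
&= \left| \int_0^1 \dot{f}(s) e^{-f(s)}\, ds \right| \nonumber \\
&= \left| \int_0^1 \left( \mathrm{Tr}(\rho\tilde{H}_h) - T_h \right) e^{-f(s)}\, ds \right| \nonumber \\
&\leq \delta e^{-\min_s f(s)} \nonumber \\
&\leq \delta e^{-\mathrm{Tr}(\rho\tilde{H}_h) + \delta}, \tag{B.28}
\end{align}

and assuming $\delta \leq \log(2)$, this reduces to

\[ \left| e^{-f(1)} - e^{-f(0)} \right| \leq 2\delta e^{-\mathrm{Tr}(\rho\tilde{H}_h)}. \tag{B.29} \]

Second, we need the fact that

\[ \left| \sum_h e^{-\mathrm{Tr}(\rho\tilde{H}_h)} - \sum_h T_h \right| \leq 2\delta \sum_h e^{-\mathrm{Tr}(\rho\tilde{H}_h)}. \tag{B.30} \]

Using this, eq. (B.26) can be upper bounded by

\begin{align}
&\frac{2\delta e^{-\mathrm{Tr}(\rho\tilde{H}_h)}}{\sum_h e^{-\mathrm{Tr}(\rho\tilde{H}_h)}} + |T_h| \left| \frac{1}{\sum_h e^{-\mathrm{Tr}(\rho\tilde{H}_h)}} - \frac{1}{(1 - 2\delta)\sum_h e^{-\mathrm{Tr}(\rho\tilde{H}_h)}} \right| \nonumber \\
&\quad \leq \frac{2\delta e^{-\mathrm{Tr}(\rho\tilde{H}_h)}}{\sum_h e^{-\mathrm{Tr}(\rho\tilde{H}_h)}} + \frac{4\delta |T_h|}{\sum_h e^{-\mathrm{Tr}(\rho\tilde{H}_h)}}, \tag{B.31}
\end{align}

where we used that $\delta \leq 1/4$. Note that

\begin{align}
4\delta |T_h| &\leq 4\delta \left( e^{-\mathrm{Tr}(\rho\tilde{H}_h)} + 2\delta e^{-\mathrm{Tr}(\rho\tilde{H}_h)} \right) \nonumber \\
&= e^{-\mathrm{Tr}(\rho\tilde{H}_h)} \left( 4\delta + 8\delta^2 \right) \nonumber \\
&\leq e^{-\mathrm{Tr}(\rho\tilde{H}_h)} (4\delta + 2\delta) \nonumber \\
&\leq 6\delta e^{-\mathrm{Tr}(\rho\tilde{H}_h)}, \tag{B.32}
\end{align}

which leads to a final error of

\[ |\alpha_h - \tilde{\alpha}_h| \leq 8\delta \frac{e^{-\mathrm{Tr}(\rho\tilde{H}_h)}}{\sum_h e^{-\mathrm{Tr}(\rho\tilde{H}_h)}}. \tag{B.33} \]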
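The relative error bound (B.33) can be probed numerically. The sketch below is an illustration under an explicit assumption: the perturbed weights $\tilde{\alpha}_h$ are formed as a Gibbs distribution over the energy estimates, i.e., $\tilde{\alpha}_h \propto e^{-T_h}$ with $|T_h - \mathrm{Tr}(\rho\tilde{H}_h)| \leq \delta$, and the exact energies are replaced by a plain list of numbers.

```python
import math


def gibbs_weights(values):
    """Normalized weights exp(-v) / sum exp(-v), computed stably."""
    m = min(values)
    w = [math.exp(-(v - m)) for v in values]
    z = sum(w)
    return [x / z for x in w]


def max_relative_deviation(energies, estimates):
    """max_h |alpha_h - alpha~_h| / alpha_h for exact and estimated energies."""
    alpha = gibbs_weights(energies)
    alpha_tilde = gibbs_weights(estimates)
    return max(abs(a - b) / a for a, b in zip(alpha, alpha_tilde))
```

Under this assumption the ratio $\tilde{\alpha}_h/\alpha_h$ lies in $[e^{-2\delta}, e^{2\delta}]$, so the relative deviation is at most $e^{2\delta} - 1$, which stays below $8\delta$ for $\delta \leq 1/4$.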

With this we can now bound the error in the expectation w.r.t. the faulty distribution for some function $f(h)$ to be

\begin{align}
\left| \mathbb{E}_h(f(h)) - \tilde{\mathbb{E}}_h(f(h)) \right| &\leq 8\delta \sum_h \frac{f(h) e^{-\mathrm{Tr}(\rho\tilde{H}_h)}}{\sum_h e^{-\mathrm{Tr}(\rho\tilde{H}_h)}} \nonumber \\
&\leq 8\delta \max_h f(h). \tag{B.34}
\end{align}

We can hence use this in order to estimate the error introduced in the first term of eq. (B.15) through errors in the distribution $\{\alpha_h\}$ as

\begin{align}
\left| \mathbb{E}_h[E_{h,p}] \mathrm{Tr}(\rho v_p) - \tilde{\mathbb{E}}[E_{h,p} \mathrm{Tr}(\rho v_p)] \right| &\leq 8\delta \max_h |E_{h,p} \mathrm{Tr}(\rho v_p)| \nonumber \\
&\leq 8\delta \max_{h,p} |E_{h,p}|, \tag{B.35}
\end{align}

where we used in the last step the unitarity of $v_k$ and the von Neumann trace inequality. For a final error of $\varepsilon$, we hence choose $\delta_t = \varepsilon/[16 \max_{h,p} |E_{h,p}|]$ to ensure that this sampling error incurs at most half the error budget of $\varepsilon$. Thus we ensure $\delta \leq 1/4$ if $\varepsilon \leq 4 \max_{h,p} |E_{h,p}|$.

We can improve the query complexity of estimating the above expectation values by using amplitude amplification, since we obtain the measurement via a Hadamard test. For this case we require only $O(\max_{h,p} |E_{h,p}|/\varepsilon)$ samples in order to achieve the desired accuracy from the sampling. Noting that we might not even be able to access $\tilde{H}_h$ without any error, we can deduce that the error of the individual terms of $\tilde{H}_h$ for an $\varepsilon$-error in the final estimate must be bounded by $\delta_t/\|\theta\|_1$, where, with abuse of notation, $\delta_t$ now denotes the error in the estimates of $E_{h,k}$. Even taking this into account, the evaluation of this contribution is dominated by the second term, and can hence be neglected in the analysis.

B.3 Approach 2: Divided Differences

In this section we develop a scheme to train a quantum Boltzmann machine using divided difference estimates for the relative entropy error. The idea for this is straightforward: First, we construct an interpolating polynomial from the data. Second, an approximation of the derivative at any point can then be obtained by a direct differentiation of the interpolant. We assume in the following that we can simulate and evaluate $\mathrm{Tr}(\rho \log\sigma_v)$. As this is generally non-trivial, and the error is typically large, we propose in the next section a different, more specialised approach which, however, still allows us to train arbitrary models with the relative entropy objective.

In order to prove the error of the gradient estimation via interpolation, we first need to establish error bounds on the interpolating polynomial, which can be obtained via the remainder of the Lagrange interpolation polynomial. The gradient error for our objective can then be obtained as a combination of this error with a bound on the $(n+1)$-st order derivative of the objective. We start by bounding the error in the polynomial approximation.

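Lemma 18. Let $f(\theta)$ be the $(n+1)$-times differentiable function for which we want to approximate the gradient and let $p_n(\theta)$ be the degree $n$ Lagrange interpolation polynomial for points $\{\theta_1, \theta_2, \ldots, \theta_k, \ldots, \theta_n\}$. The gradient evaluated at point $\theta_k$ is then given by the interpolation polynomial

\[ \frac{\partial p(\theta_k)}{\partial\theta} = \sum_{j=0}^{n} f(\theta_j) L'_{n,j}(\theta_k), \tag{B.36} \]

where $L'_{n,j}$ is the derivative of the Lagrange interpolation polynomials $L_{\mu,j}(\theta) := \prod_{k=0, k\neq j}^{\mu} \frac{\theta - \theta_k}{\theta_j - \theta_k}$, and the error is given by

\[ \left| \frac{\partial f(\theta_k)}{\partial\theta} - \frac{\partial p_n(\theta_k)}{\partial\theta} \right| \leq \frac{1}{(n+1)!} \left| f^{(n+1)}(\xi(\theta_k)) \prod_{j=0, j\neq k}^{n} (\theta_j - \theta_k) \right|, \tag{B.37} \]

where $\xi(\theta_k)$ is a constant depending on the point $\theta_k$ at which we evaluated the gradient, and $f^{(i)}$ denotes the $i$-th derivative of $f$.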

Note that θ is a point within the set of points at which we evaluate.

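Proof. Recall that the error for the degree $n$ Lagrange interpolation polynomial is given by

\[ |f(\theta) - p_n(\theta)| \leq \frac{1}{(n+1)!} f^{(n+1)}(\xi_\theta) w(\theta), \tag{B.38} \]

where $w(\theta) := \prod_{j=1}^{n} (\theta - \theta_j)$. We want to estimate the gradient of this, and hence need to evaluate

\[ \left| \frac{\partial f(\theta)}{\partial\theta} - \frac{\partial p_n(\theta)}{\partial\theta} \right| \leq \lim_{\Delta\to 0} \left| \frac{\frac{1}{(n+1)!} f^{(n+1)}(\xi_{\theta+\Delta}) w(\theta + \Delta) - \frac{1}{(n+1)!} f^{(n+1)}(\xi_\theta) w(\theta)}{\Delta} \right|. \tag{B.39} \]

Now, since we do not necessarily want to estimate the gradient at an arbitrary point $\theta$ but indeed have the freedom to choose the point, we can set $\theta$ to be one of the points at which we evaluate the function $f(\theta)$, i.e., $\theta \in \{\theta_i\}_{i=1}^{n}$. Let this choice be given by $\theta_k$, arbitrarily chosen. Then we see that the latter term vanishes since $w(\theta_k) = 0$. Therefore we have

\[ \left| \frac{\partial f(\theta_k)}{\partial\theta} - \frac{\partial p_n(\theta_k)}{\partial\theta} \right| \leq \lim_{\Delta\to 0} \left| \frac{\frac{1}{(n+1)!} f^{(n+1)}(\xi_{\theta_k+\Delta}) w(\theta_k + \Delta)}{\Delta} \right|, \tag{B.40} \]

and noting that $w(\theta_k + \Delta)$ contains one term $(\theta_k + \Delta - \theta_k) = \Delta$ yields the claimed result.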
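The derivative formula (B.36) and the error bound (B.37) can be illustrated with a short classical computation. The sketch below evaluates the weights $L'_{n,j}(\theta_k)$ in closed form at an interpolation node and uses them to approximate $f'(\theta_k)$ from function values only; the function names are hypothetical and the nodes are indexed from zero.

```python
def lagrange_derivative_weights(nodes, k):
    """Return the list of L'_j(nodes[k]) for all Lagrange basis polynomials j."""
    xk = nodes[k]
    weights = []
    for j, xj in enumerate(nodes):
        if j == k:
            # Derivative of L_k at its own node: sum of inverse node distances.
            w = sum(1.0 / (xk - xm) for m, xm in enumerate(nodes) if m != k)
        else:
            # Only the term that differentiates the factor (x - x_k) survives.
            w = 1.0 / (xj - xk)
            for m, xm in enumerate(nodes):
                if m != j and m != k:
                    w *= (xk - xm) / (xj - xm)
        weights.append(w)
    return weights


def interpolated_derivative(f, nodes, k):
    """Approximate f'(nodes[k]) by differentiating the Lagrange interpolant."""
    weights = lagrange_derivative_weights(nodes, k)
    return sum(w * f(x) for w, x in zip(weights, nodes))
```

For a polynomial whose degree is below the number of nodes the result is exact up to rounding; for a general smooth function the error is controlled by the next higher derivative and the product of node distances, as in (B.37).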

We will perform a number of approximation steps in order to obtain a form which can be simulated on a quantum computer more efficiently, and only then resort to divided differences at this "lower level". In detail, as described in the body of the thesis, we perform the following steps in order to obtain the gradient.

1. Approximate the logarithm via a Fourier-like approximation

\[ \log\sigma_v \to \log_{K,M}\sigma_v, \tag{B.41} \]

which yields a Fourier-like series $\sum_m c_m \exp(i m \pi \sigma_v)$.

2. Evaluate the gradient of $\mathrm{Tr}\left( \frac{\partial}{\partial\theta} \rho \log_{K,M}(\sigma_v) \right)$, yielding terms of the form

\[ \int_0^1 ds\, e^{(i s m \pi \sigma_v)} \frac{\partial\sigma_v}{\partial\theta} e^{(i(1-s) m \pi \sigma_v)}. \tag{B.42} \]

3. Each term in this expansion can be evaluated separately via a sampling procedure, i.e.,

\[ \int_0^1 ds\, e^{(i s m \pi \sigma_v)} \frac{\partial\sigma_v}{\partial\theta} e^{(i(1-s) m \pi \sigma_v)} \approx \mathbb{E}_s\left[ e^{(i s m \pi \sigma_v)} \frac{\partial\sigma_v}{\partial\theta} e^{(i(1-s) m \pi \sigma_v)} \right]. \tag{B.43} \]

4. Apply a divided difference scheme to approximate the gradient $\frac{\partial\sigma_v}{\partial\theta}$.

5. Use the Fourier series approach to approximate the density operator $\sigma_v$ by the series of itself, i.e., $\sigma_v \approx F(\sigma_v) := \sum_{m'} c_{m'} \exp(i \pi m' \sigma_v)$.

6. Evaluate these terms conveniently via sample based Hamiltonian simulation and the Hadamard test.

In the following we will give concrete bounds on the error introduced by the approximations and details of the implementation. The final result is then stated in Theorem 13. We first bound the error in the approximation of the logarithm and then use Lemma 37 of [175] to obtain a Fourier series approximation which is close to $\log(z)$. The Taylor series of the logarithm is

\[ \log(x) = \sum_{k=1}^{\infty} (-1)^{k+1} \frac{(x-1)^k}{k} = \sum_{k=1}^{K_1} (-1)^{k+1} \frac{(x-1)^k}{k} + R_{K_1+1}(x - 1), \]

for $x \in (0,1)$ and where $R_{K_1+1}(z) = \frac{f^{(K_1+1)}(c)}{K_1!} (z - c)^{K_1} z$ is the Cauchy remainder of the Taylor series, for $-1 < z < 0$. The error can hence be bounded as

\[ |R_{K_1+1}(z)| = \left| (-1)^{K_1} \frac{z^{K_1+1} (1 - \alpha)^{K_1}}{(1 + \alpha z)^{K_1+1}} \right|, \]

where we evaluated the derivatives of the logarithm and $0 \leq \alpha \leq 1$ is a parameter. Using that $1 + \alpha z \geq 1 + z$ (since $z \leq 0$) and hence $0 \leq \frac{1-\alpha}{1+\alpha z} \leq 1$, we obtain the error bound

\[ |R_{K_1+1}(z)| \leq \frac{|z|^{K_1+1}}{1 + z}. \tag{B.44} \]

Reversing to the variable $x$ the error bound for the Taylor series, and assuming that $0 < \delta_l < z$ and $0 < |1 - z| \leq \delta_u < 1$, which is justified if we are dealing with sufficiently mixed states, the approximation error is given by

\[ |R_{K_1+1}(z)| \leq \frac{(\delta_l)^{K_1+1}}{\delta_u} \overset{!}{\leq} \varepsilon_1. \tag{B.45} \]

Hence in order to achieve the desired error $\varepsilon_1$ we need

\[ K_1 \geq \frac{\log\left( (\varepsilon_1 \delta_u)^{-1} \right)}{\log\left( (\delta_l)^{-1} \right)}. \]
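The remainder bound (B.44) can be verified numerically for the truncated series of $\log(1 + z)$ with $-1 < z < 0$. The sketch below compares the truncation error with the bound $|z|^{K_1+1}/(1+z)$ and searches for the smallest truncation order that meets a target accuracy; the function names are hypothetical.

```python
import math


def log1p_taylor(z, order):
    """Truncated Taylor series sum_{k=1}^{order} (-1)^(k+1) z^k / k of log(1 + z)."""
    return sum((-1) ** (k + 1) * z ** k / k for k in range(1, order + 1))


def remainder_bound(z, order):
    """Bound |z|^(order + 1) / (1 + z) on the remainder, as in (B.44)."""
    return abs(z) ** (order + 1) / (1.0 + z)


def smallest_order(z, eps):
    """Smallest truncation order whose remainder bound is at most eps."""
    order = 1
    while remainder_bound(z, order) > eps:
        order += 1
    return order
```

The required order grows only logarithmically in $1/\varepsilon_1$, but it increases sharply as $z$ approaches $-1$, which corresponds to eigenvalues of the state close to zero.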

We can hence choose $K_1$ such that the error in the approximation of the Taylor series is $\leq \varepsilon_1/4$. This implies we can make use of Lemma 37 of [175], and therefore obtain a Fourier series approximation for the logarithm. We restate this lemma here for completeness:

Lemma 19 (Lemma 37, [175]). Let $f : \mathbb{R} \to \mathbb{C}$ and $\delta, \varepsilon \in (0,1)$, and let $T(f) := \sum_{k=0}^{K} a_k x^k$ be a polynomial such that $|f(x) - T(f)| \leq \varepsilon/4$ for all $x \in [-1+\delta, 1-\delta]$. Then $\exists c \in \mathbb{C}^{2M+1}$:

\[ \left| f(x) - \sum_{m=-M}^{M} c_m e^{\frac{i\pi m}{2} x} \right| \leq \varepsilon \tag{B.46} \]

for all $x \in [-1+\delta, 1-\delta]$, where $M = \max\left( 2\left\lceil \ln\left( \frac{4\|a\|_1}{\varepsilon} \right) \frac{1}{\delta} \right\rceil, 0 \right)$ and $\|c\|_1 \leq \|a\|_1$. Moreover, $c$ can be efficiently calculated on a classical computer in time $\mathrm{poly}(K, M, \log(1/\varepsilon))$.

In order to apply this lemma to our case, we restrict the approximation to the range $(\delta_l, \delta_u)$, where $0 < \delta_l \leq \delta_u < 1$. Therefore we obtain over this range an approximation of the following form.

Corollary 16. Let $f : \mathbb{R} \to \mathbb{C}$ be defined as $f(x) = \log(x)$, $\delta, \varepsilon_1 \in (0,1)$, and $\log_K(1 - x) := \sum_{k=1}^{K_1} \frac{(-1)^{k-1}}{k} x^k$ such that $a_k := \frac{(-1)^{k-1}}{k}$ and $\|a\|_1 = \sum_{k=1}^{K_1} \frac{1}{k}$, with $K_1 \geq \frac{\log\left( 4(\varepsilon_1 \delta_u)^{-1} \right)}{\log\left( (\delta_l)^{-1} \right)}$ such that $|\log(x) - \log_K(x)| \leq \varepsilon_1/4$ for all $x \in [\delta_l, \delta_u]$. Then $\exists c \in \mathbb{C}^{2M+1}$:

\[ \left| f(x) - \sum_{m=-M_1}^{M_1} c_m e^{\frac{i\pi m}{2} x} \right| \leq \varepsilon_1 \tag{B.47} \]

for all $x \in [\delta_l, \delta_u]$, where $M_1 = \max\left( 2\left\lceil \ln\left( \frac{4\|a\|_1}{\varepsilon_1} \right) \frac{1}{1-\delta_u} \right\rceil, 0 \right)$ and $\|c\|_1 \leq \|a\|_1$. Moreover, $c$ can be efficiently calculated on a classical computer in time $\mathrm{poly}(K_1, M_1, \log(1/\varepsilon_1))$.

Proof. The proof follows directly by combining Lemma 19 with the approximation of the logarithm and the range over which we want to approximate the function.

In the following we denote $\log_{K,M}(x) := \sum_{m=-M_1}^{M_1} c_m e^{\frac{i\pi m}{2} x}$, where we keep the $K$-subscript to denote that the classical computation of this approximation is $\mathrm{poly}(K)$-dependent. We can now express the gradient of the objective via this approximation as

\[ \mathrm{Tr}\left( \frac{\partial}{\partial\theta} \rho \log_{K,M}\sigma_v \right) \approx \sum_{m=-M_1}^{M_1} \frac{i c_m m \pi}{2} \int_0^1 ds\, \mathrm{Tr}\left( \rho\, e^{\frac{i s \pi m}{2}\sigma_v} \frac{\partial\sigma_v}{\partial\theta} e^{\frac{i(1-s)\pi m}{2}\sigma_v} \right), \tag{B.48} \]

where we can evaluate each term in the sum individually and then classically post-process the results, i.e., sum these up. In particular, the latter can be evaluated as the expectation value over $s$, i.e.,

\[ \int_0^1 ds\, \mathrm{Tr}\left( \rho\, e^{\frac{i s \pi m}{2}\sigma_v} \frac{\partial\sigma_v}{\partial\theta} e^{\frac{i(1-s)\pi m}{2}\sigma_v} \right) = \mathbb{E}_{s\in[0,1]}\left[ \mathrm{Tr}\left( \rho\, e^{\frac{i s \pi m}{2}\sigma_v} \frac{\partial\sigma_v}{\partial\theta} e^{\frac{i(1-s)\pi m}{2}\sigma_v} \right) \right], \tag{B.49} \]

which we can evaluate separately on a quantum device. In the following we hence need to devise a method to evaluate this expectation value.

First, we will expand the gradient using a divided difference formula such that $\frac{\partial\sigma_v}{\partial\theta}$ is approximated by the Lagrange interpolation polynomial of degree $\mu - 1$, i.e.,

\[ \frac{\partial\sigma_v}{\partial\theta}(\theta) \approx \sum_{j=0}^{\mu} \sigma_v(\theta_j) L'_{\mu,j}(\theta), \]

where

\[ L_{\mu,j}(\theta) := \prod_{k=0, k\neq j}^{\mu} \frac{\theta - \theta_k}{\theta_j - \theta_k}. \]

Note that the order $\mu$ is free to choose, and will guarantee a different error in the solution of the gradient estimate, as described previously in Lemma 18. Using this in the gradient estimation, we obtain a polynomial of the form (evaluated at $\theta_j$, i.e., the chosen points)

\[ \sum_{m=-M_1}^{M_1} \sum_{j=0}^{\mu} \frac{i c_m m \pi}{2} L'_{\mu,j}(\theta_j)\, \mathbb{E}_{s\in[0,1]}\left[ \mathrm{Tr}\left( \rho\, e^{\frac{i s \pi m}{2}\sigma_v} \sigma_v(\theta_j) e^{\frac{i(1-s)\pi m}{2}\sigma_v} \right) \right], \tag{B.50} \]

where each term can again be evaluated separately, and efficiently combined via classical post-processing. Note that the error in the Lagrange interpolation polynomial decreases exponentially fast, and therefore the number of terms we use is sufficiently small to do so. Next, we need to deploy a method to evaluate the above expressions. In order to do so, we implement $\sigma_v$ as a Fourier series of itself, i.e., $\sigma_v = \arcsin(\sin(\sigma_v \pi/2))/(\pi/2)$, which we will then approximate similarly to the approach taken in Lemma 19. With this we obtain the following result.

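Lemma 20. Let $\delta, \varepsilon_2 \in (0,1)$, and $\tilde{x} := \sum_{m'=-M_2}^{M_2} \tilde{c}_{m'} e^{i\pi m' x/2}$ with $K_2 \geq \frac{\log(4/\varepsilon_2)}{\log(\delta_u^{-1})}$ and $M_2 \geq \left\lceil \log\left( \frac{4}{\varepsilon_2} \right) \sqrt{(2 \log\delta_u^{-1})^{-1}} \right\rceil$ and $x \in [\delta_l, \delta_u]$. Then $\exists \tilde{c} \in \mathbb{C}^{2M+1}$:

\[ |x - \tilde{x}| \leq \varepsilon_2 \tag{B.51} \]

for all $x \in [\delta_l, \delta_u]$, and $\|c\|_1 \leq 1$. Moreover, $\tilde{c}$ can be efficiently calculated on a classical computer in time $\mathrm{poly}(K_2, M_2, \log(1/\varepsilon_2))$.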

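Proof. Invoking the technique used in [175], we expand

\[ \arcsin(z) = \sum_{k'=0}^{K_2} 2^{-2k'} \binom{2k'}{k'} \frac{z^{2k'+1}}{2k'+1} + R_{K_2+1}(z), \]

where $R_{K_2+1}$ is the remainder as before. For $0 < z \leq \delta_u \leq 1/2$, the remainder can be bounded by $|R_{K_2+1}| \leq \frac{|\delta_u|^{K_2+1}}{1/2} \overset{!}{\leq} \varepsilon_2/2$, which gives the bound

\[ K_2 \geq \frac{\log(4/\varepsilon_2)}{\log(\delta_u^{-1})}. \]

We then approximate

\[ \sin^l(x) = \left( \frac{i}{2} \right)^l \sum_{m'=0}^{l} (-1)^{m'} \binom{l}{m'} e^{i x (2m' - l)} \tag{B.52} \]

by

\[ \sin^l(x) \approx \left( \frac{i}{2} \right)^l \sum_{m'=\lceil l/2 \rceil - M_2}^{\lfloor l/2 \rfloor + M_2} (-1)^{m'} \binom{l}{m'} e^{i x (2m' - l)}, \tag{B.53} \]

which induces an error of $\varepsilon_2/2$ for the choice

\[ M_2 \geq \left\lceil \log\left( \frac{4}{\varepsilon_2} \right) \sqrt{(2 \log\delta_u^{-1})^{-1}} \right\rceil. \]

This can be seen by using Chernoff's inequality for sums of binomial coefficients, i.e.,

\[ \sum_{m'=\lceil l/2 + M_2 \rceil}^{l} 2^{-l} \binom{l}{m'} \leq e^{-\frac{2 M_2^2}{l}}, \]

and choosing $M_2$ appropriately. Finally, defining $f(z) := \arcsin(\sin(z\pi/2))/(\pi/2)$, as well as $\tilde{f}_1 := \sum_{k'=0}^{K_2} b_{k'} \sin^{2k'+1}(z\pi/2)$ and

\[ \tilde{f}_2(z) := \sum_{k'=0}^{K_2} b_{k'} \left( \frac{i}{2} \right)^l \sum_{m'=\lceil l/2 \rceil - M_2}^{\lfloor l/2 \rfloor + M_2} (-1)^{m'} \binom{l}{m'} e^{i x (2m' - l)}, \tag{B.54} \]

and observing that

\[ \left\| f - \tilde{f}_2 \right\|_\infty \leq \left\| f - \tilde{f}_1 \right\|_\infty + \left\| \tilde{f}_1 - \tilde{f}_2 \right\|_\infty \]

yields the final error of $\varepsilon_2$ for the approximation $z \approx \tilde{z} = \sum_{m'} \tilde{c}_{m'} e^{i\pi m' z/2}$.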
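The exponential expansion (B.52) and its truncation (B.53) can be checked numerically. The sketch below evaluates the full and the truncated sum for a scalar argument and compares the truncation error against twice the Chernoff-type tail bound $e^{-2M_2^2/l}$ (one tail on each side); the function name and the clipping of the summation range to $[0, l]$ are choices made here for illustration.

```python
import cmath
import math


def sin_power_fourier(x, l, M=None):
    """Evaluate (i/2)^l * sum_m (-1)^m C(l, m) exp(i x (2m - l)).

    With M=None the full sum over m = 0..l is used, as in (B.52); otherwise only
    ceil(l/2) - M <= m <= floor(l/2) + M is kept, as in (B.53).
    """
    lo, hi = 0, l
    if M is not None:
        lo = max(0, math.ceil(l / 2) - M)
        hi = min(l, math.floor(l / 2) + M)
    total = sum((-1) ** m * math.comb(l, m) * cmath.exp(1j * x * (2 * m - l))
                for m in range(lo, hi + 1))
    return (1j / 2) ** l * total
```

The dropped terms carry binomial weights far from the centre $l/2$, which is why a window of half-width $M_2$ of order $\sqrt{l \log(1/\varepsilon_2)}$ suffices.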

Note that this immediately leads to an ε2 error in the spectral norm for the approximation M 2 0 iπm σv/2 σv − ∑ c˜m0 e ≤ ε2, (B.55) 0 m =−M2 2 where σv is the reduced density matrix.

Since our final goal is to estimate Tr(∂θ ρ logσv), with a variety of σv(θ j) using the divided difference approach, we also need to bound the error in this estimate which we introduce with the above approximations. Bounding the derivative with respect to the remainder can be done by using the truncated series expansion and bounding the gradient of the remainder. This yields the following result.

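Lemma 21. For the choice of the parameters $M_1, M_2, K_1, L, \mu, \Delta, s$ given in eq. (B.89-B.96), and $\rho, \sigma_v$ being two density matrices, we can estimate the gradient of the relative entropy such that

\[
\left| \mathrm{Tr}(\partial_\theta\, \rho \log \sigma_v) - \mathrm{Tr}(\partial_\theta\, \rho \log_{K_1,M_1} \tilde\sigma_v) \right| \le \epsilon, \tag{B.56}
\]

where the function $\mathrm{Tr}(\partial_\theta\, \rho \log_{K_1,M_1} \tilde\sigma_v)$ evaluated at $\theta$ is defined as

\[
\mathrm{Re}\left[ \sum_{m=-M_1}^{M_1} \sum_{m'=-M_2}^{M_2} \sum_{j=0}^{\mu} \frac{i c_m \tilde c_{m'} m \pi}{2}\, L'_{\mu,j}(\theta)\, \mathbb{E}_{s\in[0,1]}\left[ \mathrm{Tr}\left( \rho\, e^{\frac{is\pi m}{2}\sigma_v}\, e^{\frac{i\pi m'}{2}\sigma_v(\theta_j)}\, e^{\frac{i(1-s)\pi m}{2}\sigma_v} \right) \right] \right]. \tag{B.57}
\]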

The gradient can hence be approximated to error ε with O(poly(M1,M2,K1,L,s,∆, µ)) computation on a classical computer and using only the Hadamard test, Gibbs state preparation and LCU on a quantum device.

Notably, the expression in (B.57) can now be evaluated with a quantum-classical hybrid device: each term in the trace is evaluated separately via a Hadamard test and, since the number of terms is only polynomial, the whole sum is then evaluated efficiently on a classical device.

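Proof. For the proof we perform the following steps. Let $\sigma_i(\rho)$ be the singular values of $\rho$, which are equivalently the eigenvalues since $\rho$ is Hermitian. Then observe that the gradient can be separated into different terms, i.e., let $\log^s_{K_1,M_1}\tilde\sigma_v$ be the approximation as given in (B.57) for a finite sample of the expectation values $\mathbb{E}_s$; then we have

\[
\begin{aligned}
&\left| \mathrm{Tr}(\partial_\theta\, \rho \log \sigma_v) - \mathrm{Tr}(\partial_\theta\, \rho \log^s_{K_1,M_1} \tilde\sigma_v) \right| \\
&\quad \le \sum_i \sigma_i(\rho) \cdot \left\| \partial_\theta [\log \sigma_v - \log^s_{K_1,M_1} \tilde\sigma_v] \right\| \\
&\quad \le \sum_i \sigma_i(\rho) \cdot \Big( \left\| \partial_\theta [\log \sigma_v - \log_{K_1,M_1} \sigma_v] \right\| \\
&\qquad + \left\| \partial_\theta [\log_{K_1,M_1} \sigma_v - \log_{K_1,M_1} \tilde\sigma_v] \right\| + \left\| \partial_\theta [\log_{K_1,M_1} \tilde\sigma_v - \log^s_{K_1,M_1} \tilde\sigma_v] \right\| \Big),
\end{aligned} \tag{B.58}
\]

where the second step follows from the Von-Neumann trace inequality and the terms are (1) the error in approximating the logarithm, (2) the error introduced by the divided difference and the approximation of $\sigma_v$ as a Fourier-like series, and (3) the finite sampling approximation error. We can now bound the different terms separately, and start with the first part, which is in general harder to estimate. We partition the bound into three terms, corresponding to the three different approximations taken above.

\[
\begin{aligned}
\left\| \partial_\theta [\log \sigma_v - \log_{K_1,M_1} \sigma_v] \right\|
&\le \left\| \partial_\theta \sum_{k=K_1+1}^{\infty} \frac{(-1)^k}{k} \sigma_v^k \right\| + \left\| \partial_\theta \sum_{k=1}^{K_1} \frac{(-1)^k}{k} \sum_{l=L}^{\infty} b_l^{(k)} \sin^l(\sigma_v\pi/2) \right\| \\
&\quad + \left\| \partial_\theta \sum_{k=1}^{K_1} \frac{(-1)^k}{k} \sum_{l=L}^{\infty} b_l^{(k)} \left(\frac{i}{2}\right)^l \sum_{m\in[0,\lceil l/2\rceil - M_1]\cup[\lfloor l/2\rfloor + M_1, l]} (-1)^m e^{i(2m-l)\sigma_v\pi/2} \right\|
\end{aligned}
\]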

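The first term can be bounded in the following way:

\[
\le \sum_{k=K_1+1}^{\infty} \|\sigma_v\|^{k-1} = \frac{\|\sigma_v\|^{K_1}}{1 - \|\sigma_v\|}, \tag{B.59}
\]

and, assuming $\|\sigma_v\| < 1$, we can hence set

\[
K_1 \ge \log((1 - \|\sigma_v\|)\epsilon/9)/\log(\|\sigma_v\|) \tag{B.60}
\]

appropriately in order to achieve an $\epsilon/9$ error. The second term can be bounded by assuming that $\|\sigma_v\pi\| < 1$, and choosing

\[
L \ge \frac{\log\left( \frac{\epsilon}{9 K_1 \pi \left\| \frac{\partial\sigma_v}{\partial\theta} \right\|} \right)}{\log(\|\sigma_v\|\pi)},
\]

which we derive by observing that

\begin{align}
&\le \sum_{k=1}^{K_1} \frac{1}{k} \sum_{l=L}^{\infty} b_l^{(k)}\, l \left\| \sin^{l-1}(\sigma_v\pi/2) \right\| \cdot \frac{\pi}{2} \left\| \frac{\partial\sigma_v}{\partial\theta} \right\| \tag{B.61} \\
&< \sum_{k=1}^{K_1} \frac{1}{k} \sum_{l=L+1}^{\infty} b_l^{(k)}\, \pi \|\sigma_v\pi\|^{l-1} \cdot \left\| \frac{\partial\sigma_v}{\partial\theta} \right\| \tag{B.62} \\
&\le \sum_{k=1}^{K_1} \frac{1}{k}\, \pi \|\sigma_v\pi\|^{L} \cdot \left\| \frac{\partial\sigma_v}{\partial\theta} \right\|, \tag{B.63}
\end{align}

where we used in the second step that $l < 2^l$. Finally, the last term can be bounded similarly, which yields

\begin{align}
&\le \sum_{k=1}^{K_1} \frac{1}{k} \sum_{l=1}^{L} b_l^{(k)} e^{-2(M_1)^2/l} \cdot l \cdot \frac{\pi}{2} \left\| \frac{\partial\sigma_v}{\partial\theta} \right\| \tag{B.64} \\
&\le \sum_{k=1}^{K_1} \frac{L}{k} \sum_{l=1}^{L} b_l^{(k)} e^{-2(M_1)^2/L}\, \frac{\pi}{2} \left\| \frac{\partial\sigma_v}{\partial\theta} \right\| \tag{B.65} \\
&\le \sum_{k=1}^{K_1} \frac{L}{k} e^{-2(M_1)^2/L}\, \frac{\pi}{2} \left\| \frac{\partial\sigma_v}{\partial\theta} \right\| \le \frac{K_1 L \pi}{2} e^{-2(M_1)^2/L} \left\| \frac{\partial\sigma_v}{\partial\theta} \right\|, \tag{B.66}
\end{align}

and we can hence choose

\[
M_1 \ge \sqrt{ L \log\left( \frac{9 \left\| \frac{\partial\sigma_v}{\partial\theta} \right\| K_1 L \pi}{2\epsilon} \right) }
\]

in order to decrease the error to $\epsilon/3$ for the first term in (B.58). For the second term, first note that with the notation we chose, $\left\| \partial_\theta [\log_{K_1,M_1} \sigma_v - \log_{K_1,M_1} \tilde\sigma_v] \right\|$ is the difference between the log-approximation where the gradient of $\sigma_v$ is still exact, i.e., (B.48), and the version where we approximate the gradient via divided differences and the linear combination of unitaries, given in (B.57). Recall that the first level approximation was given by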

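\[
\sum_{m=-M_1}^{M_1} \frac{i c_m m \pi}{2} \int_0^1 ds\, \mathrm{Tr}\left( \rho\, e^{\frac{is\pi m}{2}\sigma_v}\, \frac{\partial\sigma_v}{\partial\theta}\, e^{\frac{i(1-s)\pi m}{2}\sigma_v} \right),
\]

where we went from the expectation value formulation back to the integral formulation to avoid consideration of potential errors due to sampling.

Bounding the difference hence yields one term from the divided difference approximation of the gradient and an error from the Fourier series, which we can both bound separately. Denoting with $\partial\tilde p(\theta_k)/\partial\theta$ the divided difference and the LCU approximation of the Fourier series (which effectively means that we approximate the coefficients of the interpolation polynomial), and with $\partial p(\theta_k)/\partial\theta$ the divided difference without approximation via the Fourier series, we hence have

\begin{align}
&\left\| \partial_\theta [\log_{K_1,M_1} \sigma_v - \log_{K_1,M_1} \tilde\sigma_v] \right\| \le \tag{B.67} \\
&\le \left| \sum_{m=-M_1}^{M_1} \frac{i c_m m \pi}{2} \int_0^1 ds\, \mathrm{Tr}\left( \rho\, e^{\frac{is\pi m}{2}\sigma_v} \left( \frac{\partial\sigma_v}{\partial\theta} - \frac{\partial\tilde p(\theta_k)}{\partial\theta} \right) e^{\frac{i(1-s)\pi m}{2}\sigma_v} \right) \right| \tag{B.68} \\
&\le \frac{M_1 \pi \|a\|_1}{2} \int_0^1 ds \sum_i \sigma_i(\rho) \left\| \frac{\partial\sigma_v}{\partial\theta} - \frac{\partial\tilde p(\theta_k)}{\partial\theta} \right\| \tag{B.69} \\
&\le \frac{M_1 \pi \|a\|_1}{2} \int_0^1 ds \sum_i \sigma_i(\rho) \left( \left\| \frac{\partial\sigma_v}{\partial\theta} - \frac{\partial p(\theta_k)}{\partial\theta} \right\| + \left\| \frac{\partial p(\theta_k)}{\partial\theta} - \frac{\partial\tilde p(\theta_k)}{\partial\theta} \right\| \right) \tag{B.70} \\
&\le \frac{M_1 \|a\|_1 \pi}{2} \int_0^1 ds \sum_i \sigma_i(\rho) \left( \left\| \frac{\partial^{\mu+1}\sigma_v}{\partial\theta^{\mu+1}} \right\| \left( \frac{\Delta}{\mu-1} \right)^{\mu} \frac{\max_k (\mu-k)!}{(\mu+1)!} + \sum_{j=0}^{\mu} |L'_{\mu,j}(\theta_j)|\, \|\sigma_v - \tilde\sigma_v\| \right) \tag{B.71} \\
&\le \frac{M_1 \|a\|_1 \pi}{2} \left( \left\| \frac{\partial^{\mu+1}\sigma_v}{\partial\theta^{\mu+1}} \right\| \left( \frac{\Delta}{\mu-1} \right)^{\mu} \frac{\mu!}{(\mu+1)!} + \mu \left\| L'_{\mu,j}(\theta_j) \right\|_\infty \epsilon_2 \right), \tag{B.72}
\end{align}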

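where $\|a\|_1 = \sum_{k=1}^{K_1} 1/k$, and we used in the last step the results of Lemma 20. Under appropriate assumptions on the grid spacing $\Delta$ for the divided difference scheme and the number of evaluated points $\mu$, as well as a bound on the $(\mu+1)$-st derivative of $\sigma_v$ w.r.t. $\theta$, we can hence also bound this error. In order to do so, we need to analyze the $(\mu+1)$-st derivative of $\sigma_v = \mathrm{Tr}_h[e^{-H}]/Z$ with $Z = \mathrm{Tr}[e^{-H}]$. For this we have

\[
\begin{aligned}
\left\| \frac{\partial^{\mu+1}\sigma_v}{\partial\theta^{\mu+1}} \right\|
&\le \sum_{p=1}^{\mu+1} \binom{\mu+1}{p} \left\| \frac{\partial^p\, \mathrm{Tr}_h[e^{-H}]}{\partial\theta^p} \right\| \left| \frac{\partial^{\mu+1-p} Z^{-1}}{\partial\theta^{\mu+1-p}} \right| \\
&\le 2^{\mu+1} \max_p \left\| \frac{\partial^p\, \mathrm{Tr}_h[e^{-H}]}{\partial\theta^p} \right\| \left| \frac{\partial^{\mu+1-p} Z^{-1}}{\partial\theta^{\mu+1-p}} \right|
\end{aligned} \tag{B.73}
\]

We have that

\[
\left\| \frac{\partial^p\, \mathrm{Tr}_h[e^{-H}]}{\partial\theta^p} \right\| \le \dim(\mathcal{H}_h) \left\| \frac{\partial^q e^{-H}}{\partial\theta^q} \right\| \tag{B.74}
\]

where $\dim(\mathcal{H}_h) = 2^{n_h}$. In order to bound this, we take advantage of the infinitesimal expansion of the exponent, i.e.,

\[
\begin{aligned}
\left\| \frac{\partial^q e^{-H}}{\partial\theta^q} \right\|
&= \lim_{r\to\infty} \left\| \frac{\partial^q}{\partial\theta^q} \prod_{j=1}^{r} e^{-H/r} \right\| \\
&= \lim_{r\to\infty} \left\| \frac{\partial^q e^{-H/r}}{\partial\theta^q} \prod_{j=2}^{r} e^{-H/r} + \frac{\partial^{q-1} e^{-H/r}}{\partial\theta^{q-1}} \frac{\partial e^{-H/r}}{\partial\theta} \prod_{j=3}^{r} e^{-H/r} + \ldots \right\| \\
&\le \lim_{r\to\infty} \left( \left\| \frac{\partial H/r}{\partial\theta} \right\|^q \cdot r^q + O\left(\frac{1}{r}\right) \right) \left\| e^{-H} \right\| = \left\| \frac{\partial H}{\partial\theta} \right\|^q \left\| e^{-H} \right\|,
\end{aligned} \tag{B.75}
\]

where the last step follows from the fact that we have $r^q$ terms and that the error introduced by the commutations above will be of $O(1/r)$. Observing that $\partial_{\theta_i} H = \partial_{\theta_i} \sum_j \theta_j H_j = H_i$ and assuming that $\lambda_{\max}$ is the largest singular value of $H$, we can hence bound this by $\lambda_{\max}^q \|e^{-H}\|$.

\[
\left\| \frac{\partial^p\, \mathrm{Tr}_h[e^{-H}]}{\partial\theta^p} \right\| \le \dim(\mathcal{H}_h) \left\| \frac{\partial^q e^{-H}}{\partial\theta^q} \right\| \le \lambda_{\max}^p \dim(\mathcal{H}_h) \left\| \mathrm{Tr}_h[e^{-H}] \right\|, \tag{B.76}
\]

\[
\begin{aligned}
\left| \frac{\partial^{\mu+1-p} Z^{-1}}{\partial\theta^{\mu+1-p}} \right|
&\le \frac{(\mu+1-p)!\, |\lambda_{\max}|^{\mu+1-p}}{Z^{\mu+2-p}}\, \mathrm{Tr}[e^{-H}] \\
&\le \left( \frac{\mu+1-p}{eZ} \right)^{\mu+1-p} \frac{e}{Z}\, |\lambda_{\max}|^{\mu+1-p}\, \mathrm{Tr}[e^{-H}] \\
&= \left( \frac{\mu+1-p}{eZ} \right)^{\mu+1-p} e\, |\lambda_{\max}|^{\mu+1-p}
\end{aligned} \tag{B.77}
\]

We can therefore find a bound for (B.73) as

\[
\left\| \frac{\partial^{\mu+1}\sigma_v}{\partial\theta^{\mu+1}} \right\| \le e\, 2^{\mu+1+n_h} \lambda_{\max}^{\mu+1} \left\| \mathrm{Tr}_h[e^{-H}] \right\| \max_p \left( \frac{\mu+1-p}{eZ} \right)^{\mu+1-p}. \tag{B.78}
\]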

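Plugging this result into the bound from above yields

\[
\begin{aligned}
\left\| \partial_\theta [\log_{K_1,M_1} \sigma_v - \log_{K_1,M_1} \tilde\sigma_v] \right\|
&\le \frac{M_1 \|a\|_1 \pi}{2} \left( e\, 2^{\mu+1+n_h} \lambda_{\max}^{\mu+1} \left\| \mathrm{Tr}_h[e^{-H}] \right\| \max_p \left( \frac{\mu+1-p}{eZ} \right)^{\mu+1-p} \left( \frac{\Delta}{\mu-1} \right)^{\mu} \frac{1}{\mu+1} \right) \\
&\quad + \frac{M_1 \|a\|_1 \pi}{2} \left( \mu \left\| L'_{\mu,j}(\theta_j) \right\|_\infty \epsilon_2 \right),
\end{aligned} \tag{B.79}
\]

Note that under the reasonable assumption that $2 \le \mu \ll Z$, the maximum is achieved for $p = \mu + 1$, and we hence obtain the upper bound

\[
\begin{aligned}
&\frac{M_1 \|a\|_1 \pi}{2} \left( 2^{n_h} e\, (2|\lambda_{\max}|)^{\mu+1} \left\| \mathrm{Tr}_h[e^{-H}] \right\| \left( \frac{\Delta}{\mu-1} \right)^{\mu} \frac{1}{\mu+1} + \mu \left\| L'_{\mu,j}(\theta_j) \right\|_\infty \epsilon_2 \right) \\
&\quad \le \frac{M_1 \|a\|_1 \pi}{2} \left( 2^{n_h} e\, (2|\lambda_{\max}|)^{\mu+1} \left\| \mathrm{Tr}_h[e^{-H}] \right\| \left( \frac{\Delta}{\mu-1} \right)^{\mu} + \mu \left\| L'_{\mu,j}(\theta_j) \right\|_\infty \epsilon_2 \right),
\end{aligned} \tag{B.80}
\]

and we can hence obtain a bound on $\mu$, the grid point number, in order to achieve an error of $\epsilon/6 > 0$ for the former term, which is given by

\[
\mu \ge (|\lambda_{\max}|\Delta) \exp\left( W\left( \frac{\log\left( 2^{n_h}\, \frac{6 M_1 \|a\|_1 e^2 |\lambda_{\max}| \pi \|\mathrm{Tr}_h[e^{-H}]\|}{\epsilon} \right)}{2\lambda_{\max}\Delta} \right) \right), \tag{B.81}
\]

where $W$ is the Lambert function, also known as product-log function, which generally grows slower than the logarithm in the asymptotic limit. Note that $\mu$ can hence be lower bounded by

\[
\mu \ge n_h + \log\left( \frac{6 M_1 \|a\|_1 e^2 |\lambda_{\max}| \pi \left\| \mathrm{Tr}_h[e^{-H}] \right\|}{\epsilon} \right) := n_h + \log\left( \frac{M_1 \Lambda}{\epsilon} \right). \tag{B.82}
\]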

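For convenience, let us choose $\epsilon$ such that $n_h + \log(M_1\Lambda/\epsilon)$ is an integer. We do this simply to avoid having to keep track of ceiling or floor functions in the following discussion, where we will choose $\mu = n_h + \log(M_1\Lambda/\epsilon)$.

For the second part, we will bound the derivative of the Lagrangian interpolation polynomial. First, note that

\[
L'_{\mu,j}(\theta) = \sum_{l=0;\, l\neq j}^{\mu} \left( \prod_{k=0;\, k\neq j,l} \frac{\theta - \theta_k}{\theta_j - \theta_k} \right) \frac{1}{\theta_j - \theta_l}
\]

for a chosen discretization of the space such that $\theta_k - \theta_j = (k - j)\Delta/\mu$ can be bounded by using a central difference formula, such that we use an odd number of points (i.e. we take $\mu = 2\kappa + 1$ for positive integer $\kappa$) and choose the point $m$ at which we evaluate the gradient as the central point of the mesh. Note that in this case we have that for $\mu \ge 5$ and $\theta_m$ being the parameters at the midpoint of the stencil

\[
\begin{aligned}
\left\| L'_{\mu,j} \right\|_\infty
&\le \sum_{l\neq j} \prod_{k\neq j,l} \frac{|\theta_m - \theta_k|}{|\theta_j - \theta_k|} \frac{1}{|\theta_l - \theta_j|} \le \frac{(\kappa!)^2}{(\kappa!)^2} \frac{\mu}{\Delta} \sum_{l\neq j} \frac{1}{|l - j|} \\
&\le \frac{2\mu}{\Delta} \sum_{l=1}^{\kappa} \frac{1}{l} \\
&\le \frac{2\mu}{\Delta} \left( 1 + \int_1^{\kappa-1} \frac{1}{\ell}\, d\ell \right) = \frac{2\mu}{\Delta} \left( 1 + \log((\mu-3)/2) \right) \le \frac{5\mu}{\Delta} \log(\mu/2),
\end{aligned} \tag{B.83}
\]

where the last inequality follows from the fact that $\mu \ge 5$ and $1 + \ln(5/2) < (5/2)\ln(5/2)$. Now, plugging in the $\mu$ from (B.94), we find that this error is bounded by

\[
\left\| L'_{\mu,j} \right\|_\infty \le \frac{5 n_h + 5\log\left( \frac{M_1\Lambda}{\epsilon} \right)}{\Delta} \log\left( n_h/2 + \log\left( \frac{M_1\Lambda}{\epsilon} \right)/2 \right) = \tilde O\left( \frac{n_h + \log\left( \frac{M_1\Lambda}{\epsilon} \right)}{\Delta} \right). \tag{B.84}
\]

If we want an upper bound of $\epsilon/6$ for the second term of the error in (B.80), we hence require

\[
\begin{aligned}
\epsilon_2 &\le \frac{\epsilon}{15 M_1 \|a\|_1 \pi \mu \left\| L'_{\mu,j}(\theta_j) \right\|_\infty} \\
&\le \frac{\epsilon\Delta}{15 M_1 \|a\|_1 \pi \left( n_h + \log(M_1\Lambda/\epsilon) \right)^2 \log\left( (n_h/2) + \log(M_1\Lambda/\epsilon)/2 \right)} \\
&\le \frac{\epsilon\Delta}{15 M_1 \|a\|_1 \pi \mu^2 \log((\mu-1)/2)}
\end{aligned} \tag{B.85}
\]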

We hence obtain that the approximation error due to the divided differences and Fourier series approximation of $\sigma_v$ is bounded by $\epsilon/3$ for the above choice of $\epsilon_2$ and $\mu$. This bounds the second term in (B.72) by $\epsilon/3$.

Finally, we need to take into account the error $\left\| \partial_\theta [\log_{K_1,M_1} \tilde\sigma_v - \log^s_{K_1,M_1} \tilde\sigma_v] \right\|$, which we introduce through the sampling process, i.e., through the finite sample estimate of $\mathbb{E}_s[\cdot]$, here indicated with the superscript $s$ over the logarithm. Note that this error can be bounded straightforwardly by (B.57). We only need to bound the error introduced via the finite number of samples we take, which is a well-known procedure. The concrete bounds for the sample error when estimating the expectation value are stated in the following lemma.

Lemma 22. Let $\sigma_m$ be the sample standard deviation of the random variable

\[
\tilde{\mathbb{E}}_{s\in[0,1]}\left[ \mathrm{Tr}\left( \rho\, e^{\frac{is\pi m}{2}\sigma_v}\, e^{\frac{i\pi m'}{2}\sigma_v(\theta_j)}\, e^{\frac{i(1-s)\pi m}{2}\sigma_v} \right) \right], \tag{B.86}
\]

such that the sample standard deviation is given by $\sigma_k = \frac{\sigma_m}{\sqrt{k}}$. Then with probability at least $1 - \delta_s$, we can obtain an estimate which is within $\epsilon_s\sigma_m$ of the mean by taking $k = \frac{4}{\epsilon_s^2}$ samples for each sample estimate and taking the median of $O(\log(1/\delta_s))$ such samples.

Proof. From Chebyshev's inequality, taking $k = \frac{4}{\epsilon_s^2}$ samples implies that with probability of at least $p = 3/4$ each of the mean estimates is within $2\sigma_k = \epsilon_s\sigma_m$ from the true mean. Therefore, using standard techniques, we take the median of $O(\log(1/\delta_s))$ such estimates, which gives us with probability $1 - \delta_s$ an estimate of the mean with error at most $\epsilon_s\sigma_m$, which implies that we need to repeat the procedure $O\left( \frac{1}{\epsilon_s^2} \log\left( \frac{1}{\delta_s} \right) \right)$ times.
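The estimator described in Lemma 22 is the classical median-of-means construction. The sketch below illustrates it on a synthetic Gaussian random variable rather than on the trace terms of (B.86). The function name `median_of_means`, the constant 8 in the number of groups (obtained from a Hoeffding bound on the number of bad group means, and only one possible choice for the $O(\log(1/\delta_s))$ term), the seed and the test distribution are all assumptions made for illustration.

```python
import math
import random
import statistics
from typing import Callable


def median_of_means(draw: Callable[[], float], eps_s: float, delta_s: float) -> float:
    """Median of O(log(1/delta_s)) sample means, each over k = 4/eps_s^2 draws."""
    k = math.ceil(4 / eps_s**2)
    # Assumed constant: with per-group success probability 3/4, a Hoeffding
    # bound gives failure probability exp(-groups / 8) for the median.
    groups = math.ceil(8 * math.log(1 / delta_s)) | 1  # forced odd
    means = [statistics.fmean([draw() for _ in range(k)]) for _ in range(groups)]
    return statistics.median(means)


# Synthetic stand-in for the random variable; sigma_m plays the role of its
# standard deviation.
rng = random.Random(0)
true_mean, sigma_m = 0.3, 2.0
eps_s, delta_s = 0.1, 0.01
estimate = median_of_means(lambda: rng.gauss(true_mean, sigma_m), eps_s, delta_s)
print(estimate, abs(estimate - true_mean), eps_s * sigma_m)
```

The total number of draws is `k * groups`, which matches the $O(\epsilon_s^{-2}\log(1/\delta_s))$ repetitions stated in the proof.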

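We can then bound the error of the sampling step in the final estimate, denoting with $\epsilon_s$ the sample error, as

\[
\sum_{m=-M_1}^{M_1} \sum_{m'=-M_2}^{M_2} \sum_{j=0}^{\mu} \left| \frac{i c_m \tilde c_{m'} m \pi}{2} \right| |L'_{\mu,j}(\theta)|\, \epsilon_s\sigma_m \le \frac{5 \|a\|_1 M_1 \epsilon_s \sigma_m \pi \mu^2 \log\left( \frac{\mu}{2} \right)}{\Delta} \le \frac{\epsilon}{3}. \tag{B.87}
\]

We hence find that for

\[
\begin{aligned}
\epsilon_s &\le \frac{\epsilon\Delta}{15 \|a\|_1 M_1 \sigma_m \pi \mu^2 \log\left( \frac{\mu}{2} \right)} \\
&\le \frac{\epsilon\Delta}{15 M_1 \|a\|_1 \sigma_m \pi \left( n_h + \log(M_1\Lambda/\epsilon) \right)^2 \log\left( (n_h/2) + \log(M_1\Lambda/\epsilon)/2 \right)} \\
&\le \frac{\epsilon\Delta}{15 M_1 \|a\|_1 \sigma_m \pi \mu^2 \log(\mu/2)}
\end{aligned} \tag{B.88}
\]

also the last term in (B.58) can be bounded by $\epsilon/3$, which together results in an overall error of $\epsilon$ for the various approximation steps, which concludes the proof.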

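Notably, all quantities which occur in our bounds are only polynomial in the number of qubits. The lower bounds for the choice of parameters are summarized in the following.

\begin{align}
M_1 &\ge \sqrt{ L \log\left( \frac{9 \left\| \frac{\partial\sigma_v}{\partial\theta} \right\| K_1 L \pi}{2\epsilon} \right) } \tag{B.89} \\
M_2 &\ge \left\lceil \sqrt{ \log\left( \frac{4}{\epsilon_2} \right) \left( 2\log\delta_u^{-1} \right)^{-1} } \right\rceil \tag{B.90} \\
K_1 &\ge \log((1 - \|\sigma_v\|)\epsilon/9)/\log(\|\sigma_v\|) \tag{B.91} \\
K_2 &\ge \frac{\log(4/\epsilon_2)}{\log(\delta_u^{-1})} \tag{B.92} \\
L &\ge \frac{\log\left( \frac{\epsilon}{9 K_1 \pi \left\| \frac{\partial\sigma_v}{\partial\theta} \right\|} \right)}{\log(\|\sigma_v\|\pi)} \tag{B.93} \\
\mu &\ge n_h + \log\left( \frac{6 M_1 \|a\|_1 e^2 |\lambda_{\max}| \pi \left\| \mathrm{Tr}_h[e^{-H}] \right\|}{\epsilon} \right) := n_h + \log(M_1\Lambda/\epsilon) \tag{B.94} \\
\epsilon_2 &\le \frac{\epsilon\Delta}{15 M_1 \|a\|_1 \pi \left( n_h + \log(M_1\Lambda/\epsilon) \right)^2 \log\left( (n_h/2) + \log(M_1\Lambda/\epsilon)/2 \right)} \le \frac{\epsilon\Delta}{15 M_1 \|a\|_1 \pi \mu^2 \log((\mu-1)/2)} \tag{B.95} \\
\epsilon_s &\le \frac{\epsilon\Delta}{15 M_1 \|a\|_1 \sigma_m \pi \left( n_h + \log(M_1\Lambda/\epsilon) \right)^2 \log\left( (n_h/2) + \log(M_1\Lambda/\epsilon)/2 \right)} \le \frac{\epsilon\Delta}{15 M_1 \|a\|_1 \sigma_m \pi \mu^2 \log(\mu/2)} \tag{B.96}
\end{align}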
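As a sanity check on the truncation parameters $K_1$, $L$ and $M_1$, the following sketch evaluates the lower bounds (B.91), (B.93) and (B.89) for given norms and confirms that each of the three error contributions (B.59), (B.63) and (B.66) then stays below $\epsilon/9$. The function names and the numerical inputs are hypothetical, the natural logarithm is assumed wherever the base matters, and $\sum_{k=1}^{K_1} 1/k$ is bounded by $K_1$ as in (B.66).

```python
import math


def truncation_parameters(sigma_norm: float, grad_norm: float, eps: float):
    """Smallest integers satisfying (B.91), (B.93) and (B.89).

    sigma_norm stands for ||sigma_v|| and grad_norm for ||d sigma_v / d theta||.
    """
    if not 0 < sigma_norm * math.pi < 1:
        raise ValueError("the bounds assume ||sigma_v|| * pi < 1")
    K1 = math.ceil(math.log((1 - sigma_norm) * eps / 9) / math.log(sigma_norm))
    L = math.ceil(
        math.log(eps / (9 * K1 * math.pi * grad_norm))
        / math.log(sigma_norm * math.pi)
    )
    M1 = math.ceil(
        math.sqrt(L * math.log(9 * grad_norm * K1 * L * math.pi / (2 * eps)))
    )
    return K1, L, M1


def error_terms(sigma_norm: float, grad_norm: float, K1: int, L: int, M1: int):
    """Right-hand sides of (B.59), (B.63) and (B.66), using sum_k 1/k <= K1."""
    t1 = sigma_norm**K1 / (1 - sigma_norm)
    t2 = K1 * math.pi * (sigma_norm * math.pi) ** L * grad_norm
    t3 = K1 * L * math.pi / 2 * math.exp(-2 * M1**2 / L) * grad_norm
    return t1, t2, t3


# Hypothetical inputs chosen only to exercise the formulas.
sigma_norm, grad_norm, eps = 0.2, 1.0, 1e-3
K1, L, M1 = truncation_parameters(sigma_norm, grad_norm, eps)
terms = error_terms(sigma_norm, grad_norm, K1, L, M1)
print(K1, L, M1, terms)
```

Since each contribution is at most $\epsilon/9$, their sum respects the $\epsilon/3$ budget assigned to the first term of (B.58).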

B.3.1 Operationalising

In the following we will make use of two established subroutines, namely sample based Hamiltonian simulation (aka the LMR protocol) [13], as well as the Hadamard test, in order to evaluate the gradient approximation as defined in (B.57). To derive the query complexity for this algorithm, we only need to multiply the number of factors we need to evaluate with the query complexity of these routines. For this we will rely on the following result.

Theorem 17 (Sample based Hamiltonian simulation [57]). Let $0 \le \epsilon_h \le 1/6$ be an error parameter and let $\rho$ be a density matrix for which we can obtain multiple copies through queries to an oracle $O_\rho$. We can then simulate the time evolution $e^{-i\rho t}$ up to error $\epsilon_h$ in trace norm as long as $\epsilon_h/t \le 1/(6\pi)$ with $\Theta(t^2/\epsilon_h)$ copies of $\rho$ and hence queries to $O_\rho$.

We in particular need to evaluate terms of the form

\[
\mathrm{Tr}\left( \rho\, e^{\frac{is\pi m}{2}\sigma_v}\, e^{\frac{i\pi m'}{2}\sigma_v(\theta_j)}\, e^{\frac{i(1-s)\pi m}{2}\sigma_v} \right) \tag{B.97}
\]

Note that we can simulate every term in the trace (except $\rho$) via the sample based Hamiltonian simulation approach to error $\epsilon_h$ in trace norm. This will introduce an additional error which we need to take into account for the analysis. Let $\tilde U_i$, $i \in \{1,2,3\}$, be the unitaries such that $\|\tilde U_i - U_i\|_* \le \epsilon_h$, where the $U_i$ correspond to the factors in (B.97), i.e., $U_1 := e^{\frac{is\pi m}{2}\sigma_v}$, $U_2 := e^{\frac{i\pi m'}{2}\sigma_v(\theta_j)}$, and $U_3 := e^{\frac{i(1-s)\pi m}{2}\sigma_v}$.

We can then bound the error as follows. First note that $\|\tilde U_i\| \le \|\tilde U_i - U_i\| + \|U_i\| \le 1 + \epsilon_h$, using Theorem 17 and the fact that the spectral norm is upper bounded by the trace norm.

\[
\begin{aligned}
\left| \mathrm{Tr}(\rho U_1 U_2 U_3) - \mathrm{Tr}\left( \rho \tilde U_1 \tilde U_2 \tilde U_3 \right) \right|
&= \left| \mathrm{Tr}\left( \rho U_1 U_2 U_3 - \rho \tilde U_1 \tilde U_2 \tilde U_3 \right) \right| \\
&\le \left\| U_1 U_2 U_3 - \tilde U_1 \tilde U_2 \tilde U_3 \right\| \\
&\le \|U_1 - \tilde U_1\| \|\tilde U_2\| \|\tilde U_3\| + \|U_2 - \tilde U_2\| \|\tilde U_3\| + \|U_3 - \tilde U_3\| \\
&\le \|U_1 - \tilde U_1\|_* (1 + \epsilon_h)^2 + \|U_2 - \tilde U_2\|_* (1 + \epsilon_h) + \|U_3 - \tilde U_3\|_* \\
&\le \epsilon_h (1 + \epsilon_h)^2 + \epsilon_h (1 + \epsilon_h) + \epsilon_h = O(\epsilon_h),
\end{aligned} \tag{B.98}
\]

neglecting higher orders of $\epsilon_h$, and where in the first step we applied the Von-Neumann trace inequality and the fact that $\rho$ is Hermitian, and in the last step we used the results of Theorem 17. We hence require $O((\max\{M_1,M_2\}\pi)^2/\epsilon_h)$ queries to the oracles for $\sigma_v$ for the evaluation of each term in the multi-sum in (B.57). Note that the Hadamard test has a query cost of $O(1)$. In order to achieve an overall error of $\epsilon$ in the gradient estimation, we require the error introduced by the sample based Hamiltonian simulation also to be of $O(\epsilon)$. To do so we require $\epsilon_h \le O\left( \frac{\epsilon\Delta}{5 \|a\|_1 M_1 \pi \mu^2 \log(\mu/2)} \right)$, similar to the sample based error, which yields the query complexity of

\[
O\left( \frac{\max\{M_1,M_2\}^2 \|a\|_1 M_1 \pi^3 \mu^2 \log(\mu/2)}{\epsilon\Delta} \right) \tag{B.99}
\]

Adjusting the constants then gives the required bound of $\epsilon$ on the total error, and the query complexity of the algorithm to the Gibbs state preparation procedure is consequently given by the number of terms in (B.57) times the query complexity for the individual term, yielding

\[
O\left( \frac{M_1^2 M_2 \max\{M_1,M_2\}^2 \|a\|_1 \sigma_m \pi^3 \mu^3 \log\left( \frac{\mu}{2} \right)}{\epsilon\, \epsilon_s^2\, \Delta} \right), \tag{B.100}
\]

and classical precomputation polynomial in $M_1, M_2, K_1, L, s, \Delta, \mu$, where the different quantities are defined in eq. (B.89-B.96).
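The telescoping argument behind (B.98) can be illustrated on a single qubit. The sketch below is only a toy model under stated assumptions: the approximate unitaries are exact Pauli rotations with slightly shifted angles, which stand in for (and are not) the output of sample based Hamiltonian simulation, and the density matrix, angles and helper names are hypothetical. For a Pauli rotation, shifting the angle by $\eta$ changes the unitary by $2|\sin(\eta/2)|$ in spectral norm, which is used as $\epsilon_h$.

```python
import cmath
import math


def matmul(a, b):
    """Product of two 2x2 complex matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def pauli_rotation(theta: float, axis: str):
    """exp(-i * theta * P) for the Pauli matrix P = X (axis 'x') or Z (axis 'z')."""
    if axis == "x":
        c, s = math.cos(theta), math.sin(theta)
        return [[c, -1j * s], [-1j * s, c]]
    return [[cmath.exp(-1j * theta), 0], [0, cmath.exp(1j * theta)]]


def trace_term(rho, unitaries):
    """Tr(rho U_1 U_2 U_3) for a list of 2x2 unitaries."""
    prod = rho
    for u in unitaries:
        prod = matmul(prod, u)
    return prod[0][0] + prod[1][1]


# Hypothetical single-qubit data: a valid density matrix and three rotations.
rho = [[0.7, 0.2], [0.2, 0.3]]
angles, axes = [0.4, 1.1, -0.7], ["x", "z", "x"]
shifts = [0.01, -0.02, 0.015]  # angle errors modelling the imperfect simulation

exact = [pauli_rotation(t, a) for t, a in zip(angles, axes)]
approx = [pauli_rotation(t + d, a) for t, d, a in zip(angles, shifts, axes)]

eps_h = max(2 * abs(math.sin(d / 2)) for d in shifts)  # per-factor error
error = abs(trace_term(rho, exact) - trace_term(rho, approx))
bound = eps_h * (1 + eps_h) ** 2 + eps_h * (1 + eps_h) + eps_h
print(error, bound)
```

The observed deviation of the trace stays below the right-hand side of (B.98), which is linear in $\epsilon_h$ to leading order.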

Taking into account the query complexity of the individual steps then results in Theorem 13. We proceed by proving this theorem next.

Proof of Theorem 13. The runtime follows straightforwardly by using the bounds derived in (B.99) and Lemma 21, and by using the bounds for the parameters $M_1, M_2, K_1, L, \mu, \Delta, s$ given in eq. (B.89-B.96). For the success probability for estimating the whole gradient with dimensionality $d$, we can now again make use of the boosting scheme used in (B.23) to obtain

\[
\tilde O\left( \frac{d\, \|a\|_1^3 \sigma_m^3 \mu^5 \log^3(\mu/2)\, \mathrm{polylog}\left( \frac{\left\| \frac{\partial\sigma_v}{\partial\theta} \right\|}{\epsilon}, \frac{n_h^2 \|a\|_1 \sigma_m}{\epsilon\Delta} \right)}{\epsilon^3\Delta^3} \log(d) \right), \tag{B.101}
\]

where $\mu = n_h + \log(M_1\Lambda/\epsilon)$.

Next we need to take into account the errors from the Gibbs state preparation given in Lemma 17. For this note that the error between the perfect Hamiltonian simulation of $\sigma_v$ and the sample based Hamiltonian simulation with an erroneous density matrix denoted by $\tilde U$, i.e., including the error from the Gibbs state preparation procedure, is given by

\[
\left\| \tilde U - e^{-i\sigma_v t} \right\| \le \left\| \tilde U - e^{-i\tilde\sigma_v t} \right\| + \left\| e^{-i\tilde\sigma_v t} - e^{-i\sigma_v t} \right\| \le \epsilon_h + \epsilon_G t \tag{B.102}
\]

where $\epsilon_h$ is the error of the sample based Hamiltonian simulation, which holds since the trace norm is an upper bound for the spectral norm, and $\|\sigma_v - \tilde\sigma_v\| \le \epsilon_G$ is the error for the Gibbs state preparation from Theorem 15 for a $d$-sparse Hamiltonian, for a cost

\[
\tilde O\left( \sqrt{\frac{N}{z}}\, \|H\|\, d \log\left( \frac{\|H\|}{\epsilon_G} \right) \log\left( \frac{1}{\epsilon_G} \right) \right).
\]

From (B.98) we know that the error $\epsilon_h$ propagates nearly linearly, and hence it suffices for us to take $\epsilon_G \le \epsilon_h/t$ where $t = O(\max\{M_1,M_2\})$ and adjust the constants $\epsilon_h \leftarrow \epsilon_h/2$ in order to achieve the same precision $\epsilon$ in the final result. We hence require

\[
\tilde O\left( \sqrt{\frac{N}{z}}\, \|H(\theta)\| \log\left( \frac{\|H(\theta)\| \max\{M_1,M_2\}}{\epsilon_h} \right) \log\left( \frac{\max\{M_1,M_2\}}{\epsilon_h} \right) \right) \tag{B.103}
\]

and using the $\epsilon_h$ from before we hence find that this is bounded by

\[
\tilde O\left( \sqrt{\frac{N}{z}}\, \|H(\theta)\| \log\left( \frac{\|H(\theta)\|\, n_h^2}{\epsilon\Delta} \right) \log\left( \frac{n_h^2}{\epsilon\Delta} \right) \right) \tag{B.104}
\]

query complexity to the oracle of $H$ for the Gibbs state preparation.

The procedure succeeds with probability at least $1 - \delta_s$ for a single repetition for each entry of the gradient. In order to have a failure probability of the final algorithm of less than $1/3$, we need to repeat the procedure for all $D$ dimensions of the gradient and take for each the median over a number of samples. Let $n_f$ be, as previously, the number of instances of one component of the gradient such that the error is larger than $\epsilon_s\sigma_m$, and $n_s$ the number of instances with an error $\le \epsilon_s\sigma_m$; the result that we take is the median of the estimates, where we take $n = n_s + n_f$ samples. The algorithm gives a wrong answer for each dimension if $n_s \le \lfloor n/2 \rfloor$, since then the median is a sample such that the error is larger than $\epsilon_s\sigma_m$. Let $p = 1 - \delta_s$ be the success probability to draw a positive sample, as is the case for our algorithm. Since each instance from the algorithm (recall that each sample here consists of a number of samples itself) will independently return an estimate for the entry of the gradient, the total failure probability is bounded by the union bound, i.e.,

\[
\mathrm{Pr}_{\mathrm{fail}} \le D \cdot \mathrm{Pr}\left[ n_s \le \left\lfloor \frac{n}{2} \right\rfloor \right] \le D \cdot e^{-\frac{n}{2(1-\delta_s)} \left( (1-\delta_s) - \frac{1}{2} \right)^2} \le \frac{1}{3}, \tag{B.105}
\]

which follows from the Chernoff inequality for a binomial variable with $1 - \delta_s > 1/2$, which is given in our case for a proper choice of $\delta_s < 1/2$. Therefore, by taking $n \ge \frac{2 - 2\delta_s}{(1/2 - \delta_s)^2} \log(3D) = O(\log(3D))$, we achieve a total failure probability of at most $1/3$ for a constant, fixed $\delta_s$. Note that this hence results in a multiplicative factor of $O(\log(D))$ in the query complexity of (5.25).
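The choice of $n$ in the boosting step can be checked directly against the exact binomial tail. The following sketch assumes natural logarithms and uses hypothetical values for $D$ and $\delta_s$; the function names are ours. It computes the smallest $n$ allowed by the bound above, evaluates the Chernoff expression in (B.105), and compares it with the exact probability that at most $\lfloor n/2 \rfloor$ of the $n$ estimates are good.

```python
import math


def repetitions(D: int, delta_s: float) -> int:
    """Smallest integer n with n >= (2 - 2*delta_s) / (1/2 - delta_s)^2 * log(3D)."""
    return math.ceil((2 - 2 * delta_s) / (0.5 - delta_s) ** 2 * math.log(3 * D))


def chernoff_failure(n: int, D: int, delta_s: float) -> float:
    """Union bound times the Chernoff tail, as in (B.105)."""
    p = 1 - delta_s
    return D * math.exp(-n / (2 * p) * (p - 0.5) ** 2)


def exact_failure(n: int, D: int, delta_s: float) -> float:
    """Union bound times the exact probability that n_s <= floor(n/2)."""
    p = 1 - delta_s
    tail = sum(math.comb(n, k) * p**k * (1 - p) ** (n - k)
               for k in range(n // 2 + 1))
    return D * tail


# Hypothetical values: 1000 gradient entries, per-repetition failure 1/4.
D, delta_s = 1000, 0.25
n = repetitions(D, delta_s)
print(n, chernoff_failure(n, D, delta_s), exact_failure(n, D, delta_s))
```

The exact tail is typically far smaller than the Chernoff expression, so the stated $n$ is a safe but conservative choice.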

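The total query complexity to the oracle $O_\rho$ for a purified density matrix of the data $\rho$ and the Hamiltonian oracle $O_H$ is then given by

\[
\tilde O\left( \sqrt{\frac{N}{z}}\, \frac{d \log(d)\, \|H(\theta)\|\, \|a\|_1^3 \sigma_m^3 \mu^5 \log^3(\mu/2)\, \mathrm{polylog}\left( \frac{\left\| \frac{\partial\sigma_v}{\partial\theta} \right\|}{\epsilon}, \frac{n_h^2 \|a\|_1 \sigma_m}{\epsilon\Delta}, \|H(\theta)\| \right)}{\epsilon^3\Delta^3} \right), \tag{B.106}
\]

which reduces to

\[
\tilde O\left( \sqrt{\frac{N}{z}}\, \frac{D\, \|H(\theta)\|\, d\, \mu^5 \alpha}{\epsilon^3} \right), \tag{B.107}
\]

hiding the logarithmic factors in the $\tilde O$ notation.

Bibliography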

[1] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.

[2] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.

[3] David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.

[4] Igor L Markov. Limits on fundamental limits to computation. Nature, 512(7513):147–154, 2014.

[5] Carlo Ciliberto, Mark Herbster, Alessandro Davide Ialongo, Massimiliano Pontil, Andrea Rocchetto, Simone Severini, and Leonard Wossnig. Quantum machine learning: a classical perspective. Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences, 474(2209):20170551, 2018.

[6] Vittorio Giovannetti, Seth Lloyd, and Lorenzo Maccone. Quantum random access memory. Physical review letters, 100(16):160501, 2008.

[7] Vittorio Giovannetti, Seth Lloyd, and Lorenzo Maccone. Architectures for a quantum random access memory. Physical Review A, 78(5):052310, 2008.

[8] Seth Lloyd. Universal quantum simulators. Science, pages 1073–1078, 1996.

[9] Nathan Wiebe, Ashish Kapoor, and Krysta M Svore. Quantum deep learning. arXiv preprint arXiv:1412.3489, 2014.

[10] Alessandro Rudi, Luigi Carratino, and Lorenzo Rosasco. Falkon: An optimal large scale kernel method. In Advances in Neural Information Processing Systems, pages 3888–3898, 2017.

[11] B David Clader, Bryan C Jacobs, and Chad R Sprouse. Preconditioned quantum linear system algorithm. Physical review letters, 110(25):250504, 2013.

[12] Aram Harrow and Rolando P. La Placa. Limitations of quantum preconditioning using a sparse approximate inverse. Unpublished work, 2017.

[13] Seth Lloyd, Masoud Mohseni, and Patrick Rebentrost. Quantum principal component analysis. Nature Physics, 10(9):631, 2014.

[14] Patrick Rebentrost, Masoud Mohseni, and Seth Lloyd. Quantum support vector machine for big data classification. Physical review letters, 113(13):130503, 2014.

[15] Ewin Tang. A quantum-inspired classical algorithm for recommendation systems. In Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing, pages 217–228, 2019.

[16] Iordanis Kerenidis and Anupam Prakash. Quantum recommendation systems. In Proceedings of the 8th Innovations in Theoretical Computer Science Conference (ITCS 2017). Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2017.

[17] Ewin Tang. Quantum-inspired classical algorithms for principal component analysis and supervised clustering. arXiv preprint arXiv:1811.00414, 2018.

[18] Nai-Hui Chia, Han-Hsuan Lin, and Chunhao Wang. Quantum-inspired sublinear classical algorithms for solving low-rank linear systems. arXiv preprint arXiv:1811.04852, 2018.

[19] András Gilyén, Seth Lloyd, and Ewin Tang. Quantum-inspired low-rank stochastic regression with logarithmic dependence on the dimension. arXiv preprint arXiv:1811.04909, 2018.

[20] Nai-Hui Chia, Tongyang Li, Han-Hsuan Lin, and Chunhao Wang. Quantum-inspired classical sublinear-time algorithm for solving low-rank semidefinite programming via sampling approaches. arXiv preprint arXiv:1901.03254, 2019.

[21] Nai-Hui Chia, András Gilyén, Tongyang Li, Han-Hsuan Lin, Ewin Tang, and Chunhao Wang. Sampling-based sublinear low-rank matrix arithmetic framework for dequantizing quantum machine learning. In Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing, pages 387–400, 2020.

[22] Danial Dervovic, Mark Herbster, Peter Mountney, Simone Severini, Naïri Usher, and Leonard Wossnig. Quantum linear systems algorithms: a primer. arXiv preprint arXiv:1802.08227, 2018.

[23] Chunhao Wang and Leonard Wossnig. A quantum algorithm for simulating non-sparse hamiltonians. Quantum Information & Computation, 20(7&8):597–615, 2020.

[24] Alessandro Rudi, Leonard Wossnig, Carlo Ciliberto, Andrea Rocchetto, Massimiliano Pontil, and Simone Severini. Approximating hamiltonian dynamics with the nyström method. Quantum, 4:234, 2020.

[25] Carlo Ciliberto, Andrea Rocchetto, Alessandro Rudi, and Leonard Wossnig. Fast quantum learning with statistical guarantees. arXiv preprint arXiv:2001.10477, 2020.

[26] Nathan Wiebe and Leonard Wossnig. Generative training of quantum boltzmann machines with hidden units. arXiv preprint arXiv:1905.09902, 2019.

[27] Hongxiang Chen, Leonard Wossnig, Simone Severini, Hartmut Neven, and Masoud Mohseni. Universal discriminative quantum neural networks. arXiv preprint arXiv:1805.08654, 2018.

[28] Marcello Benedetti, Edward Grant, Leonard Wossnig, and Simone Severini. Adversarial quantum circuit learning for pure state approximation. New Journal of Physics, 21(4):043023, 2019.

[29] Edward Grant, Leonard Wossnig, Mateusz Ostaszewski, and Marcello Benedetti. An initialization strategy for addressing barren plateaus in parametrized quantum circuits. Quantum, 3:214, 2019.

[30] Shuxiang Cao, Leonard Wossnig, Brian Vlastakis, Peter Leek, and Edward Grant. Cost-function embedding and dataset encoding for machine learning with parametrized quantum circuits. Physical Review A, 101(5):052309, 2020.

[31] I Rungger, N Fitzpatrick, H Chen, CH Alderete, H Apel, A Cowtan, A Patterson, D Munoz Ramo, Y Zhu, NH Nguyen, et al. Dynamical mean field theory algorithm and experiment on quantum computers. arXiv preprint arXiv:1910.04735, 2019.

[32] Andrew Patterson, Hongxiang Chen, Leonard Wossnig, Simone Severini, Dan Browne, and Ivan Rungger. Quantum state discrimination using noisy quantum neural networks. arXiv preprint arXiv:1911.00352, 2019.

[33] Jules Tilly, Glenn Jones, Hongxiang Chen, Leonard Wossnig, and Edward Grant. Computation of molecular excited states on ibmq using a discriminative variational quantum eigensolver. arXiv preprint arXiv:2001.04941, 2020.

[34] Howard E Haber. Notes on the matrix exponential and logarithm. 2018.

[35] Nicholas J Higham. Functions of matrices: theory and computation, volume 104. Siam, 2008.

[36] Aram W Harrow, Avinatan Hassidim, and Seth Lloyd. Quantum algorithm for linear systems of equations. Physical review letters, 103(15):150502, 2009.

[37] Shantanav Chakraborty, András Gilyén, and Stacey Jeffery. The power of block-encoded matrix powers: Improved regression techniques via faster Hamiltonian simulation. In Proceedings of the 46th International Colloquium on Automata, Languages, and Programming (ICALP 2019), volume 132, pages 33:1–33:14, 2019.

[38] Nathan Wiebe, Daniel Braun, and Seth Lloyd. Quantum algorithm for data fitting. Physical review letters, 109(5):050505, 2012.

[39] Maria Schuld, Ilya Sinayskiy, and Francesco Petruccione. Prediction by linear regression on a quantum computer. Physical Review A, 94(2):022342, 2016.

[40] Guoming Wang. Quantum algorithm for linear regression. Physical review A, 96(1):012335, 2017.

[41] Iordanis Kerenidis and Anupam Prakash. Quantum recommendation systems. In Innovations in Theoretical Computer Science (ITCS’17), 2017.

[42] Dan-Bo Zhang, Shi-Liang Zhu, and ZD Wang. Nonlinear regression based on a hybrid quantum computer. arXiv preprint arXiv:1808.09607, 2018.

[43] Scott Aaronson. Read the fine print. Nature Physics, 11(4):291–293, 2015.

[44] Shai Shalev-Shwartz and Shai Ben-David. Understanding machine learning: From theory to algorithms. Cambridge University Press, 2014.

[45] Vladimir Naumovich Vapnik. Statistical learning theory, volume 1. Wiley New York, 1998.

[46] Anselm Blumer, Andrzej Ehrenfeucht, David Haussler, and Manfred K Warmuth. Learnability and the vapnik-chervonenkis dimension. Journal of the ACM (JACM), 36(4):929–965, 1989.

[47] Felipe Cucker and Steve Smale. On the mathematical foundations of learning. Bulletin of the American mathematical society, 39(1):1–49, 2002.

[48] Christopher M Bishop et al. Pattern recognition and machine learning, volume 1. springer New York, 2006.

[49] Carl Rasmussen and Chris Williams. Gaussian processes for machine learning. Gaussian Processes for Machine Learning, 2006.

[50] John Shawe-Taylor, Nello Cristianini, et al. Kernel methods for pattern analysis. Cambridge university press, 2004.

[51] Yuchen Zhang, John Duchi, and Martin Wainwright. Divide and conquer kernel ridge regression. In Conference on Learning Theory, pages 592–617, 2013.

[52] Ali Rahimi, Benjamin Recht, et al. Random features for large-scale kernel machines. In NIPS, volume 3, page 5, 2007.

[53] Alex J Smola and Bernhard Schölkopf. Sparse greedy matrix approximation for machine learning. pages 911–918. Morgan Kaufmann, 2000.

[54] Christopher KI Williams and Matthias Seeger. Using the nyström method to speed up kernel machines. In Advances in neural information processing systems, pages 682–688, 2001.

[55] Alessandro Rudi, Raffaello Camoriano, and Lorenzo Rosasco. Less is more: Nyström computational regularization. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 1, NIPS’15, page 1657–1665, Cambridge, MA, USA, 2015. MIT Press.

[56] Alessandro Rudi, Leonard Wossnig, Carlo Ciliberto, Andrea Rocchetto, Massimiliano Pontil, and Simone Severini. Approximating Hamiltonian dynamics with the Nyström method. Quantum, 4:234, 2020.

[57] Shelby Kimmel, Cedric Yen-Yu Lin, Guang Hao Low, Maris Ozols, and Theodore J Yoder. Hamiltonian simulation with optimal sample complexity. npj Quantum Information, 3(1):13, 2017.

[58] Tongyang Li, Shouvanik Chakrabarti, and Xiaodi Wu. Sublinear quantum algorithms for training linear and kernel-based classifiers. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 3815–3824, Long Beach, California, USA, 09–15 Jun 2019. PMLR.

[59] L Lo Gerfo, Lorenzo Rosasco, Francesca Odone, E De Vito, and Alessandro Verri. Spectral algorithms for supervised learning. Neural Computation, 20(7):1873–1897, 2008.

[60] Harry Hochstadt. Integral equations, volume 91. John Wiley & Sons, 2011.

[61] Joel A Tropp et al. An introduction to matrix concentration inequalities. Foundations and Trends® in Machine Learning, 8(1-2):1–230, 2015.

[62] Alex B Grilo and Iordanis Kerenidis. Learning with errors is easy with quantum samples. arXiv preprint arXiv:1702.08255, 2017.

[63] Varun Kanade, Andrea Rocchetto, and Simone Severini. Learning dnfs under product distributions via μ-biased quantum fourier sampling. arXiv preprint arXiv:1802.05690, 2018.

[64] Yu Tong, Dong An, Nathan Wiebe, and Lin Lin. Fast inversion, preconditioned quantum linear system solvers, and fast evaluation of matrix functions. arXiv preprint arXiv:2008.13295, 2020.

[65] Siyuan Ma and Mikhail Belkin. Diving into the shallows: a computational perspective on large-scale shallow learning. In Advances in Neural Information Processing Systems, pages 3778–3787, 2017.

[66] Alon Gonen, Francesco Orabona, and Shai Shalev-Shwartz. Solving ridge regression using sketched preconditioned svrg. In International Conference on Machine Learning, pages 1397–1405, 2016.

[67] Haim Avron, Kenneth L Clarkson, and David P Woodruff. Faster kernel ridge regression using sketching and preconditioning. SIAM Journal on Matrix Analysis and Applications, 38(4):1116–1138, 2017.

[68] Gregory E Fasshauer and Michael J McCourt. Stable evaluation of gaussian radial basis function interpolants. SIAM Journal on Scientific Computing, 34(2):A737–A762, 2012.

[69] David P Woodruff. Sketching as a tool for numerical linear algebra. arXiv preprint arXiv:1411.4357, 2014.

[70] Petros Drineas, Ravi Kannan, and Michael W Mahoney. Fast monte carlo algorithms for matrices i: Approximating matrix multiplication. SIAM Journal on Computing, 36(1):132–157, 2006.

[71] Michael W Mahoney. Lecture notes on randomized linear algebra. arXiv preprint arXiv:1608.04481, 2016.

[72] Jeremy Adcock, Euan Allen, Matthew Day, Stefan Frick, Janna Hinchliff, Mack Johnson, Sam Morley-Short, Sam Pallister, Alasdair Price, and Stasja Stanisic. Advances in quantum machine learning. arXiv preprint arXiv:1512.02900, 2015.

[73] Damian S. Steiger and Matthias Troyer. Racing in parallel: Quantum versus classical. In Quantum Machine Learning Workshop, Waterloo, CA, August 2016. Perimeter Institute for theoretical Physics.

[74] Oded Regev and Liron Schiff. Impossibility of a quantum speed-up with a faulty oracle. In International Colloquium on Automata, Languages, and Programming, pages 773–781. Springer, 2008.

[75] Srinivasan Arunachalam, Vlad Gheorghiu, Tomas Jochym-O’Connor, Michele Mosca, and Priyaa Varshinee Srinivasan. On the robustness of bucket brigade quantum ram. New Journal of Physics, 17(12):123010, 2015.

[76] Richard P Feynman. Simulating physics with computers. International journal of theoretical physics, 21(6):467–488, 1982.

[77] Richard P Feynman. Quantum mechanical computers. Foundations of physics, 16(6):507–531, 1986.

[78] Peter W Shor. Polynomial-time algorithms for prime factorization and discrete logarithms on a quantum computer. SIAM Review, 41(2):303–332, 1999.

[79] Dorit Aharonov and Amnon Ta-Shma. Adiabatic quantum state generation and statistical zero knowledge. In Proceedings of the 35th Annual ACM Symposium on Theory of Computing (STOC 2003), pages 20–29. ACM, 2003.

[80] Dominic W Berry, Graeme Ahokas, Richard Cleve, and Barry C Sanders. Efficient quantum algorithms for simulating sparse Hamiltonians. Communications in Mathematical Physics, 270(2):359–371, 2007.

[81] Dominic W Berry and Andrew M Childs. Black-box Hamiltonian simulation and unitary implementation. Quantum Information & Computation, 12(1-2):29–62, 2012.

[82] Dominic W Berry, Andrew M Childs, Richard Cleve, Robin Kothari, and Rolando D Somma. Exponential improvement in precision for simulating sparse Hamiltonians. Forum of Mathematics, Sigma, 5, 2017.

[83] Andrew M Childs. On the relationship between continuous- and discrete-time quantum walk. Communications in Mathematical Physics, 294(2):581–603, 2010.

[84] Andrew M Childs and Nathan Wiebe. Hamiltonian simulation using linear combinations of unitary operations. Quantum Information & Computation, 12(11-12):901–924, 2012.

[85] David Poulin, Angie Qarry, Rolando Somma, and Frank Verstraete. Quantum simulation of time-dependent Hamiltonians and the convenient illusion of Hilbert space. Physical Review Letters, 106(17):170501, 2011.

[86] Nathan Wiebe, Dominic W Berry, Peter Høyer, and Barry C Sanders. Simulating quantum dynamics on a quantum computer. Journal of Physics A: Mathematical and Theoretical, 44(44):445308, 2011.

[87] Guang Hao Low and Isaac L Chuang. Hamiltonian simulation by qubitization. Quantum, 3:163, 2019.

[88] Dominic W Berry and Leonardo Novo. Corrected quantum walk for optimal Hamiltonian simulation. Quantum Information & Computation, 16(15-16):1295–1317, 2016.

[89] Guang Hao Low and Isaac L Chuang. Hamiltonian simulation by uniform spectral amplification. arXiv preprint arXiv:1707.05391, 2017.

[90] Dominic W Berry, Andrew M Childs, and Robin Kothari. Hamiltonian simulation with nearly optimal dependence on all parameters. In Proceedings of the 56th Annual Symposium on Foundations of Computer Science (FOCS 2015), pages 792–809. IEEE, 2015.

[91] Guang Hao Low and Isaac L Chuang. Optimal Hamiltonian simulation by quantum signal processing. Physical Review Letters, 118(1):010501, 2017.

[92] Guang Hao Low. Hamiltonian simulation with nearly optimal dependence on spectral norm. In Proceedings of the 51st Annual ACM Symposium on Theory of Computing (STOC 2019), pages 491–502, 2019.

[93] Andrew M Childs, Robin Kothari, and Rolando D Somma. Quantum algorithm for systems of linear equations with exponentially improved dependence on precision. SIAM Journal on Computing, 46(6):1920–1950, 2017.

[94] Leonard Wossnig, Zhikuan Zhao, and Anupam Prakash. Quantum linear system algorithm for dense matrices. Physical Review Letters, 120(5):050502, 2018.

[95] Andrew M. Childs and Robin Kothari. Limitations on the simulation of non-sparse Hamiltonians. Quantum Information & Computation, 10(7&8):669–684, 2010.

[96] Roger A Horn and Charles R Johnson. Matrix analysis. Cambridge University Press, 1990.

[97] András Gilyén, Yuan Su, Guang Hao Low, and Nathan Wiebe. Quantum singular value transformation and beyond: exponential improvements for quantum matrix arithmetics. In Proceedings of the 51st Annual ACM Symposium on Theory of Computing (STOC 2019), pages 193–204. ACM, 2019.

[98] Nicholas J. Higham. The scaling and squaring method for the matrix exponential revisited. SIAM J. Matrix Anal. Appl., 26(4):1179–1193, April 2005.

[99] Nicholas J. Higham. The scaling and squaring method for the matrix exponential revisited. SIAM Rev., 51(4):747–764, November 2009.

[100] Awad H Al-Mohy and Nicholas J Higham. A new scaling and squaring algorithm for the matrix exponential. SIAM Journal on Matrix Analysis and Applications, 31(3):970–989, 2009.

[101] Awad H Al-Mohy and Nicholas J Higham. Computing the action of the matrix exponential, with an application to exponential integrators. SIAM journal on scientific computing, 33(2):488–511, 2011.

[102] Petros Drineas, Michael W. Mahoney, S. Muthukrishnan, and Tamás Sarlós. Faster least squares approximation. Numer. Math., 117(2):219–249, February 2011.

[103] Lorenzo Orecchia, Sushant Sachdeva, and Nisheeth K. Vishnoi. Approximating the exponential, the Lanczos method and an Õ(m)-time spectral algorithm for balanced separator. In Proceedings of the Forty-Fourth Annual ACM Symposium on Theory of Computing, STOC ’12, page 1141–1160, New York, NY, USA, 2012. Association for Computing Machinery.

[104] Daniel A. Spielman and Shang-Hua Teng. Nearly-linear time algorithms for graph partitioning, graph sparsification, and solving linear systems. In Proceedings of the Thirty-Sixth Annual ACM Symposium on Theory of Computing, STOC ’04, page 81–90, New York, NY, USA, 2004. Association for Computing Machinery.

[105] Daniel A Spielman and Shang-Hua Teng. Spectral sparsification of graphs. SIAM Journal on Computing, 40(4):981–1025, 2011.

[106] Petros Drineas, Ravi Kannan, and Michael W Mahoney. Fast Monte Carlo algorithms for matrices II: Computing a low-rank approximation to a matrix. SIAM Journal on Computing, 36(1):158–183, 2006.

[107] Sanjiv Kumar, Mehryar Mohri, and Ameet Talwalkar. Sampling methods for the Nyström method. Journal of Machine Learning Research, 13(Apr):981–1006, 2012.

[108] Petros Drineas and Michael W. Mahoney. On the Nyström method for approximating a Gram matrix for improved kernel-based learning. J. Mach. Learn. Res., 6:2153–2175, December 2005.

[109] C. K. I. Williams, C. E. Rasmussen, A. Schwaighofer, and V. Tresp. Observations on the Nyström method for Gaussian process prediction. 2002.

[110] Evert J Nyström. Über die praktische Auflösung von Integralgleichungen mit Anwendungen auf Randwertaufgaben. Acta Mathematica, 54(1):185–204, 1930.

[111] Kai Zhang and James T Kwok. Clustered Nyström method for large scale manifold learning and dimension reduction. IEEE Transactions on Neural Networks, 21(10):1576–1587, 2010.

[112] Ameet Talwalkar, Sanjiv Kumar, and Henry Rowley. Large-scale manifold learning. In Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on, pages 1–8. IEEE, 2008.

[113] Charless Fowlkes, Serge Belongie, Fan Chung, and Jitendra Malik. Spectral grouping using the Nyström method. IEEE Trans. Pattern Anal. Mach. Intell., 26(2):214–225, January 2004.

[114] Mohamed-Ali Belabbas and Patrick J Wolfe. Fast low-rank approximation for covariance matrices. In Computational Advances in Multi-Sensor Adaptive Processing, 2007. CAMPSAP 2007. 2nd IEEE International Workshop on, pages 293–296. IEEE, 2007.

[115] Mohamed-Ali Belabbas and Patrick J Wolfe. On sparse representations of linear operators and the approximation of matrix products. In Information Sciences and Systems, 2008. CISS 2008. 42nd Annual Conference on, pages 258–263. IEEE, 2008.

[116] Sanjiv Kumar, Mehryar Mohri, and Ameet Talwalkar. On sampling-based approximate spectral decomposition. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML ’09, page 553–560, New York, NY, USA, 2009. Association for Computing Machinery.

[117] Mu Li, James T. Kwok, and Bao-Liang Lu. Making large-scale Nyström approximation possible. In Proceedings of the 27th International Conference on Machine Learning, ICML’10, page 631–638, Madison, WI, USA, 2010. Omnipress.

[118] Lester Mackey, Ameet Talwalkar, and Michael I. Jordan. Divide-and-conquer matrix factorization. In Proceedings of the 24th International Conference on Neural Information Processing Systems, NIPS’11, page 1134–1142, Red Hook, NY, USA, 2011. Curran Associates Inc.

[119] Kai Zhang, Ivor W. Tsang, and James T. Kwok. Improved Nyström low-rank approximation and error analysis. In Proceedings of the 25th International Conference on Machine Learning, ICML ’08, page 1232–1239, New York, NY, USA, 2008. Association for Computing Machinery.

[120] Ivan Kassal, James D Whitfield, Alejandro Perdomo-Ortiz, Man-Hong Yung, and Alán Aspuru-Guzik. Simulating chemistry using quantum computers. Annual review of physical chemistry, 62:185–207, 2011.

[121] Stephen P Jordan and Pawel Wocjan. Efficient quantum circuits for arbitrary sparse unitaries. Physical Review A, 80(6):062301, 2009.

[122] A Yu Kitaev. Quantum computations: algorithms and error correction. Russian Mathematical Surveys, 52(6):1191–1249, 1997.

[123] Mario Szegedy. Quantum speed-up of Markov chain based algorithms. In Foundations of Computer Science, 2004. Proceedings. 45th Annual IEEE Symposium on, pages 32–41. IEEE, 2004.

[124] Milton Abramowitz and Irene A Stegun. Handbook of mathematical functions: with formulas, graphs, and mathematical tables, volume 55. Courier Corporation, 1964.

[125] Robin Kothari. Efficient algorithms in quantum query complexity. PhD thesis, University of Waterloo, 2014.

[126] W. G. Bickley, L. J. Comrie, J. C. P. Miller, D. H. Sadler, and A. J. Thompson. Bessel Functions: Part II. Functions of Positive Integer Order. Cambridge University Press, 1960.

[127] F. W. J. Olver. Error analysis of Miller’s recurrence algorithm. Mathematics of Computation, 18(85):65–74, 1964.

[128] Cupjin Huang, Michael Newman, and Mario Szegedy. Explicit lower bounds on strong quantum simulation. arXiv preprint arXiv:1804.10368, 2018.

[129] Alan Frieze, Ravi Kannan, and Santosh Vempala. Fast Monte-Carlo algorithms for finding low-rank approximations. Journal of the ACM (JACM), 51(6):1025–1041, 2004.

[130] Ritsuo Nakamoto. A norm inequality for Hermitian operators. The American Mathematical Monthly, 110(3):238, 2003.

[131] Roy Mathias. Approximation of matrix-valued functions. SIAM journal on matrix analysis and applications, 14(4):1061–1063, 1993.

[132] Joel A. Tropp. User-friendly tail bounds for sums of random matrices. Found. Comput. Math., 12(4):389–434, August 2012.

[133] Alexei Borisovich Aleksandrov and Vladimir Vsevolodovich Peller. Operator Lipschitz functions. Russian Mathematical Surveys, 71(4):605, 2016.

[134] Daniel Hsu. Weighted sampling of outer products. arXiv preprint arXiv:1410.4429, 2014.

[135] Tianbao Yang, Yu-Feng Li, Mehrdad Mahdavi, Rong Jin, and Zhi-Hua Zhou. Nyström method vs random Fourier features: A theoretical and empirical comparison. In Advances in neural information processing systems, pages 476–484, 2012.

[136] Ali Rahimi and Benjamin Recht. Weighted sums of random kitchen sinks: Replacing minimization with randomization in learning. In Advances in neural information processing systems, pages 1313–1320, 2009.

[137] Nathan Halko, Per-Gunnar Martinsson, and Joel A Tropp. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM review, 53(2):217–288, 2011.

[138] Shusen Wang and Zhihua Zhang. Improving CUR matrix decomposition and the Nyström approximation via adaptive sampling. The Journal of Machine Learning Research, 14(1):2729–2769, 2013.

[139] Shusen Wang, Zhihua Zhang, and Tong Zhang. Towards more efficient SPSD matrix approximation and CUR matrix decomposition. The Journal of Machine Learning Research, 17(1):7329–7377, 2016.

[140] Zhikuan Zhao, Jack K Fitzsimons, and Joseph F Fitzsimons. Quantum assisted Gaussian process regression. arXiv preprint arXiv:1512.03929, 2015.

[141] Seth Lloyd, Silvano Garnerone, and Paolo Zanardi. Quantum algorithms for topological and geometric analysis of big data. arXiv preprint arXiv:1408.3106, 2014.

[142] Juan Miguel Arrazola, Alain Delgado, Bhaskar Roy Bardhan, and Seth Lloyd. Quantum-inspired algorithms in practice. arXiv preprint arXiv:1905.10415, 2019.

[143] Yogesh Dahiya, Dimitris Konomis, and David P Woodruff. An empirical evaluation of sketching for numerical linear algebra. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 1292–1300, 2018.

[144] Jacob Biamonte, Peter Wittek, Nicola Pancotti, Patrick Rebentrost, Nathan Wiebe, and Seth Lloyd. Quantum machine learning. Nature, 549(7671):195, 2017.

[145] Rocco A Servedio and Steven J Gortler. Equivalences and separations between quantum and classical learnability. SIAM Journal on Computing, 33(5):1067–1092, 2004.

[146] Srinivasan Arunachalam and Ronald De Wolf. Optimal quantum sample complexity of learning algorithms. The Journal of Machine Learning Research, 19(1):2879–2878, 2018.

[147] Seth Lloyd, Masoud Mohseni, and Patrick Rebentrost. Quantum algorithms for supervised and unsupervised machine learning. arXiv preprint arXiv:1307.0411, 2013.

[148] Nathan Wiebe, Alex Bocharov, Paul Smolensky, Krysta Svore, and Matthias Troyer. Quantum language processing. arXiv preprint arXiv:1902.05162, 2019.

[149] Jarrod R McClean, Sergio Boixo, Vadim N Smelyanskiy, Ryan Babbush, and Hartmut Neven. Barren plateaus in training landscapes. Nature communications, 9(1):4812, 2018.

[150] Maria Schuld, Alex Bocharov, Krysta Svore, and Nathan Wiebe. Circuit-centric quantum classifiers. arXiv preprint arXiv:1804.00633, 2018.

[151] Mária Kieferová and Nathan Wiebe. Tomography and generative training with quantum Boltzmann machines. Phys. Rev. A, 96:062327, Dec 2017.

[152] Maria Schuld and Nathan Killoran. Quantum machine learning in feature Hilbert spaces. Physical Review Letters, 122(4):040504, 2019.

[153] Jonathan Romero, Jonathan P Olson, and Alan Aspuru-Guzik. Quantum autoencoders for efficient compression of quantum data. Quantum Science and Technology, 2(4):045001, 2017.

[154] Anthony Chefles and Stephen M Barnett. Strategies and networks for state-dependent quantum cloning. Physical Review A, 60(1):136, 1999.

[155] Hilbert J Kappen. Learning quantum models from quantum or classical data. arXiv preprint arXiv:1803.11278, 2018.

[156] Mohammad H Amin, Evgeny Andriyash, Jason Rolfe, Bohdan Kulchytskyy, and Roger Melko. Quantum Boltzmann machine. Physical Review X, 8(2):021050, 2018.

[157] Daniel Crawford, Anna Levit, Navid Ghadermarzy, Jaspreet S Oberoi, and Pooya Ronagh. Reinforcement learning using quantum Boltzmann machines. arXiv preprint arXiv:1612.05695, 2016.

[158] Marcello Benedetti, John Realpe-Gómez, Rupak Biswas, and Alejandro Perdomo-Ortiz. Quantum-assisted learning of hardware-embedded probabilistic graphical models. Physical Review X, 7(4):041052, 2017.

[159] Nathan Wiebe, Ashish Kapoor, Christopher Granade, and Krysta M Svore. Quantum inspired training for Boltzmann machines. arXiv preprint arXiv:1507.02642, 2015.

[160] Emile Aarts and Jan Korst. and Boltzmann machines. 1988.

[161] Ruslan Salakhutdinov, Andriy Mnih, and . Restricted Boltzmann machines for collaborative filtering. In Proceedings of the 24th international conference on Machine learning, pages 791–798. ACM, 2007.

[162] Tijmen Tieleman. Training restricted Boltzmann machines using approximations to the likelihood gradient. In Proceedings of the 25th international conference on Machine learning, pages 1064–1071. ACM, 2008.

[163] Nicolas Le Roux and . Representational power of restricted Boltzmann machines and deep belief networks. Neural computation, 20(6):1631–1649, 2008.

[164] Ruslan Salakhutdinov and Geoffrey Hinton. Deep Boltzmann machines. In Artificial intelligence and statistics, pages 448–455, 2009.

[165] Ruslan Salakhutdinov and Hugo Larochelle. Efficient learning of deep Boltzmann machines. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pages 693–700, 2010.

[166] Honglak Lee, Roger Grosse, Rajesh Ranganath, and Andrew Y Ng. Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In Proceedings of the 26th annual international conference on machine learning, pages 609–616. ACM, 2009.

[167] Geoffrey E Hinton. A practical guide to training restricted Boltzmann machines. In Neural networks: Tricks of the trade, pages 599–619. Springer, 2012.

[168] Honglak Lee, Peter Pham, Yan Largman, and Andrew Y Ng. Unsupervised feature learning for audio classification using convolutional deep belief networks. In Advances in neural information processing systems, pages 1096–1104, 2009.

[169] Abdel-rahman Mohamed, George E Dahl, and Geoffrey Hinton. Acoustic modeling using deep belief networks. IEEE transactions on audio, speech, and language processing, 20(1):14–22, 2011.

[170] Nitish Srivastava and Ruslan R Salakhutdinov. Multimodal learning with deep Boltzmann machines. In Advances in neural information processing systems, pages 2222–2230, 2012.

[171] Giuseppe Carleo and Matthias Troyer. Solving the quantum many-body problem with artificial neural networks. Science, 355(6325):602–606, 2017.

[172] Giacomo Torlai and Roger G Melko. Learning thermodynamics with Boltzmann machines. Physical Review B, 94(16):165134, 2016.

[173] Yusuke Nomura, Andrew S Darmawan, Youhei Yamaji, and Masatoshi Imada. Restricted Boltzmann machine learning for solving strongly correlated quantum systems. Physical Review B, 96(20):205152, 2017.

[174] Dorit Aharonov, Vaughan Jones, and Zeph Landau. A polynomial quantum algorithm for approximating the Jones polynomial. Algorithmica, 55(3):395–421, 2009.

[175] Joran van Apeldoorn, András Gilyén, Sander Gribling, and Ronald de Wolf. Quantum SDP-solvers: Better upper and lower bounds. arXiv preprint arXiv:1705.01843, 2017.

[176] Scott Aaronson. The learnability of quantum states. In Proceedings of the Royal Society of London A: Mathematical, Physical and Engineering Sciences, volume 463, pages 3089–3114. The Royal Society, 2007.

[177] Yunchao Liu, Srinivasan Arunachalam, and Kristan Temme. A rigorous and robust quantum speed-up in supervised machine learning. arXiv preprint arXiv:2010.02174, 2020.

[178] Hsin-Yuan Huang, Michael Broughton, Masoud Mohseni, Ryan Babbush, Sergio Boixo, Hartmut Neven, and Jarrod R McClean. Power of data in quantum machine learning. arXiv preprint arXiv:2011.01938, 2020.

[179] Gilles Brassard, Peter Hoyer, Michele Mosca, and Alain Tapp. Quantum amplitude amplification and estimation. Contemporary Mathematics, 305:53–74, 2002.

[180] LD Landau and EM Lifshitz. Statistical physics, vol. 5. Course of theoretical physics, 30, 1980.

[181] David Poulin and Pawel Wocjan. Sampling from the thermal quantum Gibbs state and evaluating partition functions with a quantum computer. Physical Review Letters, 103(22):220502, 2009.

[182] Anirban Narayan Chowdhury and Rolando D Somma. Quantum algorithms for and hitting-time estimation. arXiv preprint arXiv:1603.02940, 2016.

[183] Man-Hong Yung and Alán Aspuru-Guzik. A quantum–quantum Metropolis algorithm. Proceedings of the National Academy of Sciences, 109(3):754–759, 2012.