Faster quantum-inspired algorithms for solving linear systems

Changpeng Shao∗1 and Ashley Montanaro†1,2

1School of Mathematics, University of Bristol, UK 2Phasecraft Ltd. UK

March 19, 2021

Abstract We establish an improved classical algorithm for solving linear systems in a model analogous to the QRAM that is used by quantum linear solvers. Precisely, for the linear system Ax = b, we show that there is a classical algorithm that outputs a data structure for x allowing sampling and querying to the entries, where x is such that kx−A−1bk ≤ kA−1bk. This output can be viewed as a classical analogue to the output of quantum linear solvers. The complexity of our algorithm 6 2 2 −1 −1 is Oe(κF κ / ), where κF = kAkF kA k and κ = kAkkA k. This improves the previous best 6 6 4 algorithm [Gily´en,Song and Tang, arXiv:2009.07268] of complexity Oe(κF κ / ). Our algorithm is based on the randomized Kaczmarz method, which is a particular case of stochastic gradient descent. We also find that when A is row sparse, this method already returns an approximate 2 solution x in time Oe(κF ), while the best known returns |xi in time Oe(κF ) when A is stored in the QRAM data structure. As a result, assuming access to QRAM and if A is row sparse, the speedup based on current quantum algorithms is quadratic.

1 Introduction

Since the discovery of the Harrow-Hassidim-Lloyd algorithm [16] for solving linear systems, many quantum algorithms for solving problems were proposed, e.g. [2, 19–21, 24, 31]. Most of them claimed that quantum computers could achieve exponential speedups over classical algorithms. However, as shown by Tang [36] and some subsequent works [5–7,10,11,17,35] (currently known as quantum-inspired algorithms), for many previously discovered

arXiv:2103.10309v1 [quant-ph] 18 Mar 2021 algorithms, the quantum computer actually only achieves a polynomial speedup in the dimension. Tang’s work makes us reconsider the advantages of quantum computers in machine learning. For instance, one may wonder what is the best possible separation between quantum and classical computers for solving a given machine learning problem? Here we focus on the solving of linear systems, which is a fundamental problem in machine learning. For a linear system Ax = b, the currently best quantum algorithm seems to be the one building on the technique of block-encoding. For simplicity, assume that the singular values of A + lie in [1/κ, 1], where κ is the condition number of A. Let α ∈ R , if A/α is efficiently encoded as a block of a unitary, the complexity to obtain |A−1bi up to error  is Oe(ακ)[4], where the Oe notation

[email protected][email protected]

1 hides logarithmic factors in all parameters (including κ,  and the size of A). For example, when 1 we have the QRAM data structure for A, we can take α = kAkF . When A is s-sparse and given in the sparse-access input model, we can take α = skAkmax. In the classical case, the quantum- 6 22 6 −1 inspired algorithm given in [6] costs Oe(κF κ / ) to output a classical analogue to |A bi, where −1 2 6 6 4 κF = kAkF kA k, and  is the accuracy. This result was recently improved to Oe(κF κ / )[11]. Although the quantum-inspired algorithms are polylog in the dimension, the heavy dependence on κF , κ,  may make them hard to use in practice [1]. So it is necessary to improve the efficiency of quantum-inspired algorithms. Meanwhile, this can help us better understand the quantum advantages in certain problems. QRAM is a widely used data structure in many quantum machine learning algorithms (e.g. [19– 21]). It allows us efficiently to prepare a quantum state corresponding to the data and to perform quantum linear algebraic operations. The quantum-inspired algorithms usually assume a data structure which is a classical analogue to the QRAM. For a vector, this data structure allows us to do the sampling and query operations to the entries. The sampling operation can be viewed as measurements of the quantum state. As for the problem of solving linear systems, in the QRAM model, the input of quantum algorithms is the QRAM data structures for A and b, and the output is |xi such that k|xi − |A−1bik ≤ . In the classical setting, the input of quantum-inspired algorithms is the classical analogue of QRAM data structures for A and b, and the output is a data structure for x such that kx − A−1bk ≤ kA−1bk. This makes the comparison of quantum/quantum-inspired algorithms reasonable. There are also some other quantum models, such as the sparse-access input model. This allows efficient quantum algorithms for solving sparse linear systems [8]. However, because of the similarity between QRAM and the model used in quantum-inspired algorithms, in this paper we will mainly focus on the QRAM model for quantum linear solvers. Randomized iterative methods have received a lot of attention recently in solving linear systems because of their connection to the stochastic gradient descent method. The randomized Kaczmarz method [34] is a typical example. It was first proposed by Kaczmarz in 1937 [18], and rediscovered by Gordon, Bender and Herman in 1970 [14]. After that, there appear many generalizations [15, 26, 27, 32]. Kaczmarz method is a particular case of the stochastic gradient descent algorithm for solving linear systems. In its simplest version, at each step of iteration, it randomly chooses a row of the matrix with probability corresponding to the norm of that row, then finds the closest solution that satisfies this linear constraint. Geometrically, the next point is the orthogonal projection of the previous one onto the hyperplane defined by the chosen linear constraint. The simple iterative structure and clear geometric meaning greatly simplifies the analysis of this method and makes it popular in applications like computed tomography [3]. Usually, the randomized iterative methods need to do some importance sampling according to the norms of the rows. Sampling access is made available by the data structure for quantum-inspired algorithms and the QRAM for quantum linear solvers. So in this paper, for the fairness of comparison we also assume a similar data structure for the randomized iterative methods.

1.1 Main results

In this paper, we propose two quantum-inspired algorithms for solving linear systems. One is based on the randomized Kaczmarz method [34], and the other one is based on the randomized

1 In fact, α = µ(A) for some µ(A) ≤ kAkF (see [12]). The definition of µ(A) is quite complicated. In this paper, for convenience of the comparison we just use kAkF . 2 In the literature, κF does not have a standard name. It is named the scaled condition number in [15, 34].

2 coordinate descent method [22]. The second algorithm only works for symmetric positive-definite (SPD) matrices. Our results are summarized in Table1.

Complexity Reference Assumptions

Quantum algorithm Oe(κF ) [4], 2018 Oe(sκ2 ) [34], 2009 row sparse Randomized classical algorithm F Oe(sTr(A)kA−1k) [22], 2010 sparse, SPD 6 6 4 Oe(κF κ / ) [11], 2020 6 2 2 Quantum-inspired algorithm Oe(κF κ / ) Theorem 12 Oe(Tr(A)3kA−1k3κ/2) Theorem 13 SPD

Table 1: Comparison of different algorithms for solving the linear system Ax = b, where κF = −1 −1 kAkF kA k, κ = kAkkA k and s is the row sparsity. SPD = symmetric positive definite.

In the table, the quantum algorithm is based on the QRAM model. The randomized classical and quantum-inspired algorithms assume a data structure that is a classical analogue to the QRAM (see Section3). So all those algorithms are based on similar data structures, which make the comparison in the table fair enough.

6 2 2 • When A is dense, the complexity of our algorithm is Oe(κF κ / ). If A is additionally SPD, the complexity becomes Oe(Tr(A)3kA−1k3κ/2). This reduces the complexity of the previous quantum-inspired algorithm given by Gily´en et al. [11] by at least a 4th power in terms of the dependence on κ, and improves the dependence on 1/ from quartic to quadratic. One result not shown in the table is about the description of the solution. In [11], the authors define an s-sparse description of a vector x as an s-sparse vector y such that x = A†y. The sparsity of y 2 2 2 determines the runtime of querying one entry of x. In [11], they obtain an Oe(κF κ / )-sparse 6 2 4 description of the approximate solution in cost Oe(κF κ / ). In comparison, we obtain an 2 6 2 2 Oe(κF )-sparse description in cost Oe(κF κ / ) by Theorem 12. • When A is row sparse, i.e., each row is sparse, the randomized Kaczmarz method can find 2 an approximate sparse solution in time Oe(sκF ), where s is the row sparsity. Moreover, the output is a vector, which is stronger than the outputs of quantum and quantum-inspired algorithms. Assuming s = Oe(1), this result means that the quantum speedup of solving row sparse linear systems is quadratic if using the QRAM data structure. It also means that to explore large quantum speedups in solving sparse linear systems, we may need to move our attentions to other quantum models. For example, it is known that when A is row and column −1 sparse, the quantum algorithm has complexity Oe(skAkmaxkA k) in the sparse access input model [4], which is better than the quantum result we show in the table.

• When A is sparse and SPD, the randomized coordinate descent method can find an approx- imate sparse solution (also a sparse vector) in time Oe(sTr(A)kA−1k). In this special case, it may happen that quantum computers do not achieve a speedup in solving linear systems. For example, suppose A is a density matrix with kAk = Θ(1), s = Oe(1), then the randomized coordinate descent method costs Oe(κ), which is also the cost of the quantum linear solver. As shown in [30], to solve a sparse SPD linear system in a quantum computer, the dependence on κ is at least linear. However, for some special SPD linear systems, quantum algorithms

3 can achieve quadratic speedups in terms of κ.3 However, the special cases mentioned in [30] seem not directly related to the case we mentioned above.

• In Table1, the complexity of the quantum algorithm is the cost to obtain the quantum state corresponding to the solution. For quantum-inspired algorithms, the table shows the complexity to obtain the data structure for the solution that allows sampling and querying. This output can be viewed as a classical analogue to the output of the quantum algorithm. If we want to estimate the norm of the solution, then there is an extra factor 1/ for the quantum algorithm [4] and an extra factor 1/2 for the quantum-inspired algorithms [6] (also see Lemma 7). However, for the randomized classical algorithms, the complexity does not change because their outputs are sparse vectors whose sparsity has the same order as the . So to estimate the norm of the solution, the quantum and quantum-inspired algorithms are worse than the randomized classical algorithms in terms of the precision in the row sparse case.

• For solving a row sparse linear system, the table shows that the complexities of quantum and classical randomized algorithms are logarithmic in the precision . However, if we write down the complexity more precisely, the classical randomized algorithms may have better depen- dence on , at least compared with the currently best quantum algorithm. Indeed, for the classical randomized algorithms, the dependence on  is O(log(1/)), while it is O(log2(1/)) for the quantum algorithm [4]. If the linear system is also column sparse, the dependence on  can be reduced to O(log(1/)) in the quantum case [23].

1.2 Summary of the techniques

Our main algorithm is based on a generalization of Kaczmarz method. As briefly introduced above (also see Section2 for more details), the Kaczmarz method aims to solve one linear constraint at each step of the iteration, which can be viewed as a particular case of the stochastic gradient method. To design an efficient quantum-inspired algorithm, we concentrate on the dual form of the Kaczmarz method. Namely, introducing y such that x = A†y, then focusing on the iteration on y. The linear system Ax = b can be reformulated as the least squares problem (LSP) x = arg minx kAx − bk. It turns out that y converges to the optimal solution of the dual problem of the LSP. Our idea is to apply the iterative method to find a sparse approximate solution y of the dual problem. The sparsity is guaranteed by the iterative method. Now x = A†y gives an approximate solution of the LSP. Since y is sparse, it is easy to query an entry of x. To sample an entry of x, we can use the rejection sampling idea. The technical part of this paper is the complexity analysis of the rejection sampling method. However, it turns out that the main cost of the above algorithm comes from the computation of y, whose complexity analysis is straightforward. It is time-consuming if we directly use the iterative method because it relates to the calculation of inner products. To alleviate this, we modify the iteration by using the sampling idea. We prove that this modified method has the same convergence rate as the original one by choosing some parameters appropriately. Notation. For a matrix A, we use A† to denote its conjugate transpose, A−1 to denote its −1 Moore-Penrose inverse. The scaled condition number is κF = kAkF kA k, and the condition −1 number is κ = kAkkA k, where k · kF is the Frobenius norm and k · k is the spectral norm. The

3In [30], the authors mentioned two special cases. Case 1: Access to a block-encoding of I−ηA and kA−1bk ∈ O(κ), where η ∈ (0, 1]. Case 2: A is a sum of SPD local Hamiltonians, b is sparse and a parameter γ ∈ O(1) that quantifies the overlap of b with the subspace where A is non-singular.

4 row sparsity of A is defined as the maximum of the number of nonzero zero entries of any row. For a unit vector v, we sometimes denote it as |vi, and denote v† as hv|. For two vectors a, b, we use ha|bi or a · b to denote their inner product. If m is an integer, we define [m] := {1, 2, . . . , m}. For matrix A, we denote its i-th row as Ai∗, its j-th column as A∗j and the (i, j)-th entry as Ai,j. For a vector v, we denote the i-th component as vi. We use {e1,..., em} to enote the standard basis m of C .

2 Randomized Kaczmarz algorithm

Let A be an m×n matrix, b be an m×1 vector. Solving the linear system Ax = b can be described as a quadratic optimization problem, i.e., the so-called least squares problem (LSP)

arg min kAx − bk2. (1) x∈Cn Classically, there are many algorithms for solving the LSP [13]. The best known classical algorithm might be the conjugate-gradient (CG) method. Compared to CG, the Kaczmarz algorithm has received much attention recently due to its simplicity. For instance, at each step of iteration, CG uses the whole information of the matrix A and O(mn) operations, while Kaczmarz method only uses one row of A and O(n) operations. So Kaczmarz method becomes effective when the entire matrix (data set) cannot be loaded into memory.

Definition 1 (Randomized Kaczmarz algorithm [34]). Let x0 be an arbitrary initial approximation to the solution of (1). For k = 0, 1,..., compute

brk − hArk∗|xki xk+1 = xk + 2 Ark∗, (2) kArk∗k where Ark∗ is the rk-th row of A, and rk is chosen from [m] randomly with probability proportional 2 to kArk∗k .

The Kaczmarz algorithm has a clear geometric meaning, see Figure1. Geometrically, xk+1 is the orthogonal projection of xk onto the hyperplane Ark∗ · x = brk . So at each step of iteration, this method finds the closest vector that satisfies the chosen linear constraint.

A2∗ · x = b2

x2 x0

x4

x6

A1∗ · x = b1 x∗ x5 x3 x1

Figure 1: Illustration of Kaczmarz algorithm when m = 2, where x∗ is the optimal solution.

From the definition, we can simplify the notation by defining

˜ −1 −1 b = diag(kAi∗k : i ∈ [m])b, Ae = diag(kAi∗k : i ∈ [m])A, (3)

5 ˜ so that bi = bi/kAi∗k and Aei∗ = Ai∗/kAi∗k. Then we can rewrite (2) as

˜    ˜ xk+1 = xk + brk − hAerk∗|xki Aerk∗ = I − |Aerk∗ihAerk∗| xk + brk Aerk∗. (4)

In [33], the Kaczmarz algorithm has been quantized based on the description (4). As an orthogonal projector, it is not hard to extend I − |Aei∗ihAei∗| to a block of a unitary operator. For example, a unitary of the form I ⊗ (I − |Aei∗ihAei∗|) + X ⊗ |Aei∗ihAei∗|, where X is Pauli-X. This unitary is closely related to the Grover diffusion operator when we apply it to |−i|xi.

Lemma 2 (Theorem 2 of [34]). Let x∗ be the solution of (1). Then the randomized Kaczmarz algorithm converges to x∗ in expectation, with the average error

2 −2 T 2 E[kxT − x∗k ] ≤ (1 − κF ) kx0 − x∗k , (5)

−1 where κF = kAkF kA k.

The above result holds for all x0, so throughout this paper we shall assume x0 = 0 for simplicity. 2 In the above lemma, by Markov’s inequality, with high probability 0.99, we have kxT − x∗k ≤ −2 T 2 2 2 100(1−κF ) kx∗k . To make sure the error is bounded by  kx∗k with high probability, it suffices to choose 2 T = O(κF log(1/)). (6) Pk−1 From (2), we know that there exist yk,0, . . . , yk,k−1 such that xk = j=0 yk,jAr(j)∗, that is † xk = A yk for some vector yk. One obvious fact is that yk is at most k-sparse. In [11], yk is called the sparse description of xk. So this description comes easily in the Kaczmarz method. Actually, yk converges to the optimal solution of the dual problem of the LSP (1). More precisely, an alternative formulation of LSP is to find the least-norm solution of the linear system 1 min kxk2, s.t.Ax = b. x∈Cn 2 The dual problem takes the form: 1 min g(y) := kA†yk2 − b · y. (7) y∈Cm 2 The randomized Kaczmarz algorithm (2) is equivalent to one step of the stochastic gradient descent 1 2 2 method [28] applied to 2 (Ark∗ · x − brk ) with stepsize 1/kArk∗k . It is also equivalent to one step of the randomized coordinate descent method [29] applied to the dual problem (7). Namely, when † we apply this method to the rk-th component, the gradient is ∇rk g = hArk∗|A |yki − brk . Setting 2 the stepsize as 1/kArk∗k leads to the updating

† brk − hArk∗|A |yki yk+1 = yk + 2 erk , (8) kArk∗k

m where {e1,..., em} is the standard basis of C . We can recover the original Kaczmarz iteration (2) by multiplying A† on both sides of (8). The Kaczmarz method can be generalized to use multiple rows at each step of iteration [15]. Instead of orthogonally projecting to a hyperplane determined by one linear constraint, we can orthogonally project to a vector space defined by several linear constraints. This idea is similar

6 to the minibatch stochastic gradient descent. However, this generalization needs to compute the pseudoinverse of certain matrices. A simple pseudoinverse-free variant is as follows [26]:

α X bi − hAi∗|xki xk+1 = xk + 2 Ai∗, (9) 2q kAi∗k i∈Tk where Tk is a random set of q row indices sampled with replacement, α is known as the relaxation 2 parameter. Each index i ∈ [m] is put into Tk with probability proportional to kAi∗k . For instance, let α = 2, then (9) can be rewritten in a parallel form   1 X bi − hAi∗|xki xk+1 = xk + 2 Ai∗ . q kAi∗k i∈Tk

2 2 It was proved in [26, Corollary 3] that if α = q = kAkF /kAk (also see a related proof in Appendix B), 2 −2T 2 E[kxT − x∗k ] ≤ 1 − κ kx0 − x∗k . (10) So the number of iterations required to achieve error  via (9) is O(κ2 log(1/)).

3 Preliminaries of quantum-inspired algorithms

This section briefly recalls some definitions and results about quantum-inspired algorithms that will be used in this paper. The main reference is [6].

n Definition 3 (Query access). For a vector v ∈ C , we have Q(v), query access to v if for all i ∈ [n], m×n we can query vi. Likewise, for a matrix A ∈ C , we have Q(A) if for all (i, j) ∈ [m] × [n], we can query Aij. Let q(v) (or q(A)) denote the (time) cost of such a query.

n Definition 4 (Sampling and query access to a vector). For a vector v ∈ C , we have SQ(v), sampling and query access to v, if we can:

n 1. Sample: obtain independent samples i ∈ [n] following the distribution Dv ∈ R , where Dv(i) = 2 2 |vi| /kvk . 2. Query: query entries of v as in Q(v);

3. Norm: query kvk.

Let q(v), s(v), and n(v) denote the cost of querying entries, sampling indices, and querying the norm respectively. Further define sq(v) := q(v) + s(v) + n(v).

m×n Definition 5 (Sampling and query access to a matrix). For any matrix A ∈ C , we have SQ(A) if we have SQ(Ai∗) for all i ∈ [m] and SQ(a), where a is the vector of row norms. We define sq(A) as the cost to obtain SQ(A).

The sampling and query access defined above can be viewed as a classical analogue to the QRAM data structure, which is widely used in many quantum machine learning (QML) algorithms. In QML, this data structure can be used to efficiently prepare the quantum states of vectors, which are usually the inputs of the QML algorithms. For matrices, it allows us to find an efficient block- encoding so that many quantum linear algebraic techniques become effective. However, the QRAM

7 (and Definition5) is a quite strong data structure because it contains a lot of information (e.g. kAi∗k and kAkF ) about A, which is usually not easy to obtain. For instance, it usually needs Θ(kAk0) operations to build this data structure, which kAk0 is the number of nonzero entries of A. For quantum-inspired algorithms, we need to use certain sampling techniques to make sure that the complexity has a polylog dependence on the dimension, so it seems quite natural to use the above data structure. In comparison, there are also some other efficient quantum models, e.g., sparse-access model, in which the QML algorithms are still efficient. The sampling operation on v can be viewed as a classical analogue of the measurement of the −1 P quantum state |vi = kvk i∈[n] vi|ii in the computational basis. For some problems, we do not have the state |vi exactly. Instead, what we have is a state of the form sin(θ)|vi|0i + cos(θ)|wi|1i. 2 vi 2 Then the probability of obtaining |i, 0i is kvk2 sin (θ). This leads to the definition of oversampling for quantum-inspired algorithms. n Definition 6 (Oversampling and query access). For v ∈ C and φ ≥ 1, we have SQφ(v), φ- n oversampling and query access to v, if we have Q(v) and SQ(v˜) for a vector v˜ ∈ C satisfying 2 2 2 2 kv˜k = φkvk and |v˜i| ≥ |vi| for all i ∈ [n]. We denote sφ(v) = s(v˜), qφ(v) = q(v˜), nφ(v) = n(v˜), and sqφ(v) = sφ(v) + qφ(v) + q(v) + nφ(v). For example, if we view v as the vector (v, 0,..., 0) and v˜ as the vector (v, kvk cot(θ)w) so that they have the same dimension, then it is easy to see that kv˜k2 = kvk2/ sin2(θ) and φ = 1/ sin2(θ). 2 2 The condition |v˜i| ≥ |vi| is also satisfied. But the above definition is stronger than this intuition.

Lemma 7 (Lemma 2.9 of [6]). Given SQφ(v) and δ ∈ (0, 1] we can sample from Dv with success probability at least 1 − δ and cost O(φsqφ(v) log(1/δ)). We can also estimate kvk up to relative −2 error  in time O( φsqφ(v) log(1/δ)). P Lemma 8 (Lemma 2.10 of [6]). Given SQϕi (vi) and λi ∈ C for i ∈ [k], denote v = i λivi, then P 2 2 P we have SQφ(v) for φ = k i ϕikλivik /kvk and sqφ(v) = maxi sϕi (vi) + i q(vi). The first lemma tells us how to construct the original sampling access from an over-sampled one. By Definition6, given SQφ(v), we have Q(v). Lemma7 tells us how to obtain S(v) from SQφ(v). So this lemma actually tells us how to obtain SQ(v) from SQφ(v). The second lemma is about how to construct a linear combination of the data structures. Equivalently, it is about constructing the data structure for Mc when we have the data structures of M and c, where M is a matrix and c is a vector.

In our algorithms below, vi will be a row of A, k will refer to the sparsity of a certain known vector and v will be the approximate solution to the linear system. So in Lemma8, sqφ(v) = O(ksq(A)). Combing the above two lemmas, given SQ(vi) and λi, the cost to sample and query v is Oe(kφsq(A)), and the cost to estimate its norm is Oe(kφsq(A)/2). So a main difficulty in the the complexity analysis is the estimation of φ.

4 The algorithm for solving linear systems

4.1 Solving general linear systems

In this section, based on the Kaczmarz method, we give a quantum-inspired classical algorithm for solving linear systems. The basic idea is similar to [11]. However, our result has a lower complexity and the analysis is much simpler. The main idea is as follows: we first use the iteration (8) to † compute yT , then output SQ(xT ) with xT = A yT by Lemmas7 and8.

8 † † When A is dense, the calculation of hArk∗|A |yki in (8) is expensive. Note that hArk∗|A |yki = Pn j=1 Ark,jhA∗j|yki. As an inner product, we can use the sampling idea (e.g. Monte Carlo) to approximate it. Now let j1, . . . , jd be drawn randomly from [n] under the distribution that Pr[j] = 2 2 kA∗jk /kAkF . This distribution is designed to put more weight on the most important rows [9,34]. Let S = {j1, . . . , jd} be the sequence of the chosen indices. We modify the updating of y to:   2 1 ˜ 1 X ˜ kAkF yk+1 = yk + brk − Ark,jhA∗j|yki 2  erk . (11) kAr ∗k d kA∗jk k j∈S

For conciseness, define the matrix D as

2 X kAkF D = 2 |jihj|. (12) dkA∗jk j∈S Then ˜ ˜ † brk − hArk∗|DA |yki yk+1 = yk + erk (13) kArk∗k ˜ ˜ † ˜ † brk − hArk∗|A |yki hArk∗|(I − D)A |yki = yk + erk + erk . (14) kArk∗k kArk∗k Compared to the original iteration (8), the procedure (14) contains a perturbation term. To analyze its convergence rate, we need to determine d. Before that, we first analyze the properties of the following random variable:

2 ˜ † ˜ ˜ 1 X ˜ kAkF µk := hArk∗|(I − D)A |yki = hArk∗|I − D|xki = hArk∗|xki − Ark,jxk,j 2 . (15) d kA∗jk j∈S

2 2 kAkF kxkk Lemma 9. The mean and the variance of µk satisfy D[µk] = 0, VarD[µk] ≤ 2 . E d minj∈[n] kA∗j k

Proof. The results follow from the standard calculation.   2 ˜ 1 X ˜ kAkF ED[µk] = hArk∗|xki − ED  Ark,jxk,j 2  d kA∗jk j∈S  2  ˜ ˜ kAkF = hArk∗|xki − ED Ark,jxk,j 2 kA∗jk n 2 2 X kAk kA∗jk = hA˜ |x i − A˜ x F rk∗ k rk,j k,j kA k2 kAk2 j=1 ∗j F = 0.   2 1 X ˜ kAkF VarD[µk] = VarD  Ark,jxk,j 2  d kA∗jk j∈S  2  X 1 ˜ kAkF = VarD Ark,jxk,j 2 d kA∗jk j∈S " 2 2# 1 ˜ kAkF ≤ ED Ark,jxk,j 2 d kA∗jk

9 n  2 2 2 1 X kAk kA∗jk = A˜ x F d rk,j k,j kA k2 kAk2 j=1 ∗j F n 1 X kAk2 = A˜2 x2 F d rk,j k,j kA k2 j=1 ∗j 2 1 kAkF 2 ≤ 2 kxkk . d minj∈[n] kA∗jk

The last inequality is caused by the fact that A˜ is normalized so that A˜2 ≤ 1. rk∗ rk,j

† From (14), the updating of xk = A yk then becomes ˜ ˜ ˜ ˜ ˜ xk+1 = xk + (brk − hArk∗|xki)Ark∗ + hArk∗|I − D|xkiArk∗. (16) Next, we prove a similar result to Lemma2. The idea of our proof is similar to that of Lemma2 from [34], which nicely utilizes the geometrical structure of the Kaczmarz method.

2 2 2 4kAkF κF (log 2/ ) 2 2 Proposition 10. Choose d = 2 2 , then after T = κ (log 2/ ) steps of iteration of the  minj∈[n] kA∗j k F 2 2 2 procedure (16), we have E[kxT − x∗k ] ≤  kx∗k .

Proof. For any vector v such that Av 6= 0 we have kvk2 ≤ kA−1k2kAvk2. This implies

2 2 m 2 kvk kvk X kAi∗k = ≤ hA˜ |vi2 = [hA˜ |vi2]. (17) κ2 kAk2 kA−1k2 i∗ kAk2 E i∗ F F i=1 F Here the random variable is defined by

2 kAi∗k Pr[i] = 2 . kAkF ˜ For simplicity, we set the initial vector x0 = 0. In procedure (16), we denote x˜k+1 = xk + (brk − ˜ ˜ ˜ ˜ hArk∗|xki)Ark∗, it is the orthogonal projection of xk onto the hyperplane Ark∗ · x = brk . Thus we can rewrite the iteration as ˜ xk+1 = x˜k+1 + µkArk∗. The orthogonality implies (see Figure2)

2 2 2 kxk+1 − x∗k = kx˜k+1 − x∗k + µk 2 2 2 = kxk − x∗k − kxk − x˜k+1k + µk 2 ˜ 2 2 = kxk − x∗k − hxk − x∗|Ark∗i + µk   xk − x∗ ˜ 2 2 2 = 1 − h |Ark∗i kxk − x∗k + µk. kxk − x∗k Taking the expectation yields    2 xk − x∗ ˜ 2 2 2 E[kxk+1 − x∗k ] = 1 − E h |Ark∗i E[kxk − x∗k ] + E[µk]. kxk − x∗k

In the above, the expectation is taken first on D by fixing k, rk, then on rk by fixing k, and finally on k. Also if A(xk − x∗) = 0, then xk is already the optimal solution, so we assume that this does not happen before convergence.

10 xk ˜ ˜ Ark∗ · x = brk

µk x˜k+1 xk+1

x∗

Figure 2: Illustration of the orthogonality in the proof of Proposition 10.

By equation (17),   xk − x∗ ˜ 2 −2 1 − E h |Ark∗i ≤ 1 − κF . kxk − x∗k By Lemma9, we obtain 2 2 −2 2 kAkF 2 E[kxk+1 − x∗k ] ≤ 1 − κF E[kxk − x∗k ] + 2 E[kxkk ] d minj∈[n] kA∗jk 2 −2 2 2kAkF 2 2 ≤ 1 − κF E[kxk − x∗k ] + 2 (kx∗k + E[kxk − x∗k ]) d minj∈[n] kA∗jk  2  2 −2 2kAkF 2 2kAkF 2 = 1 − κF + 2 E[kxk − x∗k ] + 2 kx∗k . d minj∈[n] kA∗jk d minj∈[n] kA∗jk

2 2 2 4kAkF κF (log 2/ ) Now we choose d = 2 2 , then the above estimation leads to  minj∈[n] kA∗j k   2 2 −2 1 −2 2 2 −1 2  2 E[kxk+1 − x∗k ] ≤ 1 − κF + κF  (log 2/ ) E[kxk − x∗k ] + 2 2 kx∗k . 2 2κF (log 2/ ) −2 1 −2 2 2 −1 Setting ρ := 1 − κF + 2 κF  (log 2/ ) which is less than 1, we finally obtain T −1 2 X [kx − x k2] ≤ ρT [kx − x k2] + kx k2 ρi E T ∗ E 0 ∗ 2κ2 (log 2/2) ∗ F i=0 2 T 2 T  2 ≤ ρ kx∗k + 2 2 kx∗k . 2κF (log 2/ ) 2 2 Let T = κF (log 2/ ), then T 2 2 −2 −2 2 −1 ρ = exp(κF (log 2/ ) log(1 − κF + κF  (log 1/) )) ≈ exp(−(log 2/2)(1 − 2(log 1/)−1))) ≈ 2/2. 2 2 2 Therefore, E[kxT − x∗k ] ≤  kx∗k as claimed.

Theorem 11. Assume that x∗ = arg min kAx − bk. Given SQ(A),Q(b), there is an algorithm that returns SQ(x) such that kx − x∗k ≤ kx∗k with probability at least 0.99 and   2  6 kAkF (sq(A) + q(b)) 2 8 2 sq(x) = Oe κF 2 2 + κ sq(A) = Oe(κF (sq(A) + q(b))/ ). (18)  minj∈[n] kA∗jk

11 2 2 2 4kAkF κF (log 2/ ) Proof. Let d = 2 2 . By the updating formula (11), we know that yk is at most k-  minj∈[n] kA∗j k sparse. So in the k-th step of the iteration, the calculation of the summation in (11) costs O(kd). At each step of the iteration, we sample a row of A and query the corresponding entry of b. So the overall cost of k-th step is O(kd(sq(A) + q(b))). After convergence, it costs O(T 2d(sq(A) + q(b))) 2 in total, where T = O(κF log(1/)). † Note that xT = A yT and yT is T -sparse. By Lemma8 (taking vi to be rows of A), we can obtain SQφ(xT ) in time O(T sq(A)), where P 2 2 i yT,ikAi∗k φ = T 2 . kxT k 2 2 In AppendixA, we shall show that φ = O(T κ ). By Lemma7, we can obtain SQ(xT ) in time O(φT sq(A)) = O(T 3κ2sq(A)). Therefore, the total cost is O(T 2d(sq(A) + q(b)) + T 3κ2sq(A)). Substituting the values of d, T into it gives rise to the claimed result.

We next use the Kaczmarz method with averaging (9) to reduce the dependence on κF in the above theorem. Similar to (16), we change the updating rule into 1 X 1 X x = x + (˜b − hA˜ |x i)A˜ + hA˜ |I − D |x iA˜ , (19) k+1 k 2 i i∗ k i∗ 2 i∗ i k i∗ i∈Tk i∈Tk

The definition of Di is similar to (12), which depends on i now. In AppendixB, we shall prove that if 4 2 2 kAkF κ log(1/ ) 4 2 d = 2 2 2 = Oe(κF / ), (20)  kAk minj∈[n] kA∗jk 2 2 2 2 then after T = O(κ log(1/)) steps of iteration, we have E[kxT − x∗k ] ≤  kx0 − x∗k . Together with the estimation of φ in AppendixC, we have the following improved result.

Theorem 12. Assume that x∗ = arg min kAx − bk. Given SQ(A),Q(b), there is an algorithm that returns SQ(x) such that kx − x∗k ≤ kx∗k with probability at least 0.99 and 6 2 2 sq(x) = Oe κF κ (sq(A) + q(b))/ . (21)

Proof. From the updating formula (19), the updating of yk is ˜ ˜ † 1 X bi − hAi∗|DiA |yki yk+1 = yk + ei. 2 kAi∗k i∈Tk 2 2 2 2 So yk is at most (kκF /κ )-sparse. Thus the cost in step k is O(kdκF (sq(A) + q(b))/κ ) = 6 2 2 2 6 2 2 Oe(kκF (sq(A) + q(b))/κ  ). After convergence, the total cost is Oe(T κF (sq(A) + q(b))/κ  ) = 6 2 2 2 4 Oe(κF κ (sq(A)+q(b))/ ). By the estimation in AppendixC, we have φ = O(κF κ ), so the cost to 2 2 4 4 6 2 obtain SQ(xT ) is O(φT κF sq(A)/κ ) = Oe(κF κ sq(A)). Hence, the overall cost is Oe(κF κ (sq(A) + q(b))/2).

4.2 Solving symmetric positive definite linear systems

Now suppose A is a symmetric positive definite matrix. Then a simple approach to solve Ax = b is the randomized coordinate descent iteration [15]

hArk∗|xki − brk xk+1 = xk − erk , (22) Ark,rk

12 where the probability of choosing rk is Ark,rk /Tr(A). If we apply the above iteration to the normal equation A†Ax = A†b, we will obtain the Kaczmarz method. As for the iteration (22), it is not hard to show that  1 k [kx − x k2] ≤ 1 − kx − x k2. E k ∗ kA−1kTr(A) 0 ∗

In fact, assume Ax∗ = b, then (22) implies that   |rkihrk|A xk+1 − x∗ = I − (xk − x∗). Ark,rk Thus   2 |rkihrk|A 2 E[kxk+1 − x∗k ] ≤ E I − E[kxk − x∗k ] Ark,rk

A 2 = I − E[kxk − x∗k ] Tr(A)  1  = 1 − [kx − x k2]. kA−1kTr(A) E k ∗ Similar to the idea introduced above to handle Kaczmarz method, we can use the randomized coordinate descent method to solve dense linear systems Ax = b such that A is symmetric positive 2 −1 2 −1 definite. The only difference is that kAkF becomes Tr(A), and kA k becomes kA k. Moreover, the randomized coordinate descent (22) can also be parallelized into

α X hAi∗|xki − bi xk+1 = xk − ei. (23) 2q Ai,i i∈Tk If we take α = q = Tr(A)/kAk, then it converges after O(κ log(1/)) iterations. The proof is the same as that of (9). Therefore, by a similar argument to the proof of Theorem 12, we have

Theorem 13. Let A be symmetric positive definite. Assume that x∗ = arg min kAx − bk. Given SQ(A),Q(b), there is an algorithm that returns SQ(x) such that kx − x∗k ≤ kx∗k with probability at least 0.99 and Tr(A)3kA−1k3κ(sq(A) + q(b)) sq(x) = Oe . (24) 2 The above result is actually more general than Theorem 12. If we consider the normal equation A†Ax = A†b, Theorem 13 implies Theorem 12.

5 Discussion

5.1 Solving row sparse linear systems

When A is row sparse, the inner product in the Kaczmarz method (2) is not expensive to calculate. As a result, we can directly apply the Kaczmarz method to solve the linear system Ax = b. The result is summarized below, and the proof is straightforward.

Theorem 14. Assume that A has row sparsity s. Let x∗ = arg min kAx−bk. Given SQ(A),SQ(b), 2 then there is an algorithm that returns an O(sκF log(1/))-sparse vector x such that kx − x∗k ≤ kx∗k with probability at least 0.99 in time 2 O(sκF log(1/)). (25)

13 Proof. In the Kaczmarz iteration (2), since A has row sparsity s, it follows that the calculation of the inner product hArk∗|xki costs O(s) operations. This is also the cost of the k-th step of iteration. 2 After convergence, the total cost is O(T s), where T = O(κF log(1/)). As for the sparsity of xk, we can choose the initial approximation x0 = 0, then xk is at most sk-sparse.

Similarly, based on the randomized coordinate descent method (22), we have

Theorem 15. Suppose that A is symmetric positive definite and s-sparse. Let x∗ = arg min kAx − bk. Given SQ(A),SQ(b), then there is an algorithm that returns an O(Tr(A)kA−1k log(1/))- sparse vector x such that kx − x∗k ≤ kx∗k with probability at least 0.99 in time

O(sTr(A)kA−1k log(1/)). (26)

Differently from quantum and quantum-inspired algorithms, the outputs of the above two ran- domized algorithms are vectors. It is known that if A has row and column sparsity s and given in the sparse access input model, the currently best known quantum algorithm for matrix inversion costs −1 2 O(skAkmaxkA k log (1/)), where kAkmax refers to the maximal entry in the sense of absolute value [4]. The comparison between this quantum algorithm and the randomized algorithm stated in Theorem 14 may be not fair enough because they are building on different models – the quantum algorithm uses the sparse access model, while the later one assumes access to SQ(A). However, if we only concentrate on the final complexity, the quantum linear solver achieves large speedups if kAkF  kAkmax. This can often happen, especially when the rank of A is large (e.g. A is a sparse unitary). However, when the rank of A is polylog in the dimension, then kAkF = Oe(skAkmax) if A is row and column sparse. In this case, the quantum computer only achieves a quadratic speedup. For a row sparse but column dense matrix A, if it is stored in the QRAM data structure, the 2 complexity to solve Ax = b using a quantum computer is O(κF log (1/)) [4]. This only gives a quadratic speedup in terms of κF over Theorem 14. By Theorem 15, when A is symmetric positive definite, the randomized coordinate descent method finds an approximate solution in time O(Tr(A)kA−1k log(1/)). For simplicity, let us consider the special case that A is a sparse density operator, then Tr(A) = 1. Under certain conditions [12], we can efficiently find a unitary such that A is a sub-block. This means that the complexity of the quantum linear solver for this particular linear system is O(kAkkA−1k log2(1/)), while the algorithm based on Theorem 15 costs O(kA−1k log(1/)). So if kAk = Θ(1), it is possible that there is no quantum speedup based on the current quantum linear solvers. In conclusion, quantum linear solvers can build on many different models, e.g. sparse-access input model, QRAM model, etc. Whenever we have an efficient block-encoding, quantum computers can solve the corresponding linear system efficiently. In comparison, so far there is only one model for quantum-inspired algorithms. The above two theorems indicate that to explore large quantum speedups for sparse problems, it is better to focus on other models rather than the QRAM model. This also leads to the problem of exploring other models for quantum-inspired algorithms.

5.2 Reducing the dependence on kAkF

The Frobenius norm can be large for high-rank matrices. For example, when we use the discretiza- tion method (e.g. Euler method) to solve a linear system of ordinary differential equations, the obtained linear system contains the identity as a sub-matrix. So the Frobenius norm can be as large as the dimension of the problem. The performance of quantum-inspired algorithms is highly affected by the Frobenius norm, so currently they are mainly effective for low-rank systems. To

14 enlarge the application scope, it is necessary to consider the problem of how to reduce or even remove the the dependence on the Frobenius norm in the complexity. To the best of our knowledge, there are no generalizations of the Kaczmarz method such that the total cost is independent of kAkF . However, it can be generalized such that kE[xk − x∗]k ≤ −r k (1 − κ ) kx0 − x∗k and each step of iteration is not expensive, where r ∈ {1, 2}. For example, a simple generalization is 2 kAkF ˜ xk+1 = xk + (br − hAer ∗|xki)Aer ∗. kAk2 k k k 2 2 4 Generally, the relaxation parameter kAkF /kAk  2, so this iterative scheme is not convergent, 2 k 2 that is we usually do not have E[kxk − x∗k ]k ≤ ρ kx0 − x∗k for some ρ < 1. However, we do have −2 k 2 kE[xk − x∗]k ≤ (1 − κ ) kx0 − x∗k. The proof is straightforward. If we set T = O(κ log(1/)), then kE[xT − x∗]k ≤ kx0 − x∗k. There are also some other generalizations [25, 27, 32] such that T = O(κ log(1/)), which is optimal for this kind of iterative schemes. One idea to apply this result 1 PL is to generate L samples xT 1,..., xTL such that kE[xT ] − L l=1 xT lk ≤ kE[xT ]k. The matrix Bernstein inequality might be helpful to determine L. However, to apply this method, we have to 2 find tight bounds for kxT − x∗k and E[kxT − x∗k ], which seems not that easy.

6 Outlook

In this paper, for solving linear systems, we reduced the separation between quantum and quantum- 6 6 4 6 2 2 inspired algorithms from κF : κF κ / to κF : κF κ / . It is interesting to consider whether the parameters of our algorithm can be improved further. For the class of iterative schemes including the Kaczmarz iteration, the optimal convergence rate equals the condition number κ (as opposed 2 to κF as in the Kaczmarz method), which is achieved by the conjugate gradient method. However, there is currently no such result for the Kaczmarz iterative schemes themselves. As discussed in Section 5.2, in [25, 27, 32], it was proved that there are generalizations of the Kaczmarz method −1 T so that kE[xT − x∗]k ≤ (1 − κ ) kx0 − x∗k. This convergence result is a little weak but may be helpful to reduce the separation in terms of κF further.

Acknowledgement

We would like to thank Ryan Mann for useful discussions. This paper was supported by the Quan- tERA ERA-NET Cofund in Quantum Technologies implemented within the European Union’s Hori- zon 2020 Programme (QuantAlgo project), and EPSRC grants EP/L021005/1 and EP/R043957/1. This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No. 817581). No new data were created during this study.

Appendix A Estimation of φ

Using the notation (15), we can decompose the updating rule (14) of y as follows

0 yk+1 = yk + zk + zk, (27) 4To make sure that it is convergent, the relaxation parameter should be chosen from (0, 2). Usually, the optimal choice is 1.

15 where ˜ ˜ † brk − hArk∗|A |yki 0 µk zk = erk , zk = erk . kArk∗k kArk∗k

Denote 2 Z = kAx∗ − bk = min kAx − bk, Λ = diag(kAi∗k : i ∈ [m]). (28) x

In the following, for any two vectors a, b, we define ha|biΛ = ha|Λ|bi. To bound φ, it suffices to 2 2 2 2 0 2 bound kyT kΛ. From (27), it is plausible that kyk+1kΛ ≈ kykkΛ + kzkkΛ + kzkkΛ. In the following, we shall prove that in fact this holds up to a constant factor on average. We shall choose the initial vector as 0 for simplicity, i.e., y0 = 0.

First, we consider the case Z = 0, that is b = Ax∗. In the following, we first fix k, rk and compute the mean value over the random variable D, then we compute the mean value over rk by fixing k. Finally, we calculate the mean value over the random variable k. 2 4kAkF T 2 From now on, we assume that d = 2 2 and T = O(κ log(1/)). By Lemma9,  minj∈[n] kA∗j k F

2 2 2 2 2 2 2 0 2 kAkF kxkk  kxkk  (kxk − x∗k + kx∗k ) ED[kzkkΛ] ≤ 2 = ≤ , d minj∈[n] kA∗jk 4T 2T 0 ˜ ˜  ED[hzk|zkiΛ] = brk − hArk∗|xki ED[µk] = 0, 0 ED[hyk|zkiΛ] = kArk∗khyk|erk iED[µk] = 0.

About the norm of zk, we have

m 2 ˜ ˜ † 2 2 X kArk∗k (brk − hArk∗|A |yki) 2 E[kzkkΛ] = 2 2 kArk∗k kAk kArk∗k rk=1 F m 1 X = (b − hA |x i)2 kAk2 rk rk k F rk=1 2 2 2 2 kbk + kAkF kxkk kbk 2 2 ≤ 2 2 ≤ 2 2 + 4kx∗k + 4kx∗ − xkk . kAkF kAkF

As for the inner product between yk and zk, we have the following estimate

m 2 ˜ ˜ † X kArk∗k (brk − hArk∗|A |yki) 2 E[hyk|zkiΛ] = 2 hyk|erk ikArk∗k kAk kArk∗k rk=1 F m 2 X kAr ∗k = k (b − hA |A†|y i)hy |e i kAk2 rk rk∗ k k rk rk=1 F 2 † 2 hb|ykiΛ − kxkkΛ hx∗|A |ykiΛ − kxkkΛ = 2 = 2 kAkF kAkF 2 hx∗|xkiΛ − kxkkΛ 2 = 2 ≤ kx∗kkxkk + kxkk kAkF 2 2 ≤ 3kx∗k + kx∗kkxk − x∗k + 2kxk − x∗k 7 5 ≤ kx k2 + kx − x k2. 2 ∗ 2 k ∗

16 2 In the above, we used the fact that for any two vectors a, b we have |ha|biΛ| ≤ kΛk |ha|bi|, and 2 2 kΛk ≤ kAkF . Hence, we have

2 2 2 0 2 E[kyk+1kΛ] = E[kykkΛ] + E[kzkkΛ] + E[kzkkΛ] + 2E[hyk|zkiΛ] 2 2 2 2 kbk  2  2 ≤ E[kykkΛ] + 2 2 + (9 + )E[kxk − x∗k ] + (11 + )kx∗k kAkF 2T 2T 2 2 2 kbk  2 ≤ E[kykkΛ] + 2 2 + (20 + )kx∗k , kAkF T

2 2 where we use that E[kxk − x∗k ] ≤ kx∗k by Lemma2. Therefore,  2 2  2 kbk  2 E[kyT kΛ] ≤ T 2 2 + (20 + )kx∗k . kAkF T This means that, with high probability,

2  2 2 2  kyT kΛ 2 kbk  kx∗k 2 φ = T 2 ≤ T 2 2 2 + (20 + ) 2 = O(T ). kxT k kAkF kxT k T kxT k

When Z 6= 0, then b = Ax∗ + c for some vector c of norm Z that is not in the column space of A. This can happen when A is not full rank. Since c is independent of A, we cannot bound it in terms of A. In this case, the only change is E[hyk|zkiΛ], which is now bounded by

hc|E[yk]iΛ 7 2 5 2 E[hyk|zkiΛ] ≤ 2 + kx∗k + kxk − x∗k kAkF 2 2 2 kAk 7 2 5 2 ≤ 2 kckkE[yk]k + kx∗k + kxk − x∗k kAkF 2 2 2 κ Z 7 2 5 2 = 2 kE[yk]k + kx∗k + kxk − x∗k . κF 2 2 At the end, we obtain

2 2 2 2 2 kbk  2 2κ Z E[kyk+1kΛ] ≤ E[kykkΛ] + 2 2 + (20 + )kx∗k + 2 kE[yk]k. kAkF T κF From (27), we know that  AA†  b E[yk+1] = I − 2 E[yk] + 2 . kAkF kAkF This means

k−1 i k−1 X  AA†  b X kbk κ2 kbk k [y ]k = I − ≤ (1 − κ−2)i ≤ F . E k kAk2 kAk2 F kAk2 kAk2 i=0 F F i=0 F F Therefore,

 2 2 2  2 2kbk  2 2κ kbkZ E[kyT kΛ] ≤ T 2 + (20 + )kx∗k + 2 kAkF T kAkF

17  2 2 2 2  2kAx∗k  2 2κ kbkZ + 2Z = T 2 + (20 + )kx∗k + 2 . kAkF T kAkF Finally, by Markov’s inequality, with high probability we have 2  2 2  kyT kΛ 2 2 κ kbkZ + Z φ = T 2 = O T + T 2 2 , kxT k kAkF kx∗k 2 2 2 where we used the fact that xT ≈ x∗ and kAx∗k /kAkF ≤ kx∗k . Note that kAkF kx∗k = −1 2 4 2 kAkF kA bk ≥ kAkF kbk/kAk ≥ Z, so the second term is bounded by O(T κ Z/κF kbk). In 2 2 2 2 the worst case, φ = O(T κ ) because κ Z ≤ κF kbk.

Appendix B Estimation of the convergence rate

For simplicity, we assume that Ax∗ = b. Following from the analysis of [26], the convergence rate does not change too much if the LSP is not consistent. By the updating formula (19), we have 1 X 1 X x − x = x − x + (˜b − hA˜ |x i)|A˜ i + hA˜ |I − D |x i|A˜ i k+1 ∗ k ∗ 2 i i∗ k i∗ 2 i∗ i k i∗ i∈Tk i∈Tk   1 X 1 X = I − |A˜ ihA˜ | (x − x ) + hA˜ |I − D |x i|A˜ i.  2 i∗ i∗  k ∗ 2 i∗ i k i∗ i∈Tk i∈Tk 2 Below, we try to bound E[kxk+1 − x∗k ]. We will follow the notation of (15). But here to avoid any confusion, we denote µik := hA˜i∗|I − Di|xki. 2 2 In the following, the result of Lemma9 will be used, and |Tk| = kAkF /kAk . First, we have  2

X ˜ ˜ X ˜ ˜ ED  hAi∗|I − Di|xki|Ai∗i  = hAi∗|Aj∗iEDi,Dj [µikµjk]

i∈Tk i,j∈Tk X 2 X ˜ ˜ = EDi [µik] + hAi∗|Aj∗iEDi [µik] EDj [µjk] i∈Tk i6=j 4 2 kAkF kxkk ≤ 2 2 . dkAk minl∈[n] kA∗lk By Lemma9, we have *  + 1 X X I − |A˜ ihA˜ | (x − x ) µ |A˜ i ED   2 i∗ i∗  k ∗ ik i∗  i∈Tk i∈Tk *  + 1 X X = I − |A˜ ihA˜ | (x − x ) [µ ]|A˜ i = 0.  2 i∗ i∗  k ∗ ED ik i∗ i∈Tk i∈Tk Hence, after computing the mean value over D, we have   2 4 2 2 1 X ˜ ˜ kAkF kxkk kxk+1 − x∗k ≤ I − |Ai∗ihAi∗| (xk − x∗) + 2 2 . 2 4dkAk minl∈[n] kA∗lk i∈Tk

18 For the first term, we now compute its mean value over the random variable Tk    2

1 X ˜ ˜ E  I − |Ai∗ihAi∗| (xk − x∗)  2 i∈Tk  2 1 X = hx − x | I − |A˜ ihA˜ | |x − x i. k ∗ E  2 i∗ i∗   k ∗ i∈Tk

And it can be shown that  2 1 X I − |A˜ ihA˜ | E  2 i∗ i∗   i∈Tk   3 X 1 X = I − |A˜ ihA˜ | + |A˜ ihA˜ |A˜ ihA˜ | E  4 i∗ i∗ 4 i∗ i∗ j∗ j∗  i∈Tk i,j∈Tk,i6=j †  † 2 3q A A 1 2 A A = I − 2 + (q − q) 2 4 kAkF 4 kAkF †  † 2! 3q σmin(A A) 1 2 σmin(A A)  1 − 2 + (q − q) 2 I 4 kAkF 4 kAkF  1   1 − I. 2κ2

2 2 2 In the last step, we used the result q = kAkF /kAk so that the second term is 3/4κ and the third term is less than 1/4κ4 ≤ 1/4κ2. Therefore,

4 2 2 1 2 kAkF E[kxkk ] E[kxk+1 − x∗k ] ≤ (1 − 2 )E[kxk − x∗k ] + 2 2 . 2κ 4dkAk minl∈[n] kA∗lk

Now set T = O(κ2 log(2/2)), and

4 kAkF T d = 2 2 2 . (29)  kAk minl∈[n] kA∗lk

Then 1 2 [kx − x k2] ≤ (1 − ) [kx − x k2] + [kx k2] E k+1 ∗ 2κ2 E k ∗ 4T E k 1 2 2 ≤ (1 − + ) [kx − x k2] + kx k2. 2κ2 2T E k ∗ 2T ∗ Finally, we obtain

1 2 2 [kx − x k2] (1 − + )T kx − x k2 + kx k2 ≤ 2kx k2. E T ∗ . 2κ2 2T 0 ∗ 2 ∗ ∗

19 Appendix C Estimation of φ for Kaczmarz method with averaging

The calculation here is similar to that in AppendixA. For simplicity, denote 1 1 y = y + w + w0 , k+1 k 2 k 2 k where

X bi − hAi∗|xki 0 X µik wk := 2 ei∗, wk := ei∗. kAi∗k kAi∗k i∈Tk i∈Tk

In this section, d is given in formula (29). From a similar estimation in AppendixA, we have 0 0 ED[hwk|wkiΛ] = ED[hyk|wkiΛ] = 0 and

2 2 2 2 2 2 2 2 0 2 kAkF kAkF kxkk  kxkk  (kxk − x∗k + kx∗k ) ED[kwkkΛ] ≤ 2 2 = ≤ . kAk d minj∈[n] kA∗jk 4T 2T

2 As for kwkkΛ, we still have   2 X bi − hAi∗|xki bj − hAj∗|xki 2 E[kwkkΛ] = E  2 2 hei∗|ej∗ikAi∗k  kAi∗k kAj∗k i,j∈Tk 2  2  kAkF (bi − hAi∗|xki) = 2 E 2 kAk kAi∗k kb − Ax k2 = k kAk2 2kbk2 ≤ + 2kx k2. kAk2 k

By the estimation of E[hyk|zkiΛ] in AppendixA and noting that kΛk ≤ kAk, we also have

2 kAkF bi − hAi∗|xki 7 2 5 2 E[hyk|wkiΛ] = 2 E[hyk| 2 ei∗i] ≤ kx∗k + kxk − x∗k . kAk kAi∗k 2 2

All the estimations above do not change. The constant 1/2 in the decomposition of yk+1 does not affect the upper bound, so when Z = 0 we have

 2 2  2 kbk  2 E[kyT kΛ] ≤ T 2 2 + (20 + )kx∗k , kAkF T where T = Oe(κ2). Therefore

2 2 2  2 2 2  κF kyT kΛ 2 κF kbk  kx∗k 2 2 φ = T 2 2 ≤ T 2 2 2 2 + (20 + ) 2 = O(κF κ ). κ kxT k κ kAkF kxT k T kxT k

2 4 When Z 6= 0, we similarly have φ = O(κF κ ).

20 References

[1] Juan Miguel Arrazola, Alain Delgado, Bhaskar Roy Bardhan, and Seth Lloyd. Quantum- inspired algorithms in practice. Quantum, 4:307, 2020. arXiv:1905.10415. [2] Jacob Biamonte, Peter Wittek, Nicola Pancotti, Patrick Rebentrost, Nathan Wiebe, and Seth Lloyd. Quantum machine learning. Nature, 549(7671):195–202, 2017.

[3] Yair Censor, Stavros Andrea Zenios, et al. Parallel optimization: Theory, algorithms, and applications. Oxford University Press on Demand, 1997.

[4] Shantanav Chakraborty, Andr´asGily´en,and Stacey Jeffery. The Power of Block-Encoded Matrix Powers: Improved Regression Techniques via Faster Hamiltonian Simulation. In 46th International Colloquium on Automata, Languages, and Programming (ICALP 2019), pages 33:1–33:14, 2019. arXiv:1804.01973. [5] Nadiia Chepurko, Kenneth L Clarkson, Lior Horesh, and David P Woodruff. Quantum-inspired algorithms from randomized numerical linear algebra. arXiv preprint arXiv:2011.04125, 2020.

[6] Nai-Hui Chia, Andr´asGily´en,Tongyang Li, Han-Hsuan Lin, Ewin Tang, and Chunhao Wang. Sampling-based sublinear low-rank matrix arithmetic framework for dequantizing quantum machine learning. In Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing (STOC 2020), pages 387–400, 2020. arXiv:1910.06151. [7] Nai-Hui Chia, Tongyang Li, Han-Hsuan Lin, and Chunhao Wang. Quantum-inspired clas- sical sublinear-time algorithm for solving low-rank semidefinite programming via sampling approaches. 2019. arXiv:1901.03254. [8] Andrew M Childs, Robin Kothari, and Rolando D Somma. Quantum algorithm for systems of linear equations with exponentially improved dependence on precision. SIAM Journal on Computing, 46(6):1920–1950, 2017.

[9] Petros Drineas, Ravi Kannan, and Michael W Mahoney. Fast Monte Carlo algorithms for matrices II: Computing a low-rank approximation to a matrix. SIAM Journal on computing, 36(1):158–183, 2006.

[10] Andr´asGily´en,Seth Lloyd, and Ewin Tang. Quantum-inspired low-rank stochastic regression with logarithmic dependence on the dimension. 2018. arXiv:1811.04909. [11] Andr´asGily´en,Zhao Song, and Ewin Tang. An improved quantum-inspired algorithm for . 2020. arXiv:2009.07268. [12] Andr´asGily´en,Yuan Su, Guang Hao Low, and Nathan Wiebe. Quantum singular value transformation and beyond: exponential improvements for quantum matrix arithmetics. In Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing, pages 193–204, 2019.

[13] Gene H. Golub and Charles F. van Loan. Matrix Computations. The Johns Hopkins University Press, Baltimore, fourth edition, 2013.

[14] Richard Gordon, Robert Bender, and Gabor T Herman. Algebraic reconstruction techniques (art) for three-dimensional electron microscopy and x-ray photography. Journal of theoretical Biology, 29(3):471–481, 1970.

21 [15] Robert M Gower and Peter Richt´arik.Randomized iterative methods for linear systems. SIAM Journal on Matrix Analysis and Applications, 36(4):1660–1690, 2015.

[16] Aram W Harrow, Avinatan Hassidim, and Seth Lloyd. Quantum algorithm for linear systems of equations. Phys Rev Lett, 103(15):150502, 2009. arXiv:0811.3171.

[17] Dhawal Jethwani, Franccois Le Gall, and Sanjay K Singh. Quantum-inspired classical algo- rithms for singular value transformation. 2019. arXiv:1910.05699.

[18] S Karczmarz. Angenaherte auflosung von systemen linearer glei-chungen. Bull. Int. Acad. Pol. Sic. Let., Cl. Sci. Math. Nat., pages 355–357, 1937.

[19] Iordanis Kerenidis, Jonas Landman, Alessandro Luongo, and Anupam Prakash. q-means: A quantum algorithm for unsupervised machine learning. In Advances in Neural Information Processing Systems, pages 4134–4144, 2019.

[20] Iordanis Kerenidis and Anupam Prakash. Quantum Recommendation Systems. In 8th In- novations in Theoretical Computer Science Conference (ITCS 2017), pages 49:1–49:21, 2017. arXiv:1603.08675.

[21] Iordanis Kerenidis and Anupam Prakash. A quantum interior point method for LPs and SDPs. ACM Transactions on , 1(1):1–32, 2020.

[22] Dennis Leventhal and Adrian S Lewis. Randomized methods for linear constraints: convergence rates and conditioning. Mathematics of Operations Research, 35(3):641–654, 2010.

[23] Lin Lin and Yu Tong. Optimal polynomial based quantum eigenstate filtering with application to solving quantum linear systems. Quantum, 4:361, 2020.

[24] Seth Lloyd, Masoud Mohseni, and Patrick Rebentrost. Quantum principal component analysis. Nature Physics, 10(9):631–633, 2014. arXiv:1307.0401.

[25] Nicolas Loizou and Peter Richt´arik. Momentum and stochastic momentum for stochastic gradient, newton, proximal point and subspace descent methods. Computational Optimization and Applications, 77(3):653–710, 2020.

[26] Jacob D Moorman, Thomas K Tu, Denali Molitor, and Deanna Needell. Randomized Kaczmarz with averaging. 2020. arXiv:2002.04126.

[27] Ion Necoara. Faster randomized block Kaczmarz algorithms. SIAM J Matrix Anal Appl, 40(4):1425–1452, 2019. arXiv:1902.09946.

[28] Arkadi Nemirovski, Anatoli Juditsky, Guanghui Lan, and Alexander Shapiro. Robust stochas- tic approximation approach to stochastic programming. SIAM J Optim, 19(4):1574–1609, 2009.

[29] Yu Nesterov. Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM J Optim, 22(2):341–362, 2012.

[30] Davide Orsucci and Vedran Dunjko. On solving classes of positive-definite quantum lin- ear systems with quadratically improved runtime in the condition number. arXiv preprint arXiv:2101.11868, 2021.

22 [31] Patrick Rebentrost, Masoud Mohseni, and Seth Lloyd. Quantum support vector machine for big data classification. Phys Rev Lett, 113(13):130503, 2014. arXiv:1307.0471.

[32] Peter Richt´arikand Martin Tak´ac. Stochastic reformulations of linear systems: algorithms and convergence theory. SIAM Journal on Matrix Analysis and Applications, 41(2):487–524, 2020.

[33] Changpeng Shao and Hua Xiang. Row and column iteration methods to solve linear systems on a quantum computer. Phys Rev A, 101(2):022322, 2020. arXiv:1905.11686.

[34] Thomas Strohmer and Roman Vershynin. A randomized kaczmarz algorithm with exponential convergence. J Fourier Anal Appl, 15(2):262, 2009. arXiv:math/0702226.

[35] Ewin Tang. Quantum-inspired classical algorithms for principal component analysis and su- pervised clustering. 2018. arXiv:1811.00414.

[36] Ewin Tang. A quantum-inspired classical algorithm for recommendation systems. In Proceed- ings of the 51st Annual ACM SIGACT Symposium on Theory of Computing (STOC 2019), pages 217–228, 2019. arXiv:1807.04271.

23