Arxiv:1303.6034V2 [Cs.MS] 15 Apr 2013 Prtos M N Prlbaisaeue.A Extension an ﬂoating-Poin Used

Home , Erable

arXiv:1303.6034v2 [cs.MS] 15 Apr 2013 prtos M n PRlbaisaeue.A extension An floating-poin used. basic are ZKCM For libraries library MPFR decompositio Hermitian- and transform. GMP the singular-value Fourier operations, discrete decomposition, the the LU and diagonalization, the t e.g., product, matrix physics, inner opera matrix of tracing-out the the provides studies product), (Kronecker numerical It product tensor for useful computation. a operations matrix for multiprecision demand for syntax. potential user-friendly method: usable a a Solution computation with matrix purpose is this on There for focusing resul library simulation programming physics. of numerical accuracy the in evaluate and/or guarantee problem: of Nature later or 3.0.0 Ver. routines/libraries: External Computing Quantum information Quantum Classification: state, product matrix dependent Keywords: instance problem the etc. RAM: OS, Windows on Cygwin system: Operating Computer: language: Programming 3 provisions: Licensing identifier: Catalogue Reference: Journal Title: Program informati quantum in Authors: applications with computation matrix Title: Manuscript fe sdi ueia iuain nqatmpyis t extensio Its physics. quantum in simulations matr conven numerical and multiprecision in syntax of easy-to-use used an purpose often provides the It for libraries. developed MPFR library and C++ a is ZKCM Abstract -aladdress: E-mail rpitsbitdt optrPyisCommunications Physics Computer to submitted Preprint bu h irre ihpatclsml programs. Keywords: sample practical with libraries the about matrix-product-st time-dependent the using computing quantum KM + irr o utpeiinmti optto w computation matrix multiprecision for library C++ a ZKCM: a RGA SUMMARY PROGRAM ∗ unu nomto cec hoyGop ainlInstit National Group, Theory Science Information Quantum orsodn author. Corresponding eea eabts-svrlgg ye,dpneton dependent bytes, giga several - bytes mega Several kr SaiToh Akira eea computers General utpeiincmuig ieragba Time- algebra, Linear computing, Multiprecision utpeiincmuig ieragba iedpnetmti pro matrix time-dependent algebra; linear computing; multiprecision [email protected] Chsas endvlpd hc employs which developed, been also has QC ZKCM . ierEutosadMtie,4.15 Matrices, and Equations Linear 4.8 + irr KMhsbe developed been has ZKCM library C++ A nxlk ytm,sc sLnx reBSD, Free Linux, as such systems, Unix-like KM + irr o multiprecision for library C++ a ZKCM: utpeiincmuaini epu to helpful is computation Multiprecision N esrGnrlPbi ies Ver. License Public General Lesser GNU C++ N P(M)[] PR[2] MPFR [1], (GMP) MP GNU unu information quantum kr SaiToh Akira tion, on he t fIfrais -- iosbsi hyd,Tko101 Tokyo Chiyoda, Hitotsubashi, 2-1-2 Informatics, of ute n, ts t 2 .Fousse L. Library,[2] Arithmetic Precision Multiple GNU The [1] References CPU. aforementioned the using precision 512-bit 4 a by described system 2 quantum for directory spectrum the DFT in time: placed and Running are “doc/html” directories programs “samples”. the i Sample explained in is found “doc/latex”. function manual Each reference package. a the of matrix “doc” of directory kind comments: slower times any Additional thousand for a computation computation. often is double-precision precision than bit thousand half Restrictions: simula to computing. method quantum matrix-product-state time-dependent the r epu nti ead mn hm h irr named library the which them, e.g., computing, Among high-precision libraries, regard. for this 7], programming in 6, helpful several 5, are af- 4, are 3, considerably phenom- There [2, Refs. physical elements investigated [1]. matrix for ena results or er- numerical rounding variables fect when in physics simulation rors in concern serious Introduction 1. ih25Gzo oeCUfeuny ttkstret five to three 100 takes It a CPU Intel frequency. diagonalize or CPU to AMD more recent or minutes use GHz we 2.5 when with precision 256-bit for t iuainmto.Ti ae ie nintroduction an gives paper This method. simulation ate rcso fflaigpitoeain ssmtmsof sometimes is operations floating-point of Precision http://gmplib.org/. 3(07 3 http://www.mpfr.org/. Softw Math. 13, Trans. (2007) ACM 33 rounding, correct with library point a, etfntosfrmti aiuain nldn those including manipulations matrix for functions ient ∗ irr,ZKCM library, n xcmuain ntebsso h N MP GNU the of basis the on computation, ix tal. et utpeiincmuainwt oeta a than more with computation Multiprecision ttksls hntit eod ooti a obtain to seconds thirty than less takes It PR utpepeiinbnr floating- binary multiple-precision A MPFR: , utsae unu information quantum state; duct 16 aapit fatm vlto fa of evolution time a of points data srsmna spae nthe in placed is manual user’s A C sdvlpdfrsimulating for developed is QC, × t plctosin applications ith 0 emta arxfor matrix Hermitian 100 × arxHamiltonian matrix 4 83,Japan -8430, uy3,2018 30, July are te n ZKCM [7], which we have been developing, is a C++ li- trigonometric functions are defined for the class. The lat- brary for multiprecision complex-number matrix computa- ter class is a class of a matrix. Standard operations and tion. It provides several functionalities including the LU functions like matrix inversion are defined. In addition, the decomposition, the singular value decomposition, the ten- singular-value decomposition of a general matrix, the di- sor product, and the tracing-out operation and an easy- agonalization of an Hermitian matrix, the discrete Fourier to-use syntax for basic operations. It is based on the GNU transform, etc., are defined for the class. Functoins for the MP (GMP) [8] and MPFR [9] libraries, which are com- tensor product (Kronecker product) and the tracing-out monly included in recent distributions of UNIX-like sys- operation (e.g., one can trace out the subsystem B of systems. tem ABC) are also defined. A detailed document is placed There is an extension library named ZKCM QC. This in the “doc” directory of the package of ZKCM. library is designed for simulating quantum computing [10, As for installation of the library, the standard process 11] by the time-dependent matrix-product-state method “./configure make sudo make install” works in [12] [or, simply referred to as the matrix-product-state most cases; otherwise→ the→ document should be consulted. (MPS) method]. It uses a matrix product state [13, 14, 12] We will next look at simple examples to demonstrate to represent a pure quantum state. The MPS method is re- the programming style using the library. (Here, the latest cently one of the standard methods for simulation-physics stable version, ver. 0.3.2, is used.) software [15]. As for other methods effective for simulating quantum computing, see, e.g., Refs. [16, 17]. With ZKCM QC, one may use quantum gates in U(2), U(4), 2.1. Program examples and U(8) as elementary gates. Indeed, in general, quan- Example 1. There is a classical example [20] often men- tum gates in U(2) and U(4) are enough for universal quan- tioned to recall the importance of multiprecision compu- tum computing [18], but we regard quantum gates in U(8) tation. It clarifies that a sufficiently large precision is re- also as elementary gates so as to reduce computational quired even for a simple linear equation with only two overheads in circuit constructions. variables. A simulation of quantum computing with MPS is known Consider the linear equation for its computational efficiency in case the Schmidt ranks are kept small during the simulation [12, 19]. Even for x 1 the case slightly large Schmidt ranks are involved, it is not A = y 0 as expensive as a simple simulation. The theory of MPS simulation will be briefly explained in Sec. 4.1. Numeri- with cal errors will be phenomenologically discussed in Sec. 4.3, which will give a certain reason why we introduce multi- 64919121 159018721 A = . precision computation for an MPS simulation of quantum 41869520.5 −102558961 computing. − This contribution is intended to provide a useful intro- The exact solution is x = 205117922,y = 83739041. duction for programming with the libraries. Section 2 de- One way to numerically find the solution is to com- t 1 t scribes two examples of the use of the ZKCM library: one pute (x, y) = A− (1, 0) using the matrix inversion. An- for showing the precision dependence of a solution of a other way is to utilize the Gaussian elimination (see, e.g., simple linear equation; the other for simulating an NMR Refs. [21, 22]). We used functions “inv” and “gauss” of spectrum in a simple model. Performance evaluation of ZKCM for the former and the latter ways, respectively (as the library is made in Sec. 3. Section 4 shows an overview for pivoting, a partial pivoting is used in “gauss”). Then, of the theory of the MPS method and an example of sim- we plotted the computed value of x against the precision ulating a simple quantum circuit using the ZKCM QC li- [bits] as shown in Fig. 1. (Here, we refer to the number of brary. In addition, later in the section numerical errors in bits for a significand of a floating-point number as preci- an MPS simulation of quantum computing are examined sion.1) We can see that low-precision computing resulted using a certain setup of a quantum algorithm. We discuss in a wrong answer while high-precision computing resulted on the effectiveness and the performance of our libraries in a correct answer. It should be emphasized that the dou- in Sec. 5. Section 6 summarizes our achievements. ble precision (namely, 53 bits for the mantissa portion of a floating-point number) is not enough in this example. 2. ZKCM Library 1Precision is defined as an exact length of the mantissa portion The ZKCM library is designed for general-purpose ma- of a floating-point number in MPFR while it is defined as a least trix computation. This section concentrates on its main length in GMP [9]. In ZKCM, a complex number keeps each part library. It has two major C++ classes: “zkcm_class” internally as an MPFR variable so that precision is an exact length. However, it should be noted that an output from a function usually and “zkcm_matrix”. The former class is a class of a com- carries over the best precision among those of involved variables in plex number. Many operators like “+=” and functions like ZKCM. 2 3.0×108 }

2.5×108 Listing 1: classic example.cpp

8 2.0×10 The program is compiled and executed in a standard way.2 1.5×108 Its output file “classic example.dat” can be visualized by Exact solution 8 Gnuplot [23] using the file “classic example.gnuplot” found x 1.0×10 Computed value (i) Computed value (ii) in the “samples” directory. 5.0×107 In the program, one can see typical fea- 0.0×100 tures of the ZKCM library. The line “zkcm_set_default_prec(prec);” sets the default -5.0×107 internal precision (the default bit length of a significand) -1.0×108 40 50 60 70 80 90 100 110 120 of an object to “prec”. Any object like “A” (this is a Precision [bit] 2 2 matrix) constructed without specified precision possesses× the default precision. It is possible to specify a Figure 1: Plot of the computed values of x against the precision (the particular precision (i.e., in case of a matrix object, the bit length of a mantissa) employed for floating-point numbers. Ei- length of the significand for each part of each element ther (i) the matrix inversion (pluses) or (ii) the Gaussian elimination (circles) was used. For case (ii), there are data points with x values of the matrix), say, 512, by constructing the object < −1.0 × 108 for some values of precision < 50; they are outside the as “zkcm_matrix A(2, 2, 512)”, for instance. For range in the figure. a matrix object, say, “A”, it is convenient to access its (i, j) element by “A(i,j)” that is a reference to the element. The element is an object in the type of The program used for this example is shown in Listing 1. “zkcm_class”. To assign a value into a “zkcm_class” It is found as “samples/classic example.cpp” in the pack- object (a complex number) “z”, one can write “z=...;” age. It is written to use the matrix inversion. To use the where the right-hand side can be a number or a string Gaussian elimination, the line “B = inv(A) * B;” should describing a complex number in the style "___+___*I" be replaced with “B = gauss(A, B);”. (PARI/GP style [24]; here, “I” stands for i = √ 1), "___+___i", "___+___j" (here, “j” stands for i),− etc. (there are other acceptable styles). In the program, by #include "zkcm.hpp" #include “A(0, 0) = "64919121";” the value 64919121 + 0i is int main() assigned to the (0, 0) element of matrix A. Other values { are assigned to corresponding elements in a similar way. std::ofstream ofs("classic_example.dat"); The solution of the linear equation is obtained by using if (!ofs) { the function “inv” or “gauss”. The obtained result std::cerr << "Could not create the file\ is written into the output file stream “ofs” by using classic_example.dat" the operator “<<”. In the program, the elements are << std::endl; individually written out so as to meet the data format of return -1; } a data-plotting program. After computation is performed, ofs << "# prec x y" << std::endl; the program should be terminated. At this stage, it is for (int prec = 32; prec < 257; prec++) recommended to write “zkcm_quit();” so as to release { zkcm_set_default_prec(prec); miscellaneous memories allocated for internal use of zkcm_matrix A(2, 2), B(2, 1); background library functions.

A(0, 0) = "64919121"; Example 2. As the second example, a sample program A(0, 1) = "-159018721"; “NMR spectrum simulation.cpp” found in the “samples” A(1, 0) = "41869520.5"; A(1, 1) = "-102558961"; directory of the package of ZKCM is explained. This program generates a simulated free-induction-decay (FID) B(0, 0) = 1; spectrum of liquid-state NMR for the spin system consist- B(1, 0) = 0; ing of a proton spin with precession frequency w1 = 400 13 B = inv(A) * B; MHz (variable “w1” in the program) and a C spin with //Replace the above line with precession frequency w2 = 125 MHz (variable “w2”) at //B = gauss(A, B); room temperature (300 K) (variable “T”). A J coupling //to use the Gaussian elimination instead. constant J12 = 140 kHz (variable “J12”) is considered for ofs << prec << " " the spins. << B(0, 0) << " " << B(1, 0) << std::endl; } 2To make an executable file, the library flags typically “-lzkcm -lm zkcm_quit (); -lmpfr -lgmp -lgmpxx” are required. As for ZKCM QC, additionally return 0; “-lzkcm qc” should be specified. 3 The first line of the program is to include a header file zkcm_matrix X1(4, 4), Yhpi1(4, 4); of ZKCM: X1 = tensorprod(X, I); Yhpi1 = tensorprod(Yhpi, I); #include "zkcm.hpp" int main(int argc, char *argv[]) To get an FID data, we firstly tilt the proton spin by the { ideal Y90 pulse.

In the beginning of the “main” function, the default pre- rho = Yhpi1 * rho * adjoint(Yhpi1); cision is set to 280 bits by Now the data of time evolution of of the proton zkcm_set_default_prec(280); spin under the Hamiltonian H is recorded for the time duration N dt using In the subsequent lines, some matrices like Pauli matrices × I and X, etc. are generated. For example, X is generated array = rec_evol(rho, H, X1, dt, N); as We now use a zero-padding for this “array” so as to en- zkcm_matrix X = "[0, 1; 1, 0]"; hance the resolution. This will extend the array by N using a string representing a matrix in the PARI/GP style. zeros. The Y pulse is generated as 90 array2 = zero_padding(array, 2*N); zkcm_matrix Yhpi(2,2); Yhpi(0,0) = sqrt(zkcm_class(0.5)); To obtain a spectrum, the discrete Fourier transform is Yhpi(0,1) = sqrt(zkcm_class(0.5)); applied. Yhpi(1,0) = -sqrt(zkcm_class(0.5)); Yhpi(1,1) = sqrt(zkcm_class(0.5)); array2 = abs(DFT(array2)); Other matrices are generated by similar lines. After this, This “array2” is output to the file “example zp.fid” as an FID spectrum data with df = 1/(2 N dt) as the values of constants and parameters are set. For example, × × the Boltzmann constant kB [J/K] is generated as frequency interval, in the Gnuplot style by zkcm_class kB("1.3806504e-23"); GP_1D_print(array2, 1.0/dt/zkcm_class(2*N), 1, "example_zp.fid"); The Hamiltonian H of the spin system is generated in the type of “zkcm_matrix” and is specified as At last, the function “main” ends with “zkcm_quit();” and “return 0;”. The program is compiled and executed H = w1 * tensorprod(Z/2,I) + w2 * tensorprod(I,Z/2) in a standard way. + J12 * tensorprod(Z/2,Z/2); The result stored in “example zp.fid” is visualized by This is used to generate a thermal state ρ: Gnuplot using the file “example zp.gnuplot” placed in the directory “samples”, as shown in Fig. 2. zkcm_matrix rho(4,4); rho = exp_H((-hplanck/kB/T) * H); 0.6 rho /= trace(rho); Here, “exp_H” is a function to calculate the exponential of 0.5 an Hermitian matrix and “hplanck” is the Planck constant 34 0.4 (6.62606896 10− Js). The sampling time interval dt × to record the value of for the proton spin is set 0.3 to 0.43/w1 (any number smaller than 1/2 might be fine instead of 0.43) by 0.2 zkcm_class dt(zkcm_class("0.43")/w1); 0.1 The number of data to record is then decided as 0 398.5 399.0 399.5 400.0 400.5 401.0 401.5 int N = UNP2(8.0/dt/J12); Frequency [MHz] Here, function “UNP2” returns the integer upper nearest power of 2 for a given number. Now arrays to store data Figure 2: Plot of the simulation data stored in “example zp.fid”. See are prepared as row vectors. the text for the details of this simulation. zkcm_matrix array(1, N), array2(1, N); It should be noted that we have not employed The following lines prepare the X and Y90-pulse operators a high-temperature approximation. Under a high- acting only on the proton spin. temperature approximation, the first order deviation βH − 4 of exp( βH) (here, β = h/(kBT )) is considered as a devi- number m is small: O(m) rational-number operations are ation density− matrix and calculations are performed using enough3 (we make use of the C++ class “mpq_class” of the normalized deviation density matrix H/const. This GMP for internally handling rational numbers). approximation is commonly used [28] but it− cannot be used Now, the speed of computing Γ(z) using “gamma” is eval- for simulations for low temperature. An advantage of us- uated for z = a + bi with random a,b [ 10, 10]. It is ing ZKCM for simulating NMR spectra is that the tem- compared with the function with the same∈ name,− “gamma”, perature can be chosen. This is possible because of high of the PARI library. The PARI library uses a similar al- accuracy in computing the exponential of a Hamiltonian. gorithm to compute Γ(z). The main difference is that As we have seen in this example, ZKCM has useful func- PARI can utilize cached values of Bernoulli numbers that tions to handle matrices, especially those often used in are once calculated in any function during a process is simulations in quantum physics. For the list of available running. It can also use precomputed values of typical functions, see the reference manual found in the package. constants. As shown in Table 1, the average computation time of Γ(z) in the ZKCM library is on the same order of magni- 3. Performance evaluation tude as that in the PARI library in the absence of cached We evaluate the performance of ZKCM (version 0.3.2) values of Bernoulli numbers. However, in the presence in comparison with PARI [24] (version 2.5.3 with the of cached values of Bernoulli numbers, the computation GMP kernel), a conventional highly-evaluated C library speed of Γ(z) in PARI is quite much faster. for mathematical computing. 3.2. Speed of matrix multiplication 3.1. Speed of computing the Γ function Here, we evaluate the speed of matrix multiplication. We first generate a 100 100 random Hermitian matrix A Here, we evaluate the performance of the function × satisfying √TrAA† = 1. We then continue to compute the “gamma” of ZKCM, which calculates Γ(z) for a complex 2 operations (i) A A and (ii) A A/√TrAA† sequen- number z. It is implemented on the basis of the expansion tially, i.e., (i) (ii)← (i) (ii) ... from left← to right. Operations using the Stirling formula for Re z 0: ≥ are performed under a specified precision “prec”. We compare time consumption to perform ten sets of “(i) (ii)”. L n 1 1 2n z z 1 ( 1) − B2n z − Time spent for the matrix preparation is not included. As Γ(z) e− z − √2πz exp − | | + RL , ≃ 2n(2n 1) shown in Table 2, computation time is on the same or- "n=1 − # X der of magnitude for ZKCM and PARI and discrepancy where Bm is the first or the second Bernoulli number (as in time consumption is not large. This result is plausible the absolute value is used, either is fine); L is a sufficiently since both the libraries employ a straight-forward way to large (but not very large) integer to guarantee the accu- implement matrix multiplication and rely on the GMP li- racy; RL is the remainder of the sum (corresponding to the brary for the speed of primitive operations. (ZKCM uses sum of the terms for n = L +1,..., ). It is known that MPFR for basic operations that is based on GMP; hence ∞ the expansion is not convergent since B2n grows rapidly ZKCM indirectly uses primitive operations of GMP.) for large n. It was, however, proved| by Spira| [25] that B2L 1 2L 3.3. Speed of Hermitian-matrix diagonalization RL 2| L 1| z − holds when Re z 0. Thus the ex- |pansion| ≤ is− usable| | for computation with≥ enough accuracy We next evaluate the speed of diagonalization of an as long as z is appropriately scaled to a large value in Hermitian matrix for finding all the eigenvectors. The advance. We internally scale z to satisfy Re z 0 and ≥ 3 z 103 210. In accordance with this scaling, L is ap- For other use than computing Γ(z), we cannot use this as- | |≥ ∼ sumption in general. In this case, Akiyama-Tanigawa algorithm [26] propriately chosen so that RL becomes small enough for is employed to calculate the Bernoulli number. This takes O(m2) required precision. For the scaling, the well-known formu- rational-number operations. lae Γ(z) = Γ(z + k)/[z(z + 1) (z + k 1)] (here, k is 4 ··· − This was because the improvement in running time due to set- an integer) and Γ(z)= π/[ zΓ( z) sin( πz)] (here, z is ting compiler optimization flags was negligible. (Of course, there not a pole) are utilized.− To− compute− Bernoulli− numbers, were other compiler flags for library specifications etc. which are not of our concern in this context.) For instance, a program to find we utilize the relation eigenvectors of a normalized random 50 × 50 Hermitian matrix using n 1 the function “diag H” of ZKCM (“eigen” of PARI) with 768-bit pre- − n Bk cision consumed 34.558 (14.550) seconds on average when compiled Bm =1 − k n k +1 without optimization flags, and consumed 34.314 (14.551) seconds k=0 − on average when compiled with the optimization flags “-O3 -fno- X strict-aliasing -fomit-frame-pointer”. (Here, we used the Frobenius and the fact that B2j+1 vanishes for integer j 1. We keep norm for normalization.) The average was taken over 100 trials. The the computed values of binomial coefficients and≥ Bernoulli standard deviations were 0.0648 (0.143) and 0.0645 (0.122), respec- numbers inside a certain subblock of a function for com- tively. The compiler was GCC ver. 4.6.3 installed by default on the operating system Fedora 15 64-bit. A computer with the Intel Core puting Γ(z). Therefore, the cost to compute Bm assum- i5 M460 2.53 GHz CPU (maximum clock frequency 2.80GHz) and ing that we have already computed B0,...,Bm 2 for even 4GB RAM was used. − 5 Table 2: Comparison of the average real time consumption to perform ten sets of operations (i) and (ii) (see the text). The average was taken over ten different A’s (again, see the text). The standard deviation is shown in parentheses in small fonts. “prec” stands for the precision (the bit length of a significand). Environments: Fedora 15 OS on Intel Core i5 M460 CPU with 4GB RAM for upper entries () and Fedora 16 OS on AMD FX-8120 CPU with 16GB RAM for lower entries (♣).

Table 1: Comparison of the average real time consumption for com- prec ZKCM [sec] PARI [sec] z z a bi a, b ∈ − , 26.1 (0.115) 15.0 (0.204) puting Γ( ) for = + with random [ 10 10]. The average 256 was taken over 1,000 trials. No I/O operation was involved. Time ♣21.4 (0.248) 13.0 (0.567) for generating z was negligible (this was no more than 2.0 × 10−5 35.5 (0.206) 21.9 (0.0208) seconds). The standard deviation (we employ the sample standard 512 deviation) for each entry is shown in parentheses in small fonts un- 27.3 (0.202) 18.7 (0.188) der the entry. “prec” stands for the precision (the bit length of 46.4 (0.161) 30.8 (0.0702) 768 each mantissa) employed in a program. In a program using PARI, 36.9 (0.266) 28.1 (0.247) function “nbits2prec” was used to convert the precision into the 65.8 (0.115) 50.5 (0.199) number of code words that is defined as the precision for variables in 1024 PARI. ZKCM version 0.3.2 and PARI version 2.5.3 were used. They 53.2 (0.178) 41.3 (0.434) were compiled with GCC/G++ with their default optimization flags, 86.5 (0.0265) 67.8 (0.118) while test programs for this performance evaluation were compiled 1280 4 69.3 (0.181) 54.9 (0.715) without optimization flags. In this table, the middle and the right columns show the results for PARI with and without precomputed values of Bernoulli numbers, respectively. Environments: For upper entries (): Fedora 15 64-bit operating system on Intel Core i5 M460 CPU (2.53 GHz, Max. 2.80GHz) with 4GB RAM. For lower functions used for this task are “diag_H” of ZKCM and entries (♣): Fedora 16 64-bit operating system on AMD FX-8120 “eigen” and “jacobi” of PARI. CPU (3.10GHz, Max. 4.00GHz) with 16GB RAM. Let us briefly explain our design of “diag_H”. In the prec ZKCM [sec] PARI with- PARI with function, the routine to diagonalize an Hermitian matrix e.g. out cache cache [sec] A employs a standard QR method (see, , Refs. [31, 32]) [sec] to find eigenvalues, for which Householder reflections and 3 3 4 Wilkinson shifts are utilized. The eigenvalues λi are sorted 3.73 10− 1.38 10− 1.63 10− × − × − × − from larger to smaller. Then each corresponding eigenvec- (8.72 10 5) (7.91 10 5) (6.10 10 5) × × × 256 3 3 4 tor vi is calculated by the inverse iteration. During this ♣2.95 10− 1.42 10− 1.48 10− | i × − × − × − process, we set (9.58 10 5) (2.73 10 4) (9.56 10 5) × × × 9.04 10 3 4.70 10 3 3.43 10 4 − − − A A + const. vi 1 vi 1 × − × − × − (2.95 10 4) (6.79 10 4) (1.49 10 4) ← × | − ih − | × × × 512 3 3 4 7.15 10− 3.12 10− 3.43 10− before calculating vi . This resolves degeneracy (in case × −4 × −4 × −4 | i (1.64 10 ) (6.78 10 ) (1.75 10 ) we have), so that calculated eigenvector vi becomes or- × × × 2 3 4 | i 1.86 10− 9.58 10− 6.61 10− thogonal to v0 ,..., vi 1 with high accuracy. We still × − × − × − | i | − i (2.62 10 4) (1.82 10 3) (3.67 10 4) test orthogonality and, in case required, perform an or- 768 × 2 × 3 × 4 1.45 10− 8.38 10− 6.02 10− thogonalization of vi ’s. Enhancement in accuracy of λi × − × − × − | i (2.80 10 4) (1.85 10 3) (2.07 10 4) is achieved during the inverse iteration steps. The steps × × × 2 2 3 3.48 10− 1.48 10− 1.25 10− are retried, often with an initial vector set to the vector × − × − × − (8.77 10 4) (8.32 10 4) (3.39 10 4) found in the previous round of steps, and sometimes with a 1024 × 2 × 2 × 3 2.59 10− 1.34 10− 1.03 10− randomly chosen initial vector, until sufficient convergence × − × − × − (3.67 10 4) (6.35 10 4) (3.64 10 4) of vi and λi for required precision is reached. × × × 2 2 3 | i 5.37 10− 2.44 10− 1.95 10− As for the functions of PARI, “eigen” firstly computes × − × − × − (1.41 10 3) (9.68 10 4) (6.87 10 4) the roots of the characteristic polynomial and secondly 1280 × 2 × 2 × 3 3.92 10− 2.09 10− 1.70 10− uses the Gaussian elimination to find eigenvectors; in con- × − × − × − (9.90 10 4) (1.56 10 3) (7.23 10 4) trast, “jacobi” simply uses the Jacobi method. There × × × are known facts on PARI library functions [29]: function “eigen” uses a naive algorithm so that it fails to compute eigenvectors for matrices with degenerate eigenvalues; function “jacobi” handles real symmetric matrices only, so that an Hermitian matrix A should be re- Re(A) Im(A) formed into a symmetric matrix (see Im(A)− Re(A) Ch. 11.5 of Ref. [30] for details about this reformation) as a workaround to find eigenvalues and eigenvectors using 6 the function. We compare the average time consumption to find all the eigenvectors of a random 100 100 Hermitian matrix A with a unit Frobenius norm for× several different internal Table 3: Comparison of the average real time consumption to find all precisions. To generate A, matrix elements aij = aji∗ are the eigenvectors of a normalized random 100×100 Hermitian matrix. simply randomly generated and then the normalization of The average was taken over ten different matrices. The standard the matrix is performed. The eigenvectors found by a di- deviation is shown in parentheses in small fonts. “prec” stands for agonalization are the column vectors of unitary matrix U the precision. Environments: Fedora 15 OS on Intel Core i5 M460 1 CPU with 4GB RAM for upper entries () and Fedora 16 OS on such that U − AU = diag. AMD FX-8120 CPU with 16GB RAM for lower entries (♣). Note: The probability for a random matrix to be degenerate For precision 256 [bits], function “eigen” of PARI stopped with an ∗ is very small. Tested matrices were in fact nondegenerate error and output no result. Precision was doubled for“jacobi” (see and “eigen” worked fine except for the precision 256 [bits] the text). for which it stopped due to a low-accuracy error. As for prec ZKCM, PARI, PARI, jacobi ∗ “ ”, we used the workaround as mentioned above diag H [sec] eigen [sec] jacobi [sec] so that the matrix actually input into the function was a 175 (0.105) N/A (N/A) 103 (0.324) 200 200 real symmetric matrix. In this case, we had to 256 137 (0.769) N/A (N/A) 91.2 (0.372) double× the precision of the matrix in order to keep the off- ♣ 1 259 (0.160) 171 (1.27) 265 (0.974) diagonal elements of D = U − AU sufficiently small (i.e., 512 (1.41) (0.426) (0.502) (prec 10) 203 145 202 Dij,i j . 2− − ) for the specified precision prec. = 413 (0.378) 237 (1.24) 477 (2.10) This| 6 was| not a theoreticale consequence but an empirical 768 330 (4.14) 206 (0.764) 367 (5.43) workarounde for the use of “jacobi”. In fact, with precision 632 (0.358) 378 (1.58) 726 (2.24) 512 (1024) [bits], “jacobi” constantly achieved Dij,i=j 1024 | 6 |∼ 501 (4.14) 309 (1.14) 506 (3.65) 1.0 10 78 1.0 2 256 ( D 1.0 10 155 1.0 − − ij,i=j − 903 (0.315) 503 (1.21) 1020 (5.47) 2 512× ). Thus,∼ it was× reasonable| that6 |∼ doubled× precisione ∼ was× 1280 − 704 (3.53) 404 (1.13) 734 (7.98) handed over to “jacobi” whene it was called. As shown in Table 3, time consumption for “diag_H” of ZKCM is on the same order of magnitude as those for the functions “eigen” and “jacobi” of PARI, as far as we tested for the precision between 256 and 1280 [bits]. In addition, we also compare the average time consumption to find all the eigenvectors of a random N N Hermi- tian matrix with a unit Frobenius norm for several× different values of N with fixed precision 768 [bits]. We doubled the precision for “jacobi” due to the above-mentioned reason. Table 4: Comparison of the average real time consumption to find all the eigenvectors of a normalized random N ×N Hermitian matrix un- As shown in Table 4, time consumption of “diag_H” is on der the fixed precision 768 [bits] [precision was doubled for“jacobi” the same order of magnitude as those for “eigen” and (see the text)]. The average was taken over ten different matrices. “jacobi” as far as we tested for 25 N 125. The standard deviation is shown in parentheses in small fonts. Envi- ≤ ≤ ronments: Fedora 15 OS on Intel Core i5 M460 CPU with 4GB RAM The evaluated functions are all designed to find eigen- for upper entries () and Fedora 16 OS on AMD FX-8120 CPU with vectors in addition to eigenvalues. One may think of the 16GB RAM for lower entries (♣). case only eigenvalues are needed, for which shorter computation time is expected. The PARI library does not N ZKCM, PARI, eigen PARI, jacobi have a function for this purpose. The ZKCM library has diag H [sec] [sec] [sec] the function “eigenvalues_H” which is created by simply 3.41 (0.0152) 0.941 (0.00406) 5.99 (0.0699) 25 omitting the inverse iterations from “diag_H”. The prob- ♣2.68 (0.0130) 0.841 (0.0399) 4.79 (0.0219) lem is that the precision of computed eigenvalues cannot 34.5 (0.0624) 14.5 (0.0248) 50.9 (0.364) 50 be guaranteed without corresponding eigenvectors. The 27.1 (0.174) 12.8 (0.0483) 40.9 (0.114) achievable precision in the absence of inverse iterations is 146 (0.238) 74.3 (0.636) 182 (0.977) 75 evaluated as below together with the time consumption. 114 (0.604) 64.9 (0.345) 144 (0.390) 413 (0.378) 237 (1.24) 477 (2.10) 100 Numerical error in the absence of inverse iterations. The 330 (4.14) 206 (0.764) 367 (5.43) 961 (1.10) 596 (1.72) 1070 (7.40) performance of “eigenvalues_H” is evaluated in Tables 125 5 and 6 in comparison with “diag_H”. The real time 752 (4.73) 508 (2.14) 754 (4.62) consumption to find all the eigenvalues of a normalized randomly-generated N N Hermitian matrix and the numerical error in the computed× eigenvalues are used for this comparison. Here, the matrix was generated by a random 7 Table 5: Comparison of the average real time consumption for “eigenvalues H” and “diag H” to find all the eigenvalues of a normalized random 100 × 100 Hermitian matrix. The average numerical error in the computed spectrum is also shown in the columns “Er- ror”. The average was taken over ten different matrices. In each cell, the standard deviation is shown in parentheses in small fonts. “prec” stands for the precision. Environments: Fedora 15 OS on Intel Core i5 M460 CPU with 4GB RAM for upper entries () and Fedora 16 OS on AMD FX-8120 CPU with 16GB RAM for lower entries (♣).

eigenvalues H diag H prec Time Error Time Error [sec] [sec] 32 76 9.24 1.70 10− 173 7.24 10− × − × − (0.250) (2.98 10 32) (1.05) (8.69 10 76) Table 6: Comparison of “eigenvalues H” and “diag H” in real time 256 × 32 × 76 consumption to find all the eigenvalues of a normalized random N × ♣7.26 1.19 10− 136 3.83 10− × − × − N Hermitian matrix, for several different matrix sizes N for the (0.149) (2.00 10 32) (1.13) (2.01 10 76) × × fixed precision 768 [bits]. The average over ten trials is shown. The 49 153 13.2 1.32 10− 259 3.86 10− columns “Error” show the average numerical error in the computed × − × − (0.194) (2.99 10 49) (1.60) (1.26 10 153) spectrum. The standard deviation is shown in parentheses for each × × 512 49 153 value. Environments: Fedora 15 OS on Intel Core i5 M460 CPU 10.4 3.63 10− 202 3.96 10− × − × − with 4GB RAM for upper entries () and Fedora 16 OS on AMD (0.142) (6.70 10 49) (1.88) (2.00 10 153) × × FX-8120 CPU with 16GB RAM for lower entries (♣). 57 230 19.1 9.35 10− 413 3.01 10− × − × − (0.368) (8.62 10 57) (2.61) (1.12 10 230) × × eigenvalues H diag H 768 56 230 15.2 3.71 10− 327 4.88 10− N Time Error Time Error × − × − (0.272) (8.16 10 56) (2.82) (5.43 10 230) [sec] [sec] × × 56 307 57 231 27.6 2.08 10− 637 4.03 10− 0.438 4.46 10− 3.34 9.10 10− × − × − × −57 × −231 (0.454) (3.71 10 56) (3.27) (3.58 10 307) (0.0179) (8.47 10 ) (0.0253) (3.84 10 ) × × 1024 × 56 × 307 25 57 231 22.1 4.00 10− 498 3.55 10− ♣0.343 3.42 10− 2.69 6.35 10− × − × − × −57 × −231 (0.346) (7.49 10 56) (2.99) (3.01 10 307) (0.0153) (8.76 10 ) (0.0302) (1.72 10 ) × × × × 56 384 56 230 35.9 6.00 10− 904 6.65 10− 2.75 1.67 10− 34.0 2.28 10− × − × − × −56 × −230 (0.871) (9.69 10 56) (4.00) (1.18 10 383) (0.0704) (3.14 10 ) (0.220) (2.51 10 ) × × 1280 × 54 × 384 50 57 230 28.4 9.84 10− 706 2.11 10− 2.19 9.67 10− 27.2 2.08 10− × − × − × −56 × −230 (0.539) (3.10 10 53) (4.64) (6.12 10 385) (0.0824) (1.49 10 ) (0.190) (1.02 10 ) × × × × 56 230 8.44 7.30 10− 144 3.16 10− × − × − (0.149) (1.58 10 55) (0.762) (3.58 10 230) × × 75 56 230 6.73 2.64 10− 114 2.40 10− × − × − (0.163) (3.56 10 56) (0.692) (1.03 10 230) unitary transformation of a diagonal matrix D with ran- × × 57 230 dom real elements, where D was normalized by its Frobe- 19.1 9.35 10− 413 3.01 10− × − × − (0.368) (8.62 10 57) (2.61) (1.12 10 230) nius norm in advance. The numerical error in the com- × × 100 56 230 puted spectrum e′i for “eigenvalues_H” is quantified 15.2 3.71 10− 327 4.88 10− { } × −56 × −230 by E( e )= e e where e are the true eigenval- (0.272) (8.16 10 ) (2.82) (5.43 10 ) ′i i i ′i i × × { } | − | 55 230 ues. Here, e′i and ei are sorted in the same order (we 36.4 1.22 10− 950 3.76 10− { P} { } × −55 × −230 employed the descending order). Similarly, the numeri- (0.794) (2.28 10 ) (5.58) (1.75 10 ) × × 125 56 230 cal error in the computed spectrum e′′i for “diag_H” is 28.4 1.52 10− 753 3.15 10− { } × −56 × −231 quantified by E( e )= e e . For each cell of the (0.478) (2.08 10 ) (5.25) (8.01 10 ) ′′i i i ′′i × × tables, we performed{ } the same| − simulation| ten times with different random matricesP and took the average for the time consumption or the numerical error. Note that this performance evaluation was separately performed with the evaluations shown in the previous tables. The matrix size N was fixed to 100 and several values of precision were tried for Table 5; the precision was fixed to 768 [bits] and several values of N were tried for Table 6. It has turned out that “eigenvalues_H” is faster than “diag_H” by one or two orders of magnitude to compute eigenvalues. The only difference in the internal struc- tures of these functions is the inverse iterations to find eigenvectors and at the same time enhance the accuracy 8 of eigenvalues. Thus the result shows most of time con- As for installation of the ZKCM QC library, sumption in “diag_H” is spent for inverse iterations. A the standard process “./configure make drawback to use “eigenvalues_H” is the nonnegligible er- sudo make install” should work; if not,→ the docu-→ ror in the computed eigenvalues. For instance, we have ment should be consulted. 49 < E( e′i ) > 10− when the precision is 512 [bits], i.e., We briefly overview the theory of the MPS simulation { } ∼ 154 when the machine epsilon is 10− . Indeed, the error before introducing a program example. is much smaller than the machine∼ epsilon for double pre- 16 cision ( 10− ), but is not in the acceptable range for 4.1. Brief overview of the theory of time-dependent MPS multiprecision∼ computation. simulation So far, the ZKCM library has been introduced and eval- Consider an n-qubit pure quantum state uated on its performance. An extension library for the 1 1 study of quantum computation will be introduced in the ··· Ψ = ci0 in−1 i0 in 1 next section. | i ··· | ··· − i i0 in−1=0 0 ··· X ··· 2 4. ZKCM QC library − with i0 in−1 ci0 in 1 = 1. If we keep this state as data as it··· is, updating| ··· the| data for each time of unitary The ZKCM QC library is an extension of the ZKCM timeP evolution spends O(22n) floating-point operations. library. It has several classes to handle tensor data use- To avoid such an exhaustive calculation, in the matrix- ful for the (time-dependent) MPS simulation of a quan- product-state method, the data is stored as a kind of com- tum circuit. Among the classes, the “mps” class and the pressed data. The state is represented in the form “tensor2” class will be used by user-side programs. The former class conceals the complicated MPS simulation pro- 1 1 m0 1 m1 1 mn−2 1 − − − cess and enables writing programs in a simple manner. Ψ = | i ··· ··· The latter class is used to represent two-dimensional ten- i0=0 in−1=0 v0=0 v1=0 vn−2=0 X X X X X sors which are often simply regarded as matrices. A quan- Q0(i0, v0)V0(v0)Q1(i1, v0, v1)V1(v1) (1) tum state during an MPS simulation can be obtained as ··· Qs(is, vs 1, vs)Vs(vs) a (reduced) density matrix in the type of “tensor2”. For ··· − ··· convenience, there are functions to convert a matrix in Vn 2(vn 2)Qn 1(in 1, vn 2) i0 in 1 , the type of “tensor2” to the type of “zkcm_matrix” and ··· − − − − − | ··· − i vice versa. More details of the classes are explained in the n 1 document placed in the “doc” directory of the ZKCM QC where we use tensors Qs s=0− with parameters is, vs 1, vs { } − package. (parameters v 1 and vn 1 exceptionally do not exist) and n 2 − − Vs − with parameter vs; ms is a suitable number of It might be curious to condensed-matter physicists why { }s=0 multiprecision computation is needed in an MPS simu- values assigned to vs with which the state is represented lation. Indeed, truncations of Schmidt coefficients (and precisely or well approximated. This form is one of the corresponding Schmidt vectors) have been studied as a forms of matrix product states (MPSs). The data are kept possible source of errors [33] while precision of basic in the tensors. We can see that neighboring tensors are floating-point operations has not been of main concern in correlated to each other; the data compression is owing to physics simulations using the MPS data structure. Pre- this structure. Let us explain a little more details: Qs(is, vs 1, vs) is a cision shortage cannot be a main source of errors in light − of truncation errors. However, it has been uncommon5 tensor with 2 ms 1 ms elements; Vs(vs) is a tensor in × − × to use a truncation of nonzero Schmidt coefficients in which the Schmidt coefficients for the splitting between the time-dependent MPS simulations of quantum computing sth site and the (s+1)th site (i.e., the positive square roots [12, 19, 27]. Under this condition, precision shortage be- of nonzero eigenvalues of the reduced density operator of comes the only crucial source of errors and it is hence of qubits 0,...,s) are stored. This implies that, by using 0...s s+1...n 1 our main concern how much precision is required for an ac- Vs and eigenvectors Φvs ( Φvs − ) of the reduced 0...s| s+1i ...n| 1 i curate simulation of quantum computing. We will discuss density operator ρ (ρ − ) of qubits 0,...,s (s + 1,...,n 1), the state can also be written in the form of more on this issue in Sec. 4.3. In particular, we will show − an example of simulating a quantum algorithm for which a Schmidt decomposition truncation of at most a single nonzero Schmidt coefficient ms 1 − for each step results in a significant error; in addition, a 0...s s+1...n 1 Ψ = Vs(vs) Φ Φ − . (2) precision slightly beyond the double precision is necessary | i | vs i| vs i vs=0 for this example. X In an MPS simulation, very small coefficients and cor- 5 responding eigenvectors can be truncated out unlike a Here, we are considering the standard quantum circuit model for quantum computing. It is not the case in a Hamiltonian-based usual Schmidt decomposition. It may happen that all the adiabatic-evolution model [34]. nonzero coefficients are nonnegligible. This is actually the 9 case in practice when we simulate quantum computing as declared in the namespace “tensor2tools” (see the doc- we will discuss in Sec. 4.3. We can still enjoy a consid- ument placed in the “doc” directory for the details of this erable data-size reduction since vanishing coefficients are namespace). truncated out.

Besides the truncation, a significant advantage to use #include "zkcm_qc.hpp" the MPS form is that we have only to handle a small number of tensors when we simulate a time evolution under int main (int argc, char *argv[]) each quantum gate. For example, when we apply a uni- { //Use the 256-bit precision. tary operation U(4) acting on, say, qubits s and s + 1, ∈ zkcm_set_default_prec(256); we have only to update the tensors Qs(is, vs 1, vs), Vs(vs), //Num. of digits for each output is set to 8. − and Qs+1(is+1, vs, vs+1). For the details of how tensors zkcm_set_output_nd(8); are updated, see Refs. [12, 27]. The simulation of a sin- //First, we make an MPS representing |000>. gle quantum gate U(4) spends O(m3 ) floating-point ∈ max mps M(3); operations where mmax is the largest value of ms among std::cout << "The initial state is " the sites s. (Usually, unitary operations U(2) and those << std::endl; U(4) are regarded as elementary quantum∈ gates.) A //Print the reduced density operator of the ∈ //block of qubits from 0 to 2, namely, 0,1,2, quantum circuit constructed by using at most g single- //using the binary number representation for qubit and/or two-qubit quantum gates can be simulated //basis vectors. 3 std::cout << M.RDO_block(0, 2).str_dirac_b() within the cost of O(gnmmax,max) floating-point opera- << std::endl; tions, where n is the number of wires and mmax,max is the largest value of mmax over all time steps. std::cout << "Now we apply H to the 0th qubit." The computational complexity may be slightly differ- << std::endl; ent for each software using MPS. In the ZKCM QC li- M.applyU(tensor2tools::Hadamard, 0); brary, we have functions to apply quantum gates U(8) ∈ std::cout<< "Now we apply CNOT to the \ to three chosen qubits. Internally, three-qubit gates are qubits 0 and 2." handled as elementary gates. This makes the complex- << std::endl; ity a little larger. A simulation using the library spends M.applyU(tensor2tools::CNOT, 0, 2); 4 O(gnmmax,max) floating-point operations, where g is the //The array is used to specify qubits to compute number of single-qubit, two-qubit, and/or three-qubit //a reduced density matrix. It should be gates used for constructing a quantum circuit. As the pro- //terminated by the constant mps::TA. cess to update the tensors to simulate a three-qubit gate int array[] = {0, 2, mps::TA}; std::cout << "At this point, the reduced \ has not been written in detail in the literature as far as density matrix of the qubits 0 and 2 is " the author knows, it is explained in Appendix A. << std::endl The MPS simulation process, which is in fact often com- << M.RDO(array).str_dirac_b() << std::endl; plicated, can be concealed by the use of ZKCM QC. One zkcm_quit (); may write a program for quantum circuit simulation in return 0; an intuitive manner. Here is a very simple example. In } the followings, we use version 0.1.0 of ZKCM QC for our Listing 2: qc simple example.cpp programs. The program is compiled and executed typically in the following way. 4.2. Program example [user@localhost foo]$ c++ -o qc_simple_example \ As a simple example, we simulate the time evolution qc_simple_example.cpp -lzkcm_qc -lzkcm -lmpfr -lgmp \ -lgmpxx H 1 000 ( 000 + 100 ) [user@localhost foo]$ ./qc_simple_example | i 7→ √2 | i | i The initial state is CNOT 1 1.0000000e+00|000><000| ( 000 + 101 ) 7→ √2 | i | i Now we apply H to the 0th qubit. Now we apply CNOT to the qubits 0 and 2. 1 1 At this point, the reduced density matrix of \ where Hadamard operation H = 1 acts on √2 1 1 the qubits 0 and 2 is qubit 0 and CNOT = 00 00 + 01 01 + 10 −11+ 11 10 5.0000000e-01|00><00|+5.0000000e-01|00><11|\ +5.0000000e-01|11><00|+5.0000000e-01|11><11| acts on qubits 0 and| 2.ih Then| we| ih see the| | reducedih | | densityih | matrix of qubits 0 and 2. Besides this example, the package of ZKCM QC has The program example is shown in Listing 2. (A slightly sample programs for simulating Grover’s quantum search extended sample code is placed in the “samples” directory [38] and simulating a simple projective measurement. In of the ZKCM QC package.) It utilizes several matrices addition, we will see an example to simulate the quantum 10 4 4 4 search under the simplest setting in Appendix A.1, which 0 H g g -1 H is written as a part of the explanation of how to handle a 4 4 4 three-qubit gate in an MPS simulation. 0 H g g -1 H Ng 4.3. Source of numerical errors in an MPS simulation of 4 4 4 quantum computing 0 H g g -1 H

It is a standard strategy in the MPS simulation and re- 0 ...... lated methods in computational condensed matter physics 0 X H to impose a certain threshold m to the number of trunc (a) Schmidt coefficients; only larger mtrunc Schmidt coefficients (and corresponding Schmidt vectors) are employed and the remaining are discarded at each time when tensors are updated [13, 14]. It has been tacitly assumed that g 0 0 truncations are the main source of numerical errors and a 0 0 numerical error due to precision shortage has not gathered 0 attention. Venzl et al. [33] pointed out that hardness of a (b) time-dependent MPS simulation depends on the distribution of Schmidt coefficients. They reported certain param- Figure 3: (a) Quantum circuit of the Deutsch-Jozsa algorithm for the specified function (see the text). (b) Internal structure of gate eter choices in the tilted Bose-Hubbard model, for which g. Gate g−1 is the reverse of g. the distribution has a rather long tail for larger Schmidt coefficients, i.e., negligibly small ones are not dominant. In such a case, mtrunc should be rather close to the maximum Schmidt rank (over all splittings between consecutive sites We consider the Deutsch-Jozsa algorithm [39]. In a and over all time steps) in order to keep the truncation brief explanation, there is a promise that function f : error small. In contrast, a usual time-dependent MPS sim- 0, 1 l 0, 1 is either balanced (i.e., # x f(x)=0 = ulation for other nearest-neighbour coupling models stud- {# x}f(→{x)=1}) or constant (i.e., f(x) is{ same| for all}x) ied so far uses a relatively small value for mtrunc, typically { | } where x 00 0l 1,..., 10 1l 1 . The task is to de- fifty to one hundred [35]. With this much value of mtrunc, cide whether∈{ a··· given− function···f is− balanced} or constant. a time-dependent MPS simulation usually does not exhibit l 1 This takes 1+2 − queries for the worst case in classical notable accumulation of numerical errors unless more than computation while it takes only a single query in quantum several hundred time steps of evolution is tried [35]. A sig- computation with the Deutsch-Jozsa algorithm. The al- nificant accumulation of truncation errors were reported l l gorithm is described as follows: (i) Apply H⊗ Vf H⊗ to l for a larger number of time steps [35, 36]. qubits initially in the state 00 0l 1 where Vf is an op- As mentioned before, it has been uncommon to use trun- eration to put the factor ( 1)| f(···x) to− eachi x ; (ii) Measure cations in (time-dependent) MPS simulations of quantum the l qubits in the computational− basis.| Wheni f is bal- computing [12, 19, 27]. This is reasonable because quan- anced, the probability of finding the qubits simultaneously tum algorithms involves many H and CNOT operations in 0’s in this measurement vanishes; when f is constant, it (see Sec. 4.2 for their definitions). This makes a quan- is exactly unity. For the details of the theoretical analysis tum state evolving under a quantum circuit quite often of the algorithm, see, e.g., Sec. 3.1.2 of Ref. [10]. a nearly equally biased superposition of several indices. y y Let us consider a particular function f( 0 Ng 1)= By tracing-out some subsystem under this condition, we ··· − Ng 1 have a reduced density matrix whose nonzero eigenvalues i=0− g(yi) with g(x0x1x2x3) = (x0 x1) (x1 x2) 4 ∧ ∨ ∧ ∨ are all nonnegligible. Thus it is not possible to truncate (x2 x3) where yi 0, 1 and xj 0, 1 ; Ng is a out even a single nonzero Schmidt coefficient, irrespective Lpositive∧ integer (here,∈ symbol { } stands∈ for { the} exclusive of the number of time steps. In addition, we have re- OR operation). Figure 3 shows the quantum circuit of cently shown [37] that an accumulation of errors due to the algorithm for this function.L This function is balanced precision shortage is significant even without truncation of for any value of Ng 1. In the figure, a single and nonzero Schmidt coefficients as far as we have seen in an the connected represent≥ the CNOT operation with• the example to iterate the quantum search routine [38]. Thus control qubit specified⊕ by and the target qubit specified high-precision computation has a practical advantage in by . It flips the target• qubit if the control qubit is in an MPS simulation for a quantum circuit model with a 1 .⊕ Similarly, two ’s and connected to each other rep- large circuit depth. resent| i the CCNOT• operation.⊕ It flips the target qubit if Here, we show a typical example where truncating out the two control qubits are in 11 . (See also Appendix A at most a single nonzero Schmidt coefficient for each time for the implementation of three-qubit| i operations in our causes a significant error. In this example, slightly more library.) By the structure of the circuit, each of the Ng than double precision is required for a stable simulation measurements should report Prob(0000) = 0 if there is no although the depth of the quantum circuit is not very large. numerical error. (This is easily proved: assuming different 11 values of Prob(0000) for two different bundles of qubits After writing instructions corresponding to individual contradicts to the fact that the bundles are equivalent to quantum gate operations, we need some code lines to ob- each other by the circuit structure.) tain the probability of finding zeros in the output. Al- It is straight-forward to write a program code for the though we have a function to perform a projective mea- quantum circuit using the ZKCM QC library. We begin surement acting on a single qubit, this is not conve- with header file descriptions: nient. We instead compute the (0, 0) element of the reduced density matrix of qubits of our interest. To obtain #include Prob(00010203), for example, we write #include std::cout << "Prob(0_0 0_1 0_2 0_3)=" The second header file is needed by the following function << M.RDO_block(0, 3).get(0, 0) to obtain the current time in seconds: << std::endl; double current_time_in_sec () In this way, one can easily write a program code for the { quantum circuit. timeval T; In the present context, it is also demanded to see the ::gettimeofday(&T, NULL); intrinsic data of the MPS step by step. To obtain the return ((double)(T.tv_sec) number ms of surviving Schmidt coefficients Vs(vs) for the + 1.0e-6 * (double)(T.tv_usec)); splitting between sites s and s + 1, one may write, e.g., } int num = M.get_m(s); We use this function when we record the data of time consumption for Table 7. We may omit it otherwise. The for some integer 0 s n 2. In addition, each Schmidt ≤ ≤ − main function begins with coefficient is accessible by “M.V(s)(i)” which is a reference to the ith element of Vs. It is also convenient int main (int argc, char *argv[]) to use “M.get_m_max()” (mmax = maxsms for the cur- { rent time step) and “M.get_m_maxmax()” (mmax,max = zkcm_set_default_prec (prec); maxtmaxsms where t stands for time) so as to find the with some precision “prec” and ends with time step where the Schmidt rank reaches the maximum. It should be noted that sometimes the maximum Schmidt zkcm_quit(); rank appears during the internal process hidden behind return 0; each gate operation. This is because each gate operation } for nonconsecutive qubits needs to internally move them to consecutive places before operation and move them back We describe the quantum circuit step by step in to their original places after operation owing to the one- the middle part of the main function. (We call dimensional data structure of MPS. This is done by swap- “current_time_in_sec()” before and after this part to ping consecutive qubits step by step. For tracing the inter- calculate the time consumption.) First of all, an object of nal time evolution of ms’s, one needs to mimic this process the MPS data structure for n =9Ng + 2 qubits is created by using the operation “M.SWAP(a,b);” (this swaps spec- by ified qubits a and b) manually. mps M(n); We have explained how the program code is written M.set_m_trunc(mtrunc); to simulate the quantum circuit of Fig. 3. Now we set Ng to 7; we have 65 qubits in total in the circuit. It is where we also set the value of mtrunc by our option as in certainly intractable to handle the circuit by the brute- the second line. Then we write each gate operation one by force method because the dimension of the Hilbert space one. For example, a Pauli X gate (or a bit flip) acting on is 265 3.69 1019. It was numerically found, however, ≃ × the ith qubit is written as that the largest possible Schmidt rank of the MPS during evolution in the circuit is only 28 as shown in Fig. 4. M.applyU(tensor2tools::PauliX, i); Because of this reason, the MPS simulation of the algorithm took only approximately seven minutes (see Table Similarly, an Hadamard gate operation is written in the 7) when the precision was set to 256 [bits] and no trunca- same manner with tensor2tools::Hadamard. A CNOT tion of nonzero Schmidt coefficients was employed. Here, gate with control qubit c and target qubit t is written as we used a computing server with the Red Hat Enterprise M.applyU(tensor2tools::CNOT, c, t); Linux 6 64-bit OS, Intel Xeon E7-8837 2.67GHz (2.80GHZ maximum) CPU, and 315GB RAM, instead of the common A CCNOT operation with control qubits a, b, and target PCs that we normally used.6 qubit t is written as

M.CCNOT(a, b, t); 6This was because a PC with 4GB RAM became unstable due to 12 0.1 36 Error mmax,max 32 20 19 0.08 28 18 17 16 24 15 0.06 14 20 13 12 Error 16 max,max 11 0.04 m 10 12 9

Multiplicity 8 0.02 8 7 6 4 5 4 0 0 3 12 16 20 24 28 32 2 1 m trunc 0 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Schmidt coefficient Figure 4: Error in the computed value of Prob(00010203) and mmax max as functions of mtrunc. Precision was fixed to 256 bits. , Figure 5: Distribution of nonzero Schmidt coefficients at the point where the second CNOT gate was being processed among the 2Ng +1 CNOT gates (Ng was set to 7) in the middle part of Fig. 3(a). Note that we are employing the definition of Schmidt coefficients with square roots in this paper. Thus the sum of squared values of the The main aim of the simulation is to investigate if Schmidt coefficients equals to one. any truncation of nonzero Schmidt coefficients is possible. Figure 4 shows the error in the computed value of Prob(00010203) (the discrepancy from zero) as a function of mtrunc. The maximum Schmidt rank mmax,max observed during the simulation is also plotted in the figure as a func- Table 7: Precision (prec) dependence of the numerical error (Error) tion of mtrunc. It is clearly shown that mtrunc should be in the computed value of Prob(00010203). No truncation of nonzero equal to or more than the exactly maximum Schmidt rank Schmidt coefficients was employed. The real time consumed for the that is observed in the absence of truncations; otherwise simulation is also shown (Time), which is an average over 10 trials of a significant numerical error appears. (In Ref. [37], we the same simulation (the standard deviation is shown in the parentheses). The simulated circuit had 65 qubits because Ng was set to have shown a similar result for a smaller circuit of the al- 7. Environment: Red Hat Enterprise Linux 6 64-bit OS, Intel Xeon gorithm for a different balanced function.) As previously E7-8837 2.67GHz (2.80GHZ maximum) CPU, and 315GB RAM. mentioned, the distribution of nonzero Schmidt coefficients is closely related to this phenomenon. We show the distri- prec Error Time [sec] Convergence failure in bution in Fig. 5, which was taken at the point where the 55 N/A second CNOT gate was being processed among the 2Ng +1 ≤ some of eigenvectors 32 CNOT gates in the middle part of Fig. 3(a). It clearly de- 56 1.80 10− 341 (10.4) × 26 picts that none of the nonzero Schmidt coefficients was 57 5.66 10− 335 (10.2) × 32 negligible. 58 1.72 10− 338 (7.82) × 33 59 1.55 10− 379 (6.45) In addition, it was found that precision slightly more × 34 than the double precision was required to perform a stable 60 1.92 10− 348 (13.3) . ×. . simulation even when we did not impose any truncation . . . 139 of nonzero Schmidt coefficients, as shown in Table 7. For 256 5.91 10− 416 (8.50) × precision less than or equal to 55 bits, some of matrix . . . diagonalization routines failed after the half point of the . . . 121 circuit, which indicates an accumulated error due to preci- 512 7.16 10− 709 (19.0) . ×. . sion shortage. More specifically speaking about this case, . . . a small non-Hermitian fraction due to the accumulated er- 460 768 1.78 10− 922 (19.7) ror in a reduced density matrix resulted in the convergence × . . . error of eigenvectors during an update of tensors. . . . 615 1024 6.50 10− 1620 (12.3) . ×. . . . . 770 memory shortage for this simulation for precision ≥ 1024 bits. Each 1280 7.77 10− 2170 (16.9) run of the simulation in the computing server consumed at most × approximately 8GB memory space for precision 1024 bits and 1280 bits. 13 5. Discussion is slower, by a significant constant factor, than the QR method...”—page 571 of Ref. [30]; “...the Jacobi method 5.1. Discussion on the ZKCM library is several times slower than a reduction to tridiagonal form, With the ZKCM library, standard tasks in numerical followed by Francis’s algorithm.”—page 488 of Ref. [40]. matrix calculations can be completed within the time con- Thus, it is unusual that, for a relatively large matrix with sumption on the same order of magnitude as that with the order 100, there is no significant advantage in using the PARI library as we have seen in the comparison per- the Householder-QR method. In a phenomenological ex- formed in Sec. 3. The comparison also showed that ZKCM planation, this is due to the large cost of inverse iterations. is slower than PARI, even in a simple matrix multiplica- By using the Gprof program [41] (a monitoring software), tion, which suggests that basic arithmetic operations make it turned out that, in “diag_H”, more than 71.1 percent of a certain gap in speed of the two libraries. In fact, ZKCM running time was spent for the inverse iterations in case of uses MPFR functions while PARI uses GMP functions a 100 100 matrix with the precision 512 [bits]. In fact for the basic arithmetic computation. MPFR is approxi- speedup× of one or two orders of magnitude≥ was possible mately two and a half times as slow as GMP in multiplying by omitting inverse iterations for finding eigenvalues as we floating-point numbers as far as we tested.7 Thus it is rea- have shown in Tables 5 and 6. It was however impossible sonable to find a certain gap in computational speed. We to achieve a required precision without inverse iterations believe the gap is acceptable as long as it does not change as clearly shown in the tables. the order of magnitude of computation time for basic ma- Thus, it was unexpectedly expensive to achieve a suffi- trix computation. It has been probably a historical reason cient convergence by inverse iterations. We, of course, used that PARI has not used MPFR so far.8 We chose MPFR the LU decomposition to reduce the cost of each inverse because it has floating-point exceptions and other useful iteration. This, however, did not mitigate the total cost of functionalities for real floating-point numbers by default. the inverse iterations sufficiently. As a matter of fact, the The only unexpected result in Sec. 3 is the one about total number of the inverse iterations had to be increased Hermitian matrix diagonalization shown in Tables 3 and along with the increase in required precision. This, at a 4. It has been shown that time consumptions of functions glance, does not look in accordance with a conventional “diag_H” of ZKCM and “eigen” and “jacobi” of PARI theory of inverse iteration [42] suggesting that the ma- are on the same order of magnitude. This is an unusual chine epsilon does not make a difference in the process of result because the Jacobi method employed in “jacobi” inverse iteration. The conventional theory, however, con- looks as fast as a variant of the QR method employed in siders the process where the inverse iteration is used for “diag_H” for a 100 100 matrix. computing an eigenvector for a given approximate eigen- × It is likely that “diag_H” and “jacobi” are used for value. In our case, in contrast, we also improve accuracy of a similar purpose or under a similar situation, since they the eigenvalue using the computed eigenvector. A routine both work fine for a matrix with degenerate eigenvalues. of inverse iteration is often called several times in order for (It is known [29] that “eigen” almost always fails for a achieving sufficient accuracy for each pair of an eigenvalue matrix with degenerate eigenvalues.) In this point of view, and the corresponding eigenvector. Furthermore, program too, it is meaningful to compare “diag_H” and “jacobi”. routines called by the routine of inverse iteration are not As mentioned above, function “jacobi” of PARI is an error-free in practice. Thus it was not surprising that the implementation of a standard Jacobi method. Function choice of precision considerably affected the cost of inverse “diag_H” of ZKCM employs the Householder-QR method iterations (and hence the time consumption of “diag_H” with the Wilkinson shift for computing approximate eigen- shown in Table 5). values and uses several sets of inverse iterations to find Another factor that makes inverse iterations expensive eigenvectors and at the same time to enhance the accu- is the cost of basic arithmetic operations in multipreci- racy of eigenvalues. It has been commonly known that the sion computing. Unlike double-precision computing where Jacobi method is slower than the QR method: “For matri- every basic arithmetic instruction is performed within a ces of order greater than about 10, however, the algorithm few clock cycles, it takes quite many cycles.9 It is fast in double-precision computing to successively perform arithmetic instructions because this does not involve any con- 7We wrote a test program to compute x ←− x ∗ y with y = 1.00 − 1.00 × 10−7 for 108 times where the initial value of x was ditional branching. In contrast, it is inevitable to involve set to 1. The precision was set to 512 bits. It took 15.3 seconds several conditional branching instructions in every arith- on average (with standard deviation 1.09 × 10−1) when we used metic operation in multiprecision computing. Thus, the MPFR’s “mpfr t” structure for floating point numbers while it took simple fact that basic arithmetic instructions are not hard- only 6.11 seconds on average (with standard deviation 4.91 × 10−2) when we used GMP’s “mpf t” structure. The average was taken over ware instructions should be a large factor. ten trials. This test was performed on a machine with the Fedora 15 64-bit OS, Intel Core i5 M460 CPU, and 4GB RAM. GMP version 9Theoretically, each instruction with precision prec takes 4.3.2 and MPFR version 3.0.0 were used. O(precc) time with c ≤ 2 dependent on the chosen algorithm. In 8 PARI has been developed since 1979 [24]. GMP appeared in real CPU time, it should scale slightly better in practice because of 1991 [8] and MPFR appeared in 2000 [9]. PARI already had plenty vectorization of operands. See the manual of GMP [8] for the details of fast floating-point functionalities at the time MPFR appeared. of algorithms for arithmetic instructions. 14 In future, we should find a strategy for faster Hermi- use a parallelized exact matrix computation and spend a tian matrix diagonalization. As a possible approach, the massive computational resource (e.g., more than one thou- precision of computation for finding approximate eigen- sand CPU processes and several hundred gigabytes physi- values can be internally enlarged so that a relatively small cal memory to handle less than forty qubits [45]). It seems number of subsequent inverse iterations should be enough that an accumulation of rounding errors is not significant to achieve a required precision. Table 5, however, shows for these simulators although they use double precision that accuracy of approximate eigenvalues is not very much computation, as this was not reported so far. In contrast enhanced by simply increasing the preset precision in the to their approach, use of the MPS method is quite eco- Householder-QR method. So far, the author could not nomical. As we have shown in Sec. 4.3, we could handle find the reason of this tendency. A detailed theoreti- 65 qubits in the MPS simulation of the Deutsch-Jozsa al- cal and practical analyses of the present implementation gorithm under a certain setup, which ran as a single CPU (and perhaps the method itself) will be required to over- process10 and took only (approximately) seven minutes in come this difficulty. For another approach, there is a case of 256-bit floating-point precision. possibility that a certain method uncommon under the Besides the MPS method, Viamontes et al.’s method double-precision environment possesses an advantage un- [16, 56, 57] using a compressed graph representation is also der a high-precision environment in a practical sense. It quite economical to simulate quantum computation. Ac- is of interest to re-investigate practical usefulness of con- cording to Refs. [16, 57], its compressed data structure is ventional methods [43] for eigenvalue problems under a sensitive to rounding errors so that they used multipreci- high-precision environment. sion computation based on the GMP library. The MPS Apart from computational speed, the syntax of a library data structure (1) is of course a compressed data struc- is also important for usability. As a C++ library, ZKCM ture for which we needed multiprecision computation for provides an easy-to-use syntax for handling matrices by stable quantum circuit simulation. It will be interesting to operator overloading. It also provides various functionali- theoretically investigate if a simulation of quantum com- ties for matrix computation as member functions of a class puting utilizing a compressed data structure generally has as well as as external functions. Thus a program code a certain inevitable sensitivity to rounding errors. using the library should look simpler than those written Finally, we discuss on the computational cost of the with some other multiprecision libraries for Fortran or C MPS method, which is known to grow polynomially in languages. In general, it is relatively easy to write an ex- the maximum Schmidt rank mmax,max during the simu- tension library of a C++ library. As we have introduced, lation [12] as mentioned in Sec. 4.1. It is expected that ZKCM has an extension library named ZKCM QC devel- MPS simulation becomes very expensive for quantum cir- oped for a (time-dependent) MPS simulation of quantum cuits of algorithms for hard problems like those for quan- circuits. The main library and the extension are under tum prime factorization [48] of a large composite num- steady development. Latest development versions can be ber. It has been discussed [49] that large entanglement downloaded from the repositories linked from the URL [7]. (this usually leads to a large Schmidt rank) must be involved in quantum prime factorization. So far, Kawaguchi 5.2. Discussion on the ZKCM QC library et al. [50] found Schmidt rank 92 in a modular exponenti- As for the ZKCM QC library, it is worth mentioning ation circuit (this is used in quantum prime factorization) that multiprecision computation is useful in MPS simu- with 35 qubits in their MPS simulation. Nevertheless, lations of quantum computing as discussed in Sec. 4.3. it has been neither proved nor numerically verified that Indeed, truncation errors are dominant whenever trunca- mmax,max grows exponentially in the number n of qubits as tions of nonzero Schmidt coefficients are employed. It is far as the author knows. Although there have been several however uncommon to employ such truncations in simulat- studies on entanglement during quantum prime factoriza- ing quantum computing since they cause an unacceptably tion [51, 52, 53, 54], they have not reached an answer to large error as we have discussed. As a consequence, the how entanglement grows in n. It is still an open problem rounding error become the only possible error. In our ex- how we rigorously estimate the value of mmax,max for a ample, slightly more than double precision was required given quantum circuit. Presently, a known upper bound for reliable MPS simulation. This is true in a different for mmax,max is a function exponentially growing in the example to simulate quantum search, which is shown in number of basic quantum gates overlapping to each other our related contribution [37]. Thus, time-dependent MPS (i.e., those stretching across the same bundle of wires) simulation is fragile not only against accumulating trunca- [55]; this is however not practically useful for a user of tion errors [35, 36] but also against accumulating rounding the MPS method to estimate the value of mmax,max for errors. High-precision computation is therefore beneficial his/her simulation. More theoretical and numerical efforts to the MPS method. are required to understand the scalability of the method. Apart from the MPS method, there have been several simulators of quantum computing which make use of par- allel programming techniques [44, 45, 46, 47] to mitigate 10Every simulation with ZKCM or ZKCM QC shown in this paper the time consumption of the brute-force method. They was run as a single CPU process. 15 6. Summary case l +2= n 1, the tensor Vl should be regarded as − +2 unity and the parameter vl+2 should be dropped.) We have introduced the ZKCM library which is a C++ The state after U is applied to the qubits l, l + 1, and library developed for multiprecision complex-number ma- l +2 can be written as trix computation. It is especially usable for high-precision numerical simulations in quantum physics; it has an easy- Ψ = Θ(il,il+1,il+2, vl 1, vl+2) | i − to-use syntax for matrix manipulations and helpful func- v − v i i i Xl 1 Xl+2 l lX+1 l+2 (A.1) tionalities like the tensor-product and tracing-out opera- e 0,...,l 1 l+3,...,n 1 vl 1 − il il+1 il+2 vl+2 − tions, the discrete Fourier transform, etc. An extension × | − i| i| i| i| i library ZKCM QC has also been introduced, which is a li- with the tensor brary for a multiprecision time-dependent matrix-product- state simulation of quantum computing. It enables a user- Θ(il,il+1,il+2, vl 1, vl+2)= friendly coding for simulating quantum circuits. − v v +1 k k k Xl Xl l lX+1 l+2

uilil+1il+2,klkl+1kl+2 Ql(kl, vl 1, vl)Vl(vl) Appendix A. Updating tensors of an MPS for −

simulating a three-qubit gate oper- Ql (kl , vl, vl )Vl (vl )Ql (kl , vl , vl ) . × +1 +1 +1 +1 +1 +2 +2 +1 +2 ation

The library ZKCM QC uses three-qubit gates as ele- First, we are going to compute the tensors Ql(il, vl 1, vl) − mentary gates in addition to single- and two-qubit gates. and Vl(vl) of the resultant state. The reduced density As it is not usually explained in detail how a three-qubit matrix of qubits 0, ..., l calculated from Eq. (A.1)f is gate is simulated in a time-dependent MPS simulation, a e detailed explanation is given here. As for simulations of 0,...,l 2 ρ = [Vl+2(vl+2)] single- and two-qubit gates, see Refs. [12, 27] for detailed ′ ′ ilvl−1i lv l−1il+1il+2vl+2 explanations. X X

Consider the quantum gate Θ(il,il+1,il+2, vl 1, vl+2)Θ∗(i′l,il+1,il+2, v′l 1, vl+2) × − − 111 111 vl 1 il v′l 1 i′l U = uilil+1il+2,klkl+1kl+2 × | − i| ih − |h |

ilil+1il+2=000 k k +1k +2=000 2 X l l Xl = [Vl+2(vl+2)] ′ ′ ilil+1il+2 klkl+1kl+2 ilvl−1i lv l−1il+1il+2vl+2 × | ih | X X Θ(il,il+1,il+2, vl 1, vl+2)Θ∗(i′l,il+1,il+2, v′l 1, vl+2) acting on the three qubits l, l + 1, and l + 2. With a focus × − − on the three qubits, the MPS of n qubits can be written ′ Vl 1(vl 1)Vl 1(v′l 1) Φvl−1 il Φv l−1 i′l × − − − − | i| ih |h | as ′ ′ ′ 111 ml−1 1,ml 1,ml+1 1,ml+2 1 = ai v − i v − Φv − il Φv − i′l − − − − l l 1 l l 1 l 1 l 1 ′ ′ | i| ih |h | i v − i v − Ψ = Ql(il, vl 1, vl) l l 1 l l 1 | i − X i i i =000 v − ,v ,v ,v =0,0,0,0 l l+1Xl+2 l 1 l l+1Xl+2 ′ ′ 0,...,l with ai v − i v − = . The matrix ρ Vl(vl)Ql (il , vl, vl )Vl (vl )Ql (il , vl , vl ) l l 1 l l 1 il+1il+2vl+2 × +1 +1 +1 +1 +1 +2 +2 +1 +2 ··· is a (2ml 1) (2ml 1h) matrix. This isi now diagonalized to 0,...,l 1 l+3,...,n 1 − − P − − × 2 vl 1 il il+1 il+2 vl+2 achieve the eigenvalues λv = [Vl(vl)] and the correspond- × | − i| i| i| i| i l 0,...,l 0,...,l 1 ing eigenvectors Φvl under the basis Φvl−1 − il . 0,...,l 1 0,...,l 1 l+3,...,n 1 | ei e {| i| i} with vl − = Vl 1(vl 1) Φv − − and vl − = Immediately we ﬁnd the values of Vl(vl). We may trun- 1 − − l 1 +2 | − l+3i ,...,n 1 | 0,...,li 1 | i e 11 Vl (vl ) Φ − , where Φ − are the Schmidt cate out negligibly small eigenvalues to reduce ml (the +2 +2 | vl+2 i | vl−1 i vectors of the block of qubits 0,...,l 1 for the splitting updated value of ml, namely, the numbere of data in Vl). l+3,...,n 1 − between l 1 and l, and Φvl+2 − are those of the block In addition, we have vector elements Cl(il, vl 1, vl)f of just − | i − of qubits l +3,...,n 1 for the splitting between l + 2 and computed eigenvectors: e l + 3. − 0,...,l 0,...,l 1 What we should do as a simulation of applying the quan- Φv = Cl(il, vl 1, vl) Φv − − il , | l i − | l 1 i| i tum gate U to Ψ is to update tensors Ql, Vl, Ql+1, Vl+1, | i ] ] from which we can derive and Ql+2 to Ql, Vl, Ql+1, Vl+1, and Ql+2, respectively, so e that the MPS with the updated tensors represents the re- Ql(il, vl 1, vl)= Cl(il, vl 1, vl)/Vl 1(vl 1). sultant state fΨ .e This processg is explained hereafter step − − − − by step. (Here| i is some note on the description: In case f l 1 = 1, thee tensor Vl 1 should be regarded as unity 11We should not employ a threshold for the number of eigenvalues − − − and the parameter vl 1 should be dropped. Similarly, in as we have discussed in Sec. 4.3. − 16 Thus we have obtained into the above equation to obtain

0,...,l Φv = Ql(il, vl 1, vl) vl 1 il . 1 l − − ] | i | i| i Ql+1(il+1, vl, vl+1)= Vl(vl)Vl+1(vl+1) Second,e we aref going to compute the tensors Vl+1(vl+1) ilil+2vl−1vl+2 ] X and Q (i , v , v ) from Eq. (A.1). The reduced 2 ∗ ] ∗ l+2 l+2 l+1 l+2 [Vl 1(vl 1)] Ql (il, vl e1, vl)Qgl+2 (il+2, vl+1, vl+2) density matrix of qubits l +2,...,n 1 is g − − − − 2 [Vl+2(vl+2f)] Θ(il,il+1,il+2, vl 1, vl+2) . l+2,...,n 1 2 × − ρ − = [Vl 1(vl 1)] ′ ′ − − i v i v i i v − l+2 l+2Xl+2 l+2 l lX+1 l 1 In this way, we have obtained the updated tensors

Θ(il,il+1,il+2, vl 1, vl+2)Θ∗(il,il+1,i′l+2, vl 1, v′l+2) ] × − − Ql(il, vl 1, vl), Vl(vl), Ql+1(il+1, vl, vl+1), − ] il+2 vl+2 i′l+2 v′l+2 V (v ), and Q (i , v , v ). × | i| ih |h | fl+1 l+1 e l+2 l+2 l+1 l+2 2 = [Vl 1(vl 1)] Theg described process to simulate a three-qubit gate is ′ ′ − − i v i v i i v − l+2 l+2Xl+2 l+2 l lX+1 l 1 implemented as the function “applyU8” in ZKCM QC. We Θ(il,il+1,il+2, vl 1, vl+2)Θ∗(il,il+1,i′l+2, vl 1, v′l+2) will see an example to use the function in the following. × − −

Vl (vl )Vl (v′l ) il Φv i′l Φv′ Appendix A.1. Example × +2 +2 +2 +2 | +2i| l+2 ih +2|h l+2 | Here is an example program to use a three-qubit oper- = bi v i′ v′ il Φv i′l Φv′ l+2 l+2 l+2 l+2 | +2i| l+2 ih +2|h l+2 | ation. It simulates the simplest case of Grover’s quantum i v i′ v′ l+2 l+2Xl+2 l+2 search [38]. One may also see the simulation performed in Sec. 4.3 for another example. with bi v i′ v′ = . The matrix l+2 l+2 l+2 l+2 ilil+1vl−1 ··· For a brief sketch of the quantum search, suppose that l+2,...,n 1 n ρ − is a (2ml+2) (2hPml+2) matrix.i This is now di- there is an oracle function f : 0, 1 0, 1 that has r × 2 { } →{ } agonalized to achieve the eigenvalues λvl+1 = [Vl+1(vl+1)] solutions w (namely, strings w such that f(w) = 1). Clas- l+2,...,n 1 sical search for finding a solution takes O(2n/r) queries. In and the corresponding eigenvectors Φvl+1 − under the | e gi contrast, the Grover search finds a solution in O( 2n/r) basis il Φv . The values of Vl (vl ) are immedi- {| +2i| l+2 i} +1 +1 queries. In particular when n = 4 and r = 1, a single query ately found. We may truncate out negligiblye small eigen- p is enough for the Grover search. For example, consider the values to reduce m]l (the updatedg value of ml ). In +1 +1 parent data set 00, 01, 10, 11 and the solution 01 for a addition, we have vector elements C (i , v , v ) of l+2 l+2 l+1 l+2 certain oracle with{ a single oracle} bit. A single Grover just computed eigenvectors: 1 iteration R maps s = 2 ( 00 + 01 + 10 + 11 ) o l+2,...,n 1 l+3,...,n 1 | i | i | i | i | i |−i Φ − = Cl (il , vl , vl ) il Φ − , to 01 o, where R = UsUf with Uf = 1 2 01 01 | vl+1 i +2 +2 +1 +2 | +2i| vl+2 i | i|−i − | ih | and Us = 1 2 s s acts on the left side qubits; o = from which we can derive − | ih | |−i e ( 0 o 1 o)/√2 is a state of the oracle qubit (the right- | i − | i ] most qubit). Uf is a unitary operation corresponding to Ql+2(il+2, vl+1, vl+2)= Cl+2(il+2, vl+1, vl+2)/Vl+2(vl+2). f and Us is a so-called inversion-about-average operation. Thus we have obtained To implement Uf and Us, we utilize the fact that flipping l+2,...,n 1 ] o changes the phase factor 1, following a common im- Φ − = Ql+2(il+2, vl+1, vl+2) il+2 vl+2 . |−i ± | vl+1 i | i| i plementation [11]. Third, we are going to compute the tensor Here, we consider a very simple oracle structure, namely e ] that, it flips when the left side qubits are in 01 using Ql+1(il+1, vl, vl+1). By the definition of this tensor, |−io | i we have a very straight-forward circuit interpretation. Note that there is no realistic benefit to perform a search in such a 0,...,l l+2,...,n 1 Φ il Φ Ψ ] vl +1 vl+1 − case because we know the solution beforehand. Usually a Ql+1(il+1, vl, vl+1)= h |h |h | i Vl(vl)Vl+1(vl+1) search is performed because we cannot guess the solutions e e e by the circuit appearance (consider, e.g, an instance of a 1 0,...,l = Φv vl 1 il satisfiability problem [58]). The sample program to simu- e gh l | − i| i Vl(vl)Vl+1(vl+1) i i v − v l l+2Xl 1 l+2 late the Grover search for the present simple case is shown e l+2,...,n 1 in Listing 3. eΦv g − il+2 vl+2 Θ(il,il+1,il+2, vl 1, vl+2) . × h l+1 | i| i − We substitutee the equations #include "zkcm_qc.hpp" int main() 0,...,l { Φv = Ql(il, vl 1, vl) vl 1 il | l i ]− | − i| i zkcm_set_default_prec(256); l+2,...,n 1 mps M(3); ( Φvl+1 − = Ql+2(il+2, vl+1, vl+2) il+2 vl+2 |e fi | i| i 17 e //Prepare the state for the parent data set and function, there are functions for typical three-qubit gates. //the initial state of the oracle qubit. They are briefly mentioned below. M.applyU(tensor2tools::Hadamard, 0); M.applyU(tensor2tools::Hadamard, 1); M.applyU(tensor2tools::PauliX, 2); Appendix A.2. Typical three-qubit gates M.applyU(tensor2tools::Hadamard, 2); The first typical three-qubit gate is the CCNOT gate already introduced in Sec. 4.3, which flips a sin- //Oracle function with the solution 01. gle target qubit t under the condition that two con- zkcm_matrix U_f(8, 8); trol qubits a and b are in 11 . ZKCM QC has U_f.set_to_identity(); | i U_f(2, 2) = U_f(3, 3) = 0; a particular member function (of class “mps”) for it: U_f(2, 3) = U_f(3, 2) = 1; “mps & mps::CCNOT (int a, int b, int t);”. The second typical one is the controlled-SWAP (CSWAP) //Inversion-about-average operation zkcm_matrix U_s(8, 8); gate, which swaps two target qubits p and q un- U_s.set_to_identity(); der the condition that the control qubit c is in 1 . U_s(0, 0) = U_s(1, 1) = 0; The member function corresponding to this gate| isi U_s(0, 1) = U_s(1, 0) = 1; mps & mps::CSWAP (int c, int p, int q); zkcm_matrix H = zkcm_matrix("[1, 1; 1, -1]") “ ”. / sqrt(zkcm_class(2)); In summary of this appendix, the process to simulate a zkcm_matrix I = zkcm_matrix("[1, 0; 0, 1]"); three-qubit gate operation in a time-dependent MPS simu- H = tensorprod(tensorprod(H, H), I); lation has been theoretically described. A sample program U_s = H * U_s * H;//This is 1-2|s>