Quick viewing(Text Mode)

Arxiv:1801.07686V4 [Quant-Ph] 2 Jun 2019

Arxiv:1801.07686V4 [Quant-Ph] 2 Jun 2019

A generative modeling approach for benchmarking and training shallow circuits

Marcello Benedetti,1, 2 Delfina Garcia-Pintos,3 Oscar Perdomo,3, 4, 5 Vicente Leyton-Ortega,3, 4 Yunseong Nam,6 and Alejandro Perdomo-Ortiz1, 3, 4, 7, 8, ∗ 1Department of Computer Science, University College London, WC1E 6BT London, UK 2Cambridge Limited, CB2 1UB Cambridge, UK 3Qubitera, LLC., Mountain View, CA 94041, USA 4Rigetti Computing, 2919 Seventh Street, Berkeley, CA 94710-2704, USA 5Department of Mathematics, Central Connecticut State University, New Britain, CT 06050, USA 6IonQ, Inc., College Park, MD 20740, USA 7Quantum Artificial Intelligence Lab., NASA Ames Research Center, Moffett Field, CA 94035, USA 8Zapata Computing Inc., 439 University Avenue, Office 535, Toronto, ON M5G 1Y8, Canada (Dated: June 2, 2019) Hybrid quantum-classical algorithms provide ways to use noisy intermediate-scale quantum com- puters for practical applications. Expanding the portfolio of such techniques, we propose a learning algorithm that can be used to assist the characterization of quantum devices and to train shallow circuits for generative tasks. The procedure leverages quantum hardware capabilities to its fullest extent by using native gates and their connectivity. We demonstrate that our approach can learn an optimal preparation of the Greenberger-Horne-Zeilinger states, also known as “cat states”. We further demonstrate that our approach can efficiently prepare approximate rep- resentations of coherent thermal states, wave functions that encode Boltzmann probabilities in their amplitudes. Finally, complementing proposals to characterize the power or usefulness of near-term quantum devices, such as IBM’s quantum volume, we provide a new hardware-independent metric called the qBAS score. It is based on the performance yield in a specific sampling task on one of the canonical data sets known as Bars and Stripes. We show how entanglement is a key ingredient in encoding the patterns of this data set; an ideal benchmark for testing hardware starting at four and up. We provide experimental results and evaluation of this metric to probe the trade off between several architectural circuit designs and circuit depths on an ion-trap quantum computer. Keywords: quantum-assisted machine learning, generative models, quantum circuit learning

I. INTRODUCTION tering synthetic data [10]. Other successful hybrid ap- proaches based on genetic algorithms were proposed for What is a good metric for the computational power approximating quantum adders and training quantum of noisy intermediate-scale quantum [1] (NISQ) devices? autoencoders [11–13]. In all these examples, there is a Can machine learning (ML) provide ways to benchmark clear-cut objective function describing the cost associated the power and usefulness of NISQ devices? How can with each candidate solution. The task is then to opti- we capture the performance scaling of these devices as mize it exploiting both quantum and classical resources. a function of circuit depth, gate fidelity, and qubit con- Although optimization tasks offer a great niche of ap- nectivity? In this work, we design a hybrid quantum- plications, probabilistic tasks involving sampling have classical framework called data-driven quantum circuit most of the potential to prove quantum advantage in learning (DDQCL) and address these questions through the near-term [14–16]. For example, learning probabilis- simulations and experiments. tic generative models is in many cases an intractable arXiv:1801.07686v4 [quant-ph] 2 Jun 2019 Hybrid quantum-classical algorithms, such as the vari- task, and quantum-assisted algorithms have been pro- ational quantum eigensolver [2,3] (VQE) and the quan- posed for both gate-based [17, 18] and tum approximate optimization algorithm [4,5] (QAOA), devices [19–23]. Differently from optimization, the learn- provide ways to use NISQ computers for practical ap- ing of a generative model is not always described by a plications. For example, VQE is often used in quan- clear-cut objective function. All we are given as input is tum chemistry when searching the of the a data set, and there are several cost functions that could electronic Hamiltonian of a molecule [6–8]. QAOA is be used as a guide, each one with their own advantages, used in combinatorial optimization to find approximate disadvantages, and assumptions. solutions of a classical Ising model [9] and it has been Here we present a hybrid quantum-classical approach demonstrated on a 19-qubit device for the task of clus- for generative modeling on gate-based NISQ devices which heavily relies on sampling. We use the 2N am- plitudes of the obtained from a N-qubit quantum circuit to construct and capture the correla- ∗ Correspondence: [email protected] tions observed in a data set. As Born’s rule determines 2 the output probabilities, this model belongs to the class II. RESULTS of models called Born machines [24]. Previous imple- mentations of Born machines [25–28] often relied on the The learning pipeline construction of tensor networks and their efficient ma- nipulation through graphical-processing units. Our work In this Section, we present a hybrid quantum-classical differs in that Born’s rule is naturally implemented by a algorithm for the unsupervised machine learning task quantum circuit executed on a NISQ hardware. Follow- of approximating an unknown ing the notation of subsequent work [29], we refer to our from data. This task is also known as generative model- quantum circuit Born machine generative model as a . ing. First, we describe the data and the model, then we present a training method. By employing the quantum circuit itself as a model for The data set = (x(1), , x(D)) is a collection of D the data set, we also differentiate from quantum algo- independent andD identically··· distributed random vectors. rithms that target specific probability distributions. For The underlying probabilities are unknown and the target example, in Ref. [30] the authors developed a hybrid al- is to create a model for such distribution. For simplicity, gorithm to approximately sample from a Boltzmann dis- we restrict our attention to N-dimensional binary vectors tribution. The samples are used to update a classical x(d) 1, +1 N , e.g. black and white images. This generative model, which requires running the hybrid al- gives∈ us {− an intuitive} one-to-one mapping between obser- gorithm for each training iteration. In contrast, our work vation vectors and the computational basis of an N-qubit does not assume a and therefore quantum system, that is x x = x1x2 xN . Note does not require a specific sampling beyond Born’s rule. that standard binary encodings↔ | i can| be used··· toi imple- ment integer, categorical, and approximate continuous Our work is also in contrast with other quantum- variables. assisted ML work (see e.g. Refs. [31–34]) requiring fault- Provided with the data set , our goal is now to tolerant quantum computers, which are not expected to obtain a good approximation toD the target probability be readily available in the near-term [35]. Instead, our distribution PD. A quantum circuit model with fixed circuits are carefully designed to exploit the full power depth and gate layout, parametrized by a vector θ, pre- of the underlying NISQ hardware without the need for a pares a wave function ψ(θ) from which probabilities are | i 2 compilation step. obtained according to Born’s rule Pθ(x) = x ψ(θ) . Following a standard approach from generative|h | ma-i| On the benchmarking of NISQ devices, quantum vol- chine learning [39], we can minimize the Kullback- ume [9, 36] has been proposed as an architecture-neutral Leibler (KL) divergence [40] DKL[PD Pθ] from the cir- metric. It is based on the task of approximating a specific cuit probability distribution in the computational| ba- class of random circuits and estimating the associated ef- sis Pθ to the target probability distribution PD. Min- fective error rate. This is very general and it is indeed imization of this quantity is directly related to the min- useful for estimating the computational power of a quan- imization of a well known cost function: the negative tum computer. In this paper, we propose the qBAS score, 1 PD (d) log-likelihood (θ) = D d=1 ln(Pθ(x )). However, a complementary metric designed for benchmarking hy- there is a caveat;C as− probabilities are estimated from brid quantum-classical systems. The score is based on frequencies of a finite number of measurements, low- the generative modeling performance on a canonical syn- amplitude states could lead to incorrect assignements. (d) (d) thetic data set which is easy to generate, visualize, and For example, an estimate Pθ(x ) = 0 for some x in validate for sizes up to hundreds of qubits. Yet, imple- the data set would lead to infinite cost. To avoid singu- menting a shallow circuit that can uniformly sample such larities in the cost function, we use a simple variant data is hard; we will show that some candidate solutions D require large amount of entanglement. Hence, any mis- 1 X (d) (θ) = ln(max(, Pθ(x ))), (1) calibration or environmental noise will affect this single Cnll −D performance number, enabling comparison between dif- d=1 ferent devices or across different generations of the same where  > 0 is a small number to be chosen. Note that the device. Moreover, the score depends on the classical re- number of measurements needed to obtain an unbiased sources, hyper-parameters, and various design choices, estimate of the relevant probabilities may not scale favor- making it a good choice for the assessment of the hybrid ably with the number of qubits N. In the Supplementary system as a whole. Material we suggest alternative cost functions such as the moment matching error and the earth mover’s distance. Our design choices are based on the setup of existing After estimating the cost, we update the parameter ion trap architectures. We choose the ion trap because vector θ to further minimize the cost. This can in princi- of its full connectivity among qubits [37] which allows ple be done by any suitable classical optimizer. We used us to study several circuit layouts on the same device. a gradient-free algorithm called particle swarm optimiza- Our experiments are carried out on an ion trap quantum tion (PSO) [41, 42] as previously done in the context of computer hosted at the University of Maryland [38]. [7]. The algorithm iterates for a fixed 3 number of steps, or until a local minimum is reached and 1 Initialize circuit with random parameters ✓ =(✓1, , ✓L) the cost does not decrease. ··· We chose the layout of the model circuit to be of 0 the following form. Let us consider a general circuit | i (l,k) 0 parametrized by single qubit rotations θ and two- … i | i 1 2 l L (l) { } (✓ ) (✓ ) … (✓ ) (✓ ) qubit entangling rotations θij . The subscripts denote U U U … U qubits involved in the operation,{ }l denotes the layer num- Measurements 0 ber and k 1, 2, 3 denotes the rotation identifier. The | i 2 latter is needed∈ { as we} decompose an arbitrary single qubit rotation into three simpler rotations (see SectionIV for further details about some exceptions and potential sim- 4 Update ✓. Repeat 2 through 4 until convergence plifications depending on the native gates available in each specific hardware). Inspired by the gates readily 3 Estimate mismatch between data and quantum outcomes available in ion trap quantum computers, we use alter- Reference data to be learned Outcome from quantum circuit nating layers of arbitrary single qubit gates (odd layers) and Mølmer-Sørensen XX entangling gates [43–46] (even layers) as our model. All parameters are initialized at random. It is important to note that in our model the number of parameters is fixed and is independent of the size D of the data set. This means we can hope to obtain a good ap- proximation to the target distribution only if the model is flexible enough to capture its complexity. Increasing FIG. 1. General framework for data-driven quantum circuit the number of layers or changing the topology of the en- learning (DDQCL). Data vectors are interpreted as represen- tangling layer alter this flexibility, potentially improving tative samples from an unknown probability distribution and the quality of the approximation. However, we antici- the task is to model such distribution. The 2N amplitudes pate that such flexible models are more challenging to of the wave function resulting from an N-qubit quantum cir- optimize because of their larger number of parameters. cuit are used to capture the correlations observed in the data. Principled ways to choose the circuit layout and to reg- Training of the quantum circuit is achieved by successive up- ularize its parameters could significantly help in such a dates of the parameters θ, corresponding to specifications of case. single qubit operations and entangling gates. In this work, we use arbitrary single qubit rotations for the odd layers, and As summarized in Figure1, the learning algorithm iter- Mølmer-Sørensen XX gates for the even layers. At each iter- atively adjusts all the parameters to minimize the value of ation, measurements from the quantum circuit are collected the cost function. At any iteration the user-defined cost is and contrasted with the data through evaluation of a cost approximated using both samples from the data set and function which tracks the learning progress. measurements from the quantum hardware, hence the name data-driven quantum circuit learning (DDQCL). complex sampling task that requires a fair amount of entanglement. It takes into account the model circuit The qBAS score depth, gate fidelities, and any other architectural design aspects, such as the quantum hardware’s qubit-to-qubit Bars and stripes (BAS) [47] is a synthetic data set connectivity and native set of single and two-qubit gates. of images that has been widely used to study generative It also takes classical resources such as the choice of cost models for unsupervised machine learning. For n m function, optimizer, and hyper-parameters, into account. n m × pixels, there are NBAS(n,m) = 2 + 2 2 images be- Therefore, it can be used to benchmark the performance longing to BAS, and they can be efficiently− produced and of the components of the hybrid quantum-classical sys- visualized. The probability distribution is 1/NBAS(n,m) tem. for each pattern belonging to BAS(n, m), and zero for The qBAS(n, m) score is an instantiation of the F1 any other pattern. Figure2 on the top left panel shows score widely used in the context of information retrieval. patterns belonging to BAS(2, 2), while the top central The F1 score is defined as the harmonic mean of the pre- panel shows the remaining patterns. cision p and the recall r, i.e. F1 = 2pr/(p + r). The We use DDQCL to learn a circuit that encodes all the precision p indicates the ability to retrieve states which BAS patterns in the wave function of a . belong to the data set of interest [48]. In our context This also allows us to design the qBAS(n, m) score: a this is the number of measurements that belong to the task specific figure of merit to assess the performance of BAS(n, m) data set, divided by the total number of mea- shallow quantum circuits. In a single number, it cap- surements Nreads performed. The recall r is the capacity tures the model capacity of the circuit layout and in- of the model to retrieve the whole spectrum of patterns trinsic hardware strengths and limitations in solving a belonging to the desired data set. In our case, if we de- 4

BAS patterns Non-BAS pa�erns 1 2 1 2 4 3 4 3

1 2 1 2 4 3 4 3

1 2 1 2 1 2 4 3 4 3 4 3

FIG. 2. DDQCL on the BAS data set. The top left panel shows patterns that belong to BAS(2, 2) our quantum circuit is to generate. The top central panel shows undesired patterns. On the top right panel, we show a possible mapping of the 4 pixels to N = 4 qubits, and we show some of the qubit-to-qubit connectivity topologies that can be set up in entangling layer and natively implemented by the ion trap quantum computer (e.g chain, star, and all). The bottom left panel shows the results of DDQCL simulations of shallow circuits with different topologies. We show the bootstrapped median and 90% confidence interval over the distribution of medians of the KL divergence as learning progresses for 100 iterations. The mean-field-like circuit L = 1 (green crosses) severely underperforms. A significant improvement is obtained with L = 2, where most of the parameters for XX gates have been learned to their maximum entangling value. These observations indicate that entanglement is a key resource for learning the BAS data set. Note that for L = 2 the choice of topology becomes a key factor for improving the performance. The chain topology (purple squares) performs slightly better than the star topology (red stars) even though they have the same number of parameters. The all-to-all topology (orange circles) significantly outperform all the others as it has more expressive power. The bottom central image extends the previous analysis to deeper circuits with L = 4 and approximatively twice the number of parameters. All the topologies achieve a lower median KL divergence and the confidence intervals shrink. The bottom right panel shows the bootstrapped mean qBAS(2, 2) score and 95% confidence interval for simulations (green bars) and experiments on the ion trap quantum computer hosted at University of Maryland (pink bars).

note the number of unique patterns that were measured it is important to fix Nreads to a reasonable value such as d(N ), then r = d(N )/N . To score that r 1.0 under the assumption of the model distri- reads reads BAS(n,m) ≈ high (F1 1.0), both high precision (p 1.0) and high bution being equal to the target distribution PBAS(n,m), recall (r ≈1.0) are required. ≈ but not so large as to make the score insensitive to de- ≈ viations from the target distribution. Assuming a per- The F score is a useful measure for the quality of in- 1 fect model distribution PBAS(n,m) = 1/NBAS(n,m), the formation retrieval and classification algorithms, but for expected number of measurements to obtain a value of our purposes it has a caveat: the dependence of r on r = 1.0 can be estimated using the famous coupon collec- the total number of measurements. As an example, con- tor’s problem. For our purposes here, we set Nreads to be sider a model that generates only BAS patterns, i.e. its equal to the expected number of samples that need to be precision is 1.0, but with highly heterogeneous distribu- drawn to collect all the NBAS(n,m) patterns (“coupons”). tion. If some of the BAS patterns have infinitesimally That is, Nreads = NBAS(n,m)HN , where Hk is the small probability, we can still push the recall to 1.0 by BAS(n,m) k-th harmonic number. Computed values of Nreads are taking a large number of measurements N . reads → ∞ provided in TableI for different values of n and m up This is not desirable since our purpose is to evaluate cir- to 100 qubits. As shown in the table, the number of cuits on the task of uniformly sampling all the patterns readouts required to determine qBAS(n, m) are within from BAS(n, m). Therefore, to define a unique score 5 experimental capabilities of current NISQ devices [49]. π 0 0 Rx( ) For statistical robustness, we recommend as a good | i | i 2 0 π practice to perform R repetitions of the Nreads measure- π 0 Rx( 2 ) π | i GMS2n( 2 ) | i GMS2n+1( ) . . . 2 . ments leading to R independent estimates of the recall . . . . (each denoted as ri). For estimating the precision p, all π 0 0 Rx( ) the samples collected should be used to robustly estimate | i | i 2 this quantity. Using this value of p one can compute R independent values of the qBAS(n, m) score from each FIG. 3. GHZ-like state preparation assisted by DDQCL. Left of the ri. These are subsequently bootstrapped to ob- (right) panel shows a recipe obtained by a human expert as- tain a more robust average for the final reported value of sisted by DDQCL for the even (odd) cat state preparation. qBAS(n, m) (see details in SectionIV). Rx stands for the single qubit rotation about the x axis. GMS We note that a more general performance indicator stands for a global Mølmer-Sørensen gate [46] acting on all the N = 2n (N = 2n + 1) qubits and is equivalent to the appli- than qBAS(n, m) score is indeed the KL divergence, cation of local XX gates to all N(N − 1)/2 pairs of qubits. DKL[PBAS(n,m) Pθ]. However, this would not be robust All parameters attained values very close to π . The human in terms of scalability;| as n m becomes large, it is ex- 2 × expert rounded the parameters to some precision and found pected that the KL divergence is frequently undefined these patterns. [50]. This is true when measurements yield distributions (d) (d) such that Pθ(x ) = 0 for any of the x in BAS(n, m). In all these cases, the qBAS score can still be computed tions followed by an entangling layer with the all-to-all and the number of measurements Nreads necessary for ob- topology, DDQCL yielded many degenerate preparations taining a robust estimate continues to remain relatively of GHZ-like states differing only by a relative phase. small to be practical for intermediate size n m. × In particular, we first ran particle swarm optimization on 25 random initializations for 3, 4, 5 and 6-qubit in- stances. Then, a human expert inspected the best set of Experiments parameters learned for each size and after rounding the parameters to some precision, spotted a clear pattern. In- We investigated three different examples, namely, stances of 3 and 5 qubits yielded a recipe, while instances GHZ state preparation, coherent thermal state prepa- of 4 and 6 qubit yielded another recipe. The recipes ob- ration, and BAS(2, 2). We implemented each example tained are summarized in Figure3 and were verified for of DDQCL using both numerical simulations and exper- larger number of qubits, both odd and even. Indeed, iments. We explored the parameter space of DDQCL DDQCL successfully reproduced the recipes previously by varying the qubit-to-qubit connectivity topology (see used on ion trap quantum computers [55], and, to the top-right panel of Figure2) and the number of layers. We best of our knowledge, they correspond to the most com- evaluated the performance of each instance by using the pact and efficient protocols for GHZ state preparation KL divergence from the circuit probability distribution using XX gates (see Figure3). Another commonly used in the computational basis to the target probability dis- approach consists of cascading entangling gates, with al- tribution. While explicitly computing the KL divergence ternations of single qubit rotations [54]. DDQCL pro- is generally intractable, due to its demanding resource re- duced approximate recipes of this kind in some of the quirement as the size of instances to consider increases, test cases for 3 and 4 qubits when using a single entan- we were able to compute this quantity explicitly for all gling layer with chain topology. cases considered in this paper. It is interesting to note that in DDQCL all the param- eters are learned independently and not constrained to be the same. As shown in Figure3, the learning process GHZ state preparation unveiled that these converge to the same value. This is not necessarily the case for the other data sets consid- To test the capabilities of DDQCL, we started with ered below. We also note that the simulations assumed the preparation of GHZ states, also known as “cat noiseless hardware, making the analysis of parameters states” [51]. Besides their importance in quantum in- easier for the human expert. It would be much more formation, the choice is motivated by their simple de- difficult to analyse parameters found with noisy hard- scription and by the availability of many studies about ware, as DDQCL can learn to compensate certain types their preparation (see e.g. Refs. [52–54]). From the of noise, e.g., systematic parameter offsets, in non-trivial DDQCL perspective, we explored whether it is possible ways. The upside is that learning can be successful even to learn any of the known recipes for GHZ state prepa- in the presence of such systematic errors. ration starting only from classical data. More specifi- Finally, it is reassuring that DDQCL obtained circuits cally, the input data consists of samples from a distri- for cat state preparation starting from samples of a clas- bution corresponding to the two desired computational sical distribution. While this target distribution could basis states; P0 = 0.5 for the state 0 ... 0 and P1 = 0.5 have been modeled by a zero-temperature ferromagnet, for the state 1 ... 1 . Using a layer| of singlei qubit rota- there apparently was no other way for our circuit to re- | i 6 produce such a distribution, if not by preparing a GHZ- could achieve higher accuracy by requiring more compu- like state. One can obtain more general solutions by al- tational resources than the inverse Bethe approximation lowing DDQCL to prepare mixed states. For example, used here. A thorough comparison is beyond the scope consider a circuit acting on both the main qubit register of this work. and an additional ancilla register. By tracing out the ancilla register, e.g. ignoring it during measurement, the main register can implement a mixed state and can be BAS(2,2) trained to simulate a zero-temperature ferromagnet. An- other example which does not resort to an ancilla register For the purposes of benchmarking and measuring the is to use decoherence as a mechanism to prepare mixed power of NISQ devices with DDQCL, it is insufficient states that explain the data. to have an easy-to-generate target data set; we also re- quire the data set to represent a useful quantum state in quantum computing, while simultaneously proving to Coherent thermal states be sufficiently challenging for the quantum computer to generate. Because of the importance of entanglement Thermal states play an important role in statistical in processing, we considered the physics, quantum information, and machine learning. entanglement entropy averaged over all two-qubit sub- Using DDQCL, we trained quantum circuits with dif- sets [59] as a proxy measure of a specific quantum state’s ferent number of layers L 1, 2, 3 and using all-to-all usefulness for benchmarking purposes. We start by not- topology, to approximate∈ a {target} Boltzmann distribu- ing that the four-qubit cat state, whose rich entangled tion. In particular, we considered data sets sampled from nature makes it ideal for studying decoherence and de- the Boltzmann distribution of 25 synthetic instances with cay of quantum information [52, 54], has entanglement N = 5 qubits. By decreasing the temperature T of the entropy SGHZ = 1. Now consider states that encode target distribution, we can increase the difficulty of the BAS(2, 2) in the computational basis. The minimum learning task. Figure4 shows the bootstrapped median value of entanglement entropy that any such state can and 90% confidence interval over the distribution of me- have is SBAS(2,2) = 1.25163. Furthermore, the maxi- dians of the KL divergence during learning. Deeper cir- mum value that a quantum representation of BAS(2, 2) cuits such as L = 3 (purple pentagons) consistently out- can reach is SBAS(2,2) = 1.79248, which happens to be performed shallower circuits such as L = 2 (red circles) the maximum entanglement entropy known for any four- and L = 1 (yellow triangles). This became more evident qubit state [59] (see also Figure9 and the corresponding as we went from easy learning tasks (Figure4 (a)) to Section in the Supplementary Material). hard learning tasks (Figure4 (c)). Results for instances We also note that one of the quantum representa- of N = 6 qubits are shown in Figure6 in the Supplemen- tions of BAS(2, 2) found by DDQCL reached a remark- tary Material. able value of SBAS(2,2) = 1.69989 (see Figure7 in the To assess how well DDQCL performs on the generative Supplementary Material). This shows the power of our task, we compared DDQCL to the inverse Bethe approx- framework, in that DDQCL is capable of handling useful imation [56] (see also Eqs. (3.21) and (3.22) in Ref. [57]), quantum states that are rich in entanglement. This is a classical closed-form approach widely used in statisti- an important observation, since we know, based on our cal physics to solve the inverse Ising problem. As shown empirical results, that (i) single layer circuits with no in Figure4, the inverse Bethe approximation (green bar) entangling gates severely underperform in producing the performed extremely well in the easy task (a), matched output state probability distribution that is close to the the L = 3 quantum circuit in the intermediate task (b), target data set, and (ii) when inspecting the parameters and underperformed on the difficult task (c). The latter learned for circuits with all-to-all topology with L = 2 observation comes from the fact that the median perfor- layers, we found that most of the XX gates reached their mance of the inverse Bethe approximation has very large maximum entangling setting. confidence intervals. We emphasize that this is not a We now discuss results for the qBAS(2, 2) score, which form of as the two methods are fun- we computed experimentally and theoretically in order damentally different. DDQCL prepares a quantum state to compare the entangling topologies sketched in the top without the assumption of an underlying Boltzmann dis- right panel of Figure2 and for different number of layers. tribution, while the inverse Bethe approximation infers The process consists of two steps; first, DDQCL is used the parameters with such assumption. Furthermore, the to encode BAS(2, 2) in the wave function of the quantum error in the inverse Bethe approximation is expected to state. Second, the best circuits, i.e. those with lowest go to zero with system size, and only above the reference cost, are compared using the qBAS(2, 2) score. temperature Tc (see SectionIV for details). Thus, it is The bottom left and bottom central panels in Figure2 not surprising that we obtained bad performance in Fig- show the bootstrapped median of the KL divergence and ure4 (c) with the inverse Bethe approximation. Other 90% confidence interval over 25 random initializations of classical methods based on machine learning and Markov DDQCL in silico. The all-to-all topology (orange cir- Chain Monte Carlo, such as the [58], cles) always outperforms sparse topologies (red stars and 7

(a) T = 2 TC (b) T = TC (c) T = TC /1.5

FIG. 4. DDQCL preparation of coherent thermal states. We generated 25 random instances of size N = 5 and varied the difficulty of the learning task by decreasing the temperature in T ∈ {2Tc,Tc,Tc/1.5} where Tc is the reference temperature (see SectionIV for details). The model is a quantum circuit with five qubits and an all-to-all qubit connectivity for the entangling layer. We show the bootstrapped median and 90% confidence interval over the distribution of medians of the KL divergence of DDQCL as learning progresses for 50 iterations. (a) When T > Tc, the learning task is easy and shallow quantum circuits such as L = 1 (yellow triangles) and L = 2 (red circles) perform very well. (b) When T ≈ Tc, a gap in performance between circuits of different depth becomes evident. (c) When T < Tc, the learning task becomes hard and deeper circuits perform much better than shallow ones. We also report results for the inverse Bethe approximation, which does not actually prepare a state, but produces a classical model in closed-form. The classical model so obtained (green band) is excellent for the easy task in (a), matches the best quantum model in (b), and underperforms for the hard task in (c). purple squares). However, deeper circuits do not always Although we compared circuits implemented on the provide significant improvements, as it is the case for all- same ion trap hardware, the score may be used to com- to-all L = 4 (dark green circles) versus all-to-all L = 2 pare different device generations or even completely dif- (orange circles). A possible explanation is that, when ferent architectures (e.g. superconductor-based versus going from two to four layers, we approximately double atomic-based). Similarly, one may use the score to com- the number of parameters, and particle swarm optimiza- pare classical resources of the hybrid system (e.g. differ- tion struggles to find enhanced local optima. Another ent optimizers). plausible explanation is that for this small data set, the all-to-all circuits with L = 2 are already close to optimal performance (we show supporting evidence in Figure7 III. DISCUSSION in the Supplementary Material).

As the best performing circuits to compare using the Data is an essential ingredient of any machine learning qBAS score, we chose all-to-all L = 2 and star L 2, 4 task. In this work, we presented a data-driven quantum circuits. While they represent very different approximate∈ { } circuit learning algorithm (DDQCL) as a framework that solutions to the same problem, they may be compared can assist in the characterization of NISQ devices and to with the help of qBAS(2, 2) score. For each setting, we implement simple generative models. The success of this computed 25 scores from batches of size Nreads = 15 sam- approach is evidenced by the results on three different ples, as described in SectionII. The bottom right panel data sets. in Figure2 shows the bootstrapped mean qBAS(2 , 2) To summarize, first, we learned a GHZ state prepara- score and 95% confidence interval for simulations (green tion recipe for an ion trap quantum computer. Minimal bars) and experiments on the ion trap quantum com- intervention by a human expert allowed to generalize the puter hosted at University of Maryland (pink bars). The recipe to any number of qubits. This is not an example score is sensitive to the depth of the circuit as shown of compilation, but rather an illustration of how simple by the performance improvement of L = 4 compared to classical probability distributions can guide the synthesis L = 2 in the star topology. Note that the theoretical of interesting non-trivial quantum states. Depending on improvement for using L = 4 is larger than that ob- the level or type of noise in the system, the same algo- served experimentally in the ion trap. This is because rithm could lead to a different circuit fulfilling the same the quantum computer accumulated errors while execut- probability distribution as that of the data. The message ing the deeper circuit. The score is also sensitive to the here is that machine learning can teach us that “there is choice of topology as shown by the drop in performance more than one way to skin a cat (state)”. of star compared to all-to-all when the same number of Second, we trained circuits to prepare approximations layers L = 2 is used. of thermal states. This illustrates the power of Born 8 machines [24] to approximate Boltzmann machines [58] future work. when the data require thermal-like features. Our approach has the bidirectional capability of using Finally, tapping into the real power of near-term quan- NISQ devices for machine learning, and machine learning tum devices and approximate algorithms implementable for the characterization of NISQ devices. We hope the on them, we designed a task-specific performance esti- ideas presented here contribute to the development of mator based on a canonical data set. The bars and further concrete metrics to help guide the architectural stripes data is easy to generate, visualize and verify clas- hardware design, while tapping into the computational sically, while modeling it still requires significant quan- power of NISQ devices. tum resources in the form of entanglement. Errors in the device will affect this single performance measure, the qBAS score, which can be used to compare different IV. METHODS device generations, or completely different architectures. The qBAS score can also be used to benchmark the typ- ical performance of optimizers used in hybrid quantum- classical systems. Selecting the method and optimiz- ing the hyper-parameters can be a daunting task and Simulation of quantum circuits in silico is a key challenge towards a successful implementation as the number of qubits increases. Therefore, having this We simulated quantum circuits using the QuTiP2 [65] unique metric for benchmarking could help reduce the Python library and implemented the constraints dictated complexity of this fine-tuning stage. The score can be by the ion trap experimental setting. In the current ex- computed in any of the NISQ architectures available to perimental setup, we can perform arbitrary single qubit date. rotations and Mølmer-Sørensen XX entangling gates in- DDQCL is a modular framework and its performance volving any two qubits. We used only these gates hence will ultimately depend on the choices made for such mod- avoiding the need of further compilation. For the simula- ules. In this article we explored the impact of circuit lay- tions in silico, we also assume perfect gate fidelities and out and cost function, while subsequent work has anal- error-free measurements. ysed other modules and suggested extensions to the al- In the ion trap setting, the implementation of single gorithm. In Ref. [29] the authors trained Born machines qubit rotations Rz is very convenient. Therefore, we using a differentiable cost function and exploiting gradi- perform arbitrary single qubit operations relying on the ent calculations proposed in Ref. [60]. In Refs. [61, 62] (l) (l,3) (l,2) (l,1) decomposition Ui = Rz(θi )Rx(θi )Rz(θi ), the authors compared several optimizers, and in Ref. [63] where l is the layer number, i is the qubit in- the authors focused on the impact of hardware noise. (l,k) dex, and θ [ π, +π] are Euler angles. The The expressive power of shallow circuits was investigated i rotations are then∈ − expressed as exponentials of in Ref. [64] and it was shown to outweigh that of some (l,·) i (l,·) z classes of artificial neural networks. Finally, recent work Pauli operators Rz(θi ) = exp( 2 θi σi ) and (l,·) i (l,·) x − has shown successfull implementation of DDQCL on the Rx(θi ) = exp( 2 θi σi ). IBM Q 20 Tokyo processor [63], on a five-qubit ion-trap Because we execute− circuits always starting from the hosted at University of Maryland [61], and on the Rigetti 0 0 state, the first set of Rz rotations would have no 16Q Aspen-1 processor [62]. |effect··· and,i therefore, is not needed. When an odd num- It is left to future work to demonstrate more realistic ber of layers is used, a similar exception occurs in the last machine learning by allowing more flexible models and layer. There, the last set of Rz rotations would only add a employing regularization. At a finite and fixed low cir- phase that becomes irrelevant when taking the amplitude cuit depth, the power of the generative model can be en- squared required for the Born machine. In other words, hanced by including ancilla qubits, in analogy to the role we can slightly reduce the number of parameter without of hidden units in probabilistic graphical models. Regu- changing the expressive power of the circuit. Every other larization can be included in the cost function. In this layer of arbitrary single qubit operations would in gen- paper, we used the negative log-likelihood which quickly eral require 3N parameters, where N is the number of becomes expensive to estimate as the size of the system qubits. By using an alternative decomposition, namely increases. We reported preliminary results on alternative U = RxRzRx, we could apply commutation rules with cost-functions that overcome the caveat and still produce XX gates and obtain a reduction to 2N parameters in satisfactory results. Layer-wise pre-training of the quan- all odd layers. We decided not to do the former step tum circuit inspired by [39] could initialize for two reasons. First, there is no effective reduction of parameters to near-optimal locations in the cost land- the number of parameters for experiments up to L = 5 scape. Finally, DDQCL could be generalized to learn layers considered here. Second, in the ion trap quantum quantum distributions or states, assuming experimental computer used here [38] it is experimentally convenient data coming from quantum experiments, e.g. quantum to use a larger number of Rz rather than Rx rotations. measurements beyond the computational basis. We think For the case of the entangling gates, we use the nota- (l) (l) these are the most promising directions to be explored in tion Uij = XX(θij ), which in exponential form reads as 9

(l) i (l) x x XX(θij ) = exp( 2 θij σi σj ). Recalling that states that 1000 data points sampled exactly from these distribu- differ by a global− phase are indistinguishable, a direct tions. computation shows that the tunable parameters can be (l) taken as θij [ π, +π]. Also, there is no need to set up an order for∈ these− gates within an entangling layer as GHZ state preparation they commute with one another. The total number of parameters per entangling layer The zero-temperature ferromagnet distribution is depends on the chosen topology: all is a fully-connected equivalent to assigning 1/2 probability to both 0 ... 0 | i graph and has N(N 1)/2 parameters, chain is a one- and 1 ... 1 states of the computational basis. This dis- | i dimensional nearest neighbor− graph with N 1 parame- tribution can be easily prepared as a mixed state, but ters, and star is a star-shaped graph with N− 1 param- our study uses pure states prepared by the circuit. The eters. The top right panel of Figure2 shows a− graphical only way to reproduce the zero-temperature ferromag- representation of these topologies for the case of N = 4 net distribution in our setting is to implement a unitary qubits. transformation that prepares a GHZ-like state. When executing DDQCL, we always estimated the re- quired quantities from 1000 measurements in the compu- tational basis. Thermal states

A thermal data set in N dimensions is generated by ex- N Gradient-free optimization act sampling realizations of x 1, +1 from the dis- −1 P∈ {− } P −1 tribution P (x) = Z exp(( ij Jijxixj + i hixi)T ) where Z is the normalization constant, J and h are Once the number of layers and topology of entangling ij i random coefficients sampled from a normal distribution gates is fixed, the quantum circuits described above pro- with zero mean and √N standard deviation, and T is vide a template; by adjusting the parameters we can the temperature. In the large system-size limit, a phase implement a small subset of the unitaries that are in transition is expected at T 1. Although this is not principle allowed in the Hilbert space. The variational c true for the small-sized systems≈ considered here, we take approach aims at finding the best parameters by mini- this value as a reference temperature. In our study, we mizing a cost function. For all our tests, we choose to vary T 2T ,T ,T /1.5 in order to generate increas- minimize a clipped version of the negative log-likelihood. c c c ingly complex∈ { instances. } We use a global-best particle swarm optimization al- gorithm [42] implemented in the PySwarms [66] Python library. A ‘particle’ corresponds to a candidate solution Bars and Stripes circuit; the position of a particle is a point θ in parameter space, the velocity is a vector determining how to update the position in parameter space. Position and velocity of BAS [47] is a canonical machine learning data set for testing generative models. It consists of n m pixel pic- all the particles are initialized at random and updated at × each iteration following the schema shown in Figure1. tures generated by setting each row (or column) to either black ( 1) or white (+1), at random. In generative mod- There are three hyper-parameters controlling the swarm − dynamics: a cognition coefficient c , a social coefficient eling, a handful of patterns are input to the algorithm 1 and the target is to train a model to capture correlations c2 and an inertia coefficient w. After testing a grid of values, we chose to use a constant value of 0.5 for all in the data. Assuming a successful training, the model three hyper-parameters, which we found to work well for can reconstruct and generate previously unseen patterns our purpose. To avoid large jumps in parameter space, from partial or corrupted data. On the other hand, if we all we further restrict position updates in each dimension to provide the algorithm with the patterns we are inter- a maximum magnitude of π. ested in and aim to a model that generates only those, this would amount to an associative memory. Although Finally, we set the number of particles to twice the both tasks can be done with our DDQCL pipeline, for number of parameters of the circuit. This is a conserva- the qBAS(n, m) score we focus on the latter task. tive value compared to previous work [8], also because of We now determine the number of patterns and pro- the large number of parameters in the circuits explored vide an easy identification of the bitstring belonging to here. the BAS(n, m) class. For the total count of the number of patterns, we first count the number of single stripes, dou- ble stripes, etc. that can fit into the n rows. This number Data sets details Pn n n is the sum of binomial coefficients k=0 k = 2 . The same expression holds for the number of patterns with We worked with three synthetic data sets: zero- bars that can be placed in the m columns, that is 2m. temperature ferromagnet, random thermal, and bars and Note that empty (all-white) and full (all-black) patterns stripes (BAS). In all our numerical experiments we use are counted in both the bars and the stripes. Therefore, 10 we obtain the total count for the BAS patterns by sub- tained error bars from two standard deviations, account- tracting the two extra patterns from this double count: ing for a 95% confidence interval.

N = 2n + 2m 2. (2) BAS(n,m) − DATA AVAILABILITY

In the main text, we use the BAS data set to design a All data needed to evaluate the conclusions are avail- task-specific performance indicator for hybrid quantum- able from the corresponding author upon request. classical systems. TableI shows the requirements for some values of n and m. ACKNOWLEDGEMENTS

(n, m) N N N qubits BAS(n,m) reads M.B. was supported by the UK Engineering and Phys- (2,2) 4 6 15 (2,3) 6 10 30 ical Sciences Research Council (EPSRC) and by Cam- (3,3) 9 14 46 bridge Quantum Computing Limited (CQCL). The au- (4,4) 16 30 120 thors are very grateful to Prof. Christopher Monroe and (7,7) 49 254 1554 his team at the University of Maryland (UMD) for their (8,8) 64 510 3475 support in running the experiments presented here. Spe- (10,10) 100 2046 16780 cial thanks to N.M. Linke, C. Figgatt, K.A. Landsman, and D. Zhu for useful discussions and for running the sev- TABLE I. Example of experimental requirements for near- eral experiments used to test and validate the pipeline of term quantum computers with up to 100 qubits. As described this work, and the ones presented in the Results section. in the main text, Nreads is the number of readouts required The authors acknowledge E. Edwards from the communi- for every estimation of the qBAS score. cations/publicity division at the Joint Quantum Institute at UMD for the rendering of the ion-trap graphic used in Figure2, and would like to thank J. Realpe-Gomez, A.M. Wilson, and G. Paz-Silva for useful discussions and feedback on an early version of this manuscript. The au- thors would like to thank J.I. Latorre for pointing out Bootstrapping analysis Ref. [59].

To obtain error bars for the KL divergence, we used AUTHOR CONTRIBUTIONS the following procedure. DDQCL was always executed 25 times with random initialization of the parameters. M.B. and A.P-O designed the generative model and the From the 25 repetitions, we sampled 10,000 data sets of learning algorithm. M.B., Y.N., and A.P-O designed the size 25 with replacement and computed the median KL data sets and the experiments. M.B. and V.L-O wrote divergence for each. From the distribution of 10,000 me- the code and performed the in silico experiments. D.G-P dians, we computed the median and obtained error bars wrote code for the analysis and figures. A.P-O designed from the 5-th and 95-th percentiles as the lower and up- the qBAS score. O.P. performed the analysis related to per limits, respectively, accounting for a 90% confidence entanglement entropies. All authors analyzed the exper- interval. imental results and contributed to the final version of the For the case of qBAS score, we did the following boot- manuscript. strap analysis. qBAS score was always computed 25 times from batches of samples Nreads. From the 25 rep- etitions, we sampled 10,000 data sets of size 25 with re- COMPETING INTERESTS placement and computed the mean for each. From the distribution of 10,000 means, we computed mean and ob- The authors declare no competing interests.

[1] John Preskill, “Quantum computing in the NISQ era and munications 5, 4213 EP – (2014). beyond,” Quantum 2, 79 (2018). [3] Jarrod R McClean, Jonathan Romero, Ryan Babbush, [2] Alberto Peruzzo, Jarrod McClean, Peter Shadbolt, Man- and Al´an Aspuru-Guzik, “The theory of variational Hong Yung, Xiao-Qi Zhou, Peter J. Love, Al´anAspuru- hybrid quantum-classical algorithms,” New Journal of Guzik, and Jeremy L. O’Brien, “A variational eigenvalue Physics 18, 023023 (2016). solver on a photonic quantum processor,” Nature Com- 11

[4] Sam Gutmann Edward Farhi, Jeffrey Goldstone, [16] Aram W. Harrow and Ashley Montanaro, “Quantum “A quantum approximate optimization algorithm,” computational supremacy,” Nature 549, 203–209 (2017). arXiv:1411.4028 (2014). [17] Krysta M. Svore Nathan Wiebe, Ashish Kapoor, “Quan- [5] Stuart Hadfield, Zhihui Wang, Bryan O’Gorman, tum deep learning,” arXiv:1412.3489 (2015). Eleanor G Rieffel, Davide Venturelli, and Rupak Biswas, [18] M´ariaKieferov´aand Nathan Wiebe, “Tomography and “From the quantum approximate optimization algorithm generative training with quantum boltzmann machines,” to a quantum alternating ansatz,” Algorithms Phys. Rev. A 96, 062327 (2017). 12, 34 (2019). [19] Marcello Benedetti, John Realpe-G´omez,Rupak Biswas, [6] PJJ O’Malley, Ryan Babbush, ID Kivlichan, Jonathan and Alejandro Perdomo-Ortiz, “Estimation of effective Romero, JR McClean, Rami Barends, Julian Kelly, Pe- temperatures in quantum annealers for sampling appli- dram Roushan, Andrew Tranter, Nan Ding, et al., “Scal- cations: A case study with possible applications in deep able quantum simulation of molecular energies,” Physical learning,” Phys. Rev. A 94, 022308 (2016). Review X 6, 031007 (2016). [20] Mohammad H Amin, Evgeny Andriyash, Jason Rolfe, [7] J. I. Colless, V. V. Ramasesh, D. Dahlen, M. S. Blok, Bohdan Kulchytskyy, and Roger Melko, “Quantum M. E. Kimchi-Schwartz, J. R. McClean, J. Carter, W. A. boltzmann machine,” Physical Review X 8, 021050 de Jong, and I. Siddiqi, “Computation of molecular spec- (2018). tra on a quantum processor with an error-resilient algo- [21] Marcello Benedetti, John Realpe-G´omez,Rupak Biswas, rithm,” Phys. Rev. X 8, 011021 (2018). and Alejandro Perdomo-Ortiz, “Quantum-assisted learn- [8] Abhinav Kandala, Antonio Mezzacapo, Kristan Temme, ing of hardware-embedded probabilistic graphical mod- Maika Takita, Markus Brink, Jerry M Chow, and els,” Phys. Rev. X 7, 041052 (2017). Jay M Gambetta, “Hardware-efficient variational quan- [22] Marcello Benedetti, John Realpe-G´omez, and Alejandro tum eigensolver for small molecules and quantum mag- Perdomo-Ortiz, “Quantum-assisted helmholtz machines: nets,” Nature 549, 242 (2017). A quantum–classical deep learning framework for indus- [9] Nikolaj Moll, Panagiotis Barkoutsos, Lev S Bishop, trial datasets in near-term devices,” Quantum Science Jerry M Chow, Andrew Cross, Daniel J Egger, Stefan Fil- and Technology 3, 034007 (2018). ipp, Andreas Fuhrer, Jay M Gambetta, Marc Ganzhorn, [23] Peter Wittek and Christian Gogolin, “Quantum en- Abhinav Kandala, Antonio Mezzacapo, Peter Mller, Wal- hanced inference in Markov logic networks,” Scientific ter Riess, Gian Salis, John Smolin, Ivano Tavernelli, Reports 7 (2017), 10.1038/srep45672. and Kristan Temme, “Quantum optimization using varia- [24] Song Cheng, Jing Chen, and Lei Wang, “Information tional algorithms on near-term quantum devices,” Quan- perspective to probabilistic modeling: Boltzmann ma- tum Science and Technology 3, 030503 (2018). chines versus born machines,” Entropy 20, 583 (2018). [10] J. S. Otterbach, R. Manenti, N. Alidoust, A. Bestwick, [25] Edwin Stoudenmire and David J Schwab, “Supervised M. Block, B. Bloom, S. Caldwell, N. Didier, E. Schuyler learning with tensor networks,” in Advances in Neural Fried, S. Hong, P. Karalekas, C. B. Osborn, A. Pa- Information Processing Systems 29 , edited by D. D. Lee, pageorge, E. C. Peterson, G. Prawiroatmodjo, N. Ru- M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett bin, C. A. Ryan, D. Scarabelli, M. Scheer, E. A. Sete, (Curran Associates, Inc., 2016) pp. 4799–4807. P. Sivarajah, R. S. Smith, A. Staley, N. Tezak, W. J. [26] Zhao-Yu Han, Jun Wang, Heng Fan, Lei Wang, and Zeng, A. Hudson, B. R. Johnson, M. Reagor, M. P. da Pan Zhang, “Unsupervised generative modeling using Silva, and C. Rigetti, “Unsupervised machine learn- matrix product states,” Physical Review X 8 (2018), ing on a hybrid quantum computer,” arXiv preprint 10.1103/physrevx.8.031012. arXiv:1712.05771 (2017). [27] Ding Liu, Shi-Ju Ran, Peter Wittek, Cheng Peng, [11] Jonathan Romero, Jonathan P Olson, and Alan Aspuru- Raul Bl´azquezGarc´ıa,Gang Su, and Maciej Lewenstein, Guzik, “Quantum autoencoders for efficient compression “Machine learning by two-dimensional hierarchical tensor of quantum data,” Quantum Sci. Technol. 2, 045001 networks: A quantum information theoretic perspective (2017). on deep architectures,” arXiv preprint arXiv:1710.04833 [12] Lucas Lamata, Unai Alvarez-Rodriguez, Jos´e Mart´ın- (2017). Guerrero, Mikel Sanz, and Enrique Solano, “Quan- [28] Xun Gao, Zhengyu Zhang, and Luming Duan, “A tum autoencoders via quantum adders with genetic al- learning algorithm based on genera- gorithms,” Quantum Science and Technology (2018), tive models,” Science Advances 4 (2018), 10.1126/sci- 10.1088/2058-9565/aae22b. adv.aat9004. [13] Rui Li, Unai Alvarez-Rodriguez, Lucas Lamata, and [29] Jin-Guo Liu and Lei Wang, “Differentiable learning of Enrique Solano, “Approximate quantum adders with ge- quantum circuit born machines,” Phys. Rev. A 98, netic algorithms: An IBM quantum experience,” Quan- 062324 (2018). tum Measurements and 4, 1–7 [30] Guillaume Verdon, Michael Broughton, and Jacob Bia- (2017). monte, “A to train neural networks [14] Edward Farhi and Aram W. Harrow, “Quantum using low-depth circuits,” arXiv:1712.05304 (2017). supremacy through the quantum approximate optimiza- [31] Seth Lloyd, Masoud Mohseni, and Patrick Reben- tion algorithm,” arXiv:1602.07674 (2016). trost, “Quantum principal component analysis,” Nature [15] Alejandro Perdomo-Ortiz, Marcello Benedetti, John Physics 10, 631–633 (2014). Realpe-G´omez, and Rupak Biswas, “Opportunities and [32] Iordanis Kerenidis and Anupam Prakash, “Quantum rec- challenges for quantum-assisted machine learning in ommendation systems,” arXiv preprint arXiv:1603.08675 near-term quantum computers,” Quantum Science and (2016). Technology 3, 030502 (2018). [33] Fernando G. S. L. Brandao, Amir Kalev, Tongyang Li, Cedric Yen-Yu Lin, Krysta M. Svore, and Xiaodi 12

Wu, “Exponential quantum speed-ups for semidefinite we can choose alternative scalable cost functions, as we programming with applications to quantum learning,” show in the Supplementary Material. A comprehensive arXiv:1710.02581 (2017). study is left for future work. [34] Maria Schuld, Mark Fingerhuth, and Francesco Petruc- [51] Daniel M. Greenberger, Michael A. Horne, Abner Shi- cione, “Implementing a distance-based classifier with a mony, and , “Bell’s theorem without in- quantum interference circuit,” EPL (Europhysics Let- equalities,” American Journal of Physics 58, 1131–1143 ters) 119, 60002 (2017). (1990). [35] Masoud Mohseni, Peter Read, , Sergio [52] Thomas Monz, Philipp Schindler, Julio T. Barreiro, Boixo, Vasil Denchev, Ryan Babbush, Austin Fowler, Michael Chwalla, Daniel Nigg, William A. Coish, Max- Vadim Smelyanskiy, and John Martinis, “Commercialize imilian Harlander, Wolfgang H¨ansel,Markus Hennrich, quantum technologies in five years,” Nature 543, 171–174 and Rainer Blatt, “14-qubit entanglement: Creation and (2017). ,” Phys. Rev. Lett. 106, 130506 (2011). [36] Lev S Bishop, Sergey Bravyi, Andrew Cross, Jay M Gam- [53] Andrea Rocchetto, Scott Aaronson, Simone Severini, betta, and John Smolin, “Quantum volume,” (2017). Gonzalo Carvacho, Davide Poderini, Iris Agresti, Marco [37] Norbert M Linke, Dmitri Maslov, Martin Roetteler, Bentivegna, and Fabio Sciarrino, “Experimental learn- Shantanu Debnath, Caroline Figgatt, Kevin A Lands- ing of quantum states,” Science Advances 5 (2019), man, Kenneth Wright, and Christopher Monroe, “Ex- 10.1126/sciadv.aau1946. perimental comparison of two quantum computing archi- [54] Asier Ozaeta and Peter L McMahon, “Decoherence of up tectures,” Proceedings of the National Academy of Sci- to 8-qubit entangled states in a 16-qubit superconducting ences , 201618020 (2017). quantum processor,” Quantum Science and Technology [38] Shantanu Debnath, Norbert M Linke, Caroline Figgatt, 4, 025015 (2019). Kevin A Landsman, Kevin Wright, and Christopher [55] Thomas Monz, Philipp Schindler, Julio T. Barreiro, Monroe, “Demonstration of a small programmable quan- Michael Chwalla, Daniel Nigg, William A. Coish, Max- tum computer with atomic qubits,” Nature 536, 63 EP imilian Harlander, Wolfgang H¨ansel,Markus Hennrich, – (2016). and Rainer Blatt, “14-qubit entanglement: Creation and [39] and Aaron Courville, coherence,” Phys. Rev. Lett. 106, 130506 (2011). “Deep learning,” (2016), MIT Press. [56] Federico Ricci-Tersenghi, “The bethe approximation for [40] S. Kullback and R. A. Leibler, “On information and suffi- solving the inverse ising problem: a comparison with ciency,” The Annals of Mathematical Statistics 22, 79–86 other inference methods,” Journal of Statistical Mechan- (1951). ics: Theory and Experiment 2012, P08015 (2012). [41] James Kennedy and Russell Eberhart, “Particle swarm [57] Iacopo Mastromatteo, “On the typical properties of in- optimization,” in Proceedings of ICNN’95 - International verse problems in statistical ,” arXiv preprint Conference on Neural Networks, Vol. 4 (1995) pp. 1942– arXiv:1311.0190 (2013). 1948 vol.4. [58] David H Ackley, Geoffrey E Hinton, and Terrence J Se- [42] Yuhui Shi and Russell Eberhart, “A modified particle jnowski, “A learning algorithm for boltzmann machines,” swarm optimizer,” in 1998 IEEE International Confer- Cognitive science 9, 147–169 (1985). ence on Evolutionary Computation Proceedings. IEEE [59] Atsushi Higuchi and Anthony W. Sudbery, “How entan- World Congress on Computational Intelligence (Cat. gled can two couples get?” Physics Letters A 273, 213 – No.98TH8360) (1998) pp. 69–73. 217 (2000). [43] Anders Sørensen and Klaus Mølmer, “Quantum compu- [60] Kosuke Mitarai, Makoto Negoro, Masahiro Kitagawa, tation with ions in thermal ,” Phys. Rev. Lett. 82, and Keisuke Fujii, “Quantum circuit learning,” Phys. 1971–1974 (1999). Rev. A 98, 032309 (2018). [44] Anders Sørensen and Klaus Mølmer, “Entanglement and [61] D. Zhu, N. M. Linke, M. Benedetti, K. A. Landsman, quantum computation with ions in thermal motion,” N. H. Nguyen, C. H. Alderete, A. Perdomo-Ortiz, N. Ko- Phys. Rev. A 62, 022311 (2000). rda, A. Garfoot, C. Brecque, L. Egan, O. Perdomo, and [45] Jan Benhelm, Gerhard Kirchmair, Christian F. Roos, C. Monroe, “Training of quantum circuits on a hybrid and Rainer Blatt, “Towards fault-tolerant quantum com- quantum computer,” arXiv preprint arXiv:1812.08862 puting with trapped ions,” Nature Physics 4, 463 EP – (2018). (2008). [62] Vicente Leyton-Ortega, Alejandro Perdomo-Ortiz, and [46] Dmitri Maslov and Yunseong Nam, “Use of global inter- Oscar Perdomo, “Robust implementation of generative actions in efficient quantum circuit constructions,” New modeling with parametrized quantum circuits,” arXiv Journal of Physics (2017), 10.1088/1367-2630/aaa398. preprint arXiv:1901.08047 (2019). [47] David J. C. MacKay, Information Theory, Inference & [63] Kathleen E. Hamilton, Eugene F. Dumitrescu, Learning Algorithms (Cambridge University Press, New and Raphael C. Pooser, “Generative model bench- York, NY, USA, 2002). marks for superconducting qubits,” arXiv preprint [48] The meaning and usage of precision in the field of infor- arXiv:1811.09905 (2018). mation retrieval differs from the definition of precision [64] Yuxuan Du, Min-Hsiu Hsieh, Tongliang Liu, and within other branches of science and statistics. Dacheng Tao, “The expressive power of parameter- [49] It is an interesting problem to optimize Nreads such that ized quantum circuits,” arXiv preprint arXiv:1810.11922 it maximizes the sensitivity of the score towards differ- (2018). entiating two probability distributions. This problem is [65] J.R. Johansson, P.D. Nation, and Franco Nori, “Qutip 2: left for future work. A python framework for the dynamics of open quantum [50] One may argue that the same scalability issue applies to systems,” Computer Physics Communications 184, 1234 the cost function in Eq. (1) used for learning. However, – 1240 (2013). 13

[66] Lester James V. Miranda, “Pyswarms, a research-toolkit [68] Elizaveta Levina and Peter Bickel, “The earth mover’s for particle swarm optimization in python,” (2017), distance is the mallows distance: Some insights from 10.5281/zenodo.986300. statistics,” in Computer Vision, 2001. ICCV 2001. [67] Yossi Rubner, Carlo Tomasi, and Leonidas J Guibas, Proceedings. Eighth IEEE International Conference on, “The earth mover’s distance as a metric for image re- Vol. 2 (IEEE, 2001) pp. 251–256. trieval,” International journal of computer vision 40, 99– [69] Ofir Pele and Michael Werman, “Fast and robust earth 121 (2000). mover’s distances,” in 2009 IEEE 12th International Con- ference on Computer Vision (IEEE, 2009) pp. 460–467.

Supplementary Material for “A generative modeling approach for benchmarking and training shallow quantum circuits”

Comparison of cost functions

In realistic machine learning scenarios, we typically do not have access to the complete target probability dis- tribution, nor to that obtained from the output state of a quantum circuit. Hence, we need to compare the two distributions at the level of histograms and using a finite number of samples and measurements. Here we compare three cost functions via simulations in silico. First, we defined the clipped negative log-likelihood as

D 1 X (d) (θ) = ln(max(, Pθ(x ))), (3) Cnll −D d=1 where probabilities are estimated from samples and  > 0 is a small number that avoids an infinite cost when (d) Pθ(x ) = 0. This may happen if for example entanglement in the circuit prevents us from measuring configuration x(d). Moreover, when the number of variables N is large, all except few configurations will ever be measured due to the finite number of samples. Note that by re-normalizing all the probabilities after the clipping, we could interpret this variant as a Laplace additive smoothing. All the experiments in the main text are carried out using the clipped negative log-likelihood with  = 10−8. Second, we defined the earth mover’s distance [67] as

emd(θ) = min d(x, y) F , (4) C F h i P P where F (x, y) is a joint probability distribution such that y F (x, y) = PD(x) and x F (x, y) = Pθ(y), that is, its marginals correspond to the data and circuit distributions, respectively. Intuitively, this is the minimum cost of turning one histogram into the other where the ground metric d(x, y) specifies the cost of transporting a single unit from x to y. We chose d(x, y) to be the Hamming distance between strings x 1, +1 N and y 1, +1 N . Since we normalize histograms to sum up to one, the Earth Mover’s Distance is equivalent∈ {− to} the 1-st Wasserstein∈ {− } distance [68]. In our simulations, we use the PyEMD Python library for fast computation of the earth moving distance [69]. Third, we defined the moment matching as

N N 1 X 2 X (θ) = ( x x )2 + ( x x x x )2, (5) Cmm N h iiPD − h iiPθ N(N 1) h i jiPD − h i jiPθ i − i>j where the expectation values PD and Pθ are taken with respect to data and circuit distributions, respectively. This cost function can be generalized to include moments beyond the second as well as using different positive exponents for the error. We compared the cost functions on the task of learning thermal states of size N = 5, with L = 3 layers and all topology. Figure5 shows the bootstrapped median KL divergence on 25 realizations and 90% confidence interval as learning progress for 100 iterations. The fact that nll (red diamonds) outperforms other cost functions does not come as a surprise; minimization of the negative log-likelihoodC is indeed directly related to minimization of the KL divergence. However, we expect the performance of DDQCL based on this cost function to degrade quickly as the size of the problem increases. In realistic applications, the relevant probabilities in Eq. (3), i.e. those associated with the 14 data, are a vanishing fraction of the 2N probabilities. Moreover, they need to be estimated from a finite number of measurements. The earth mover’s distance emd (green pentagons) performs well, but it suffers from similar scalability issues. Fast algorithms for the computationC of this distance may struggle when the number of bins in the histogram increases exponentially as in our case. However, it is reassuring to see that alternative cost functions with no relation to the KL divergence can still produce satisfactory results. Surprisingly, the moment matching mm (purple crosses) closely tracks the other cost functions while retaining computational efficiency. In fact, even thoughC a large number of samples may be needed to obtain low-variance estimates for the moments, only (N 2) terms are computed at each iteration. We expect this cost function to be a good heuristic for DDQCL on largeO systems.

1.0

0.9 Cemd nll 0.8 C Cmm 0.7

0.6

0.5 KL divergence 0.4

0.3

0.2 0 20 40 60 80 100 Iteration

FIG. 5. Bootstrapped median KL divergence and 90% confidence interval of circuits learned by minimizing different cost functions. Both the moment matching (Cmm) and earth mover’s distance (Cemd) closely track the clipped negative log-likelihood (Cnll) used in the main text.

Approximate preparation of coherent thermal states for N = 6

FIG. 6. DDQCL preparation of approximate coherent thermal states for the case of six qubits, different circuit depths L ∈ {1, 2, 3}, and different temperatures T ∈ {2Tc,Tc,Tc/1.5}. In these simulations, an all-to-all topology was used for the entangling gate layer (l = 2). We report the bootstrapped median and 90% confidence interval. For the low temperature case shown in panel (c), the inverse Bethe approximation converged only in 7 out of 25 instances. Hence, no median value was extracted. We plotted a KL divergence of 2.0 as a reference, but it is clear that for all those instances DDQCL outperformed the inverse Bethe approximation. 15

Details for theoretical and experimental results

FIG. 7. Detailed comparison at the level of output states between simulation of circuits and the respective experimental implementations in the ion trap quantum computer. These are the best circuits in terms of KL divergence that were obtained by DDQCL for BAS(2, 2) under three different setting: (a) all-to-all with L = 2 layers, (b) star with L = 4, and (c) star with L = 2. The theoretical state obtained in (a) is close to optimal and attains an entanglement entropy averaged over all two-qubit subsets of SBAS(2,2) = 1.69989. Circuit diagrams for (a-c) are shown in Figure8.

0 Rx(+.00π) Rz(+.00π) XX XX XX | i

0 Rx(+.00π) Rz(+.50π) +.50π XX XX (a) | i 0 Rx(+.00π) Rz(+.25π) απ απ XX | i − −

0 Rx(+.00π) Rz(+.00π) +απ απ .50π | i − −

0 Rx(+.15π) Rz( .01π) XX XX XX Rz( .63π) Rx( .42π) Rz(+.26π) XX XX XX | i − − −

0 Rx(+.30π) Rz(+.31π) +.48π Rz(+.01π) Rx(+.21π) Rz(+.77π) .01π (b) | i − 0 Rx(+.61π) Rz(+.24π) .76π Rz(+.95π) Rx( .62π) Rz(+.99π) +.51π | i − −

0 Rx( .31π) Rz( .99π) .13π Rz(+.65π) Rx( .56π) Rz(+.74π) .46π | i − − − − −

0 Rx(+.11π) Rz( .02π) XX XX XX | i −

0 Rx( .88π) Rz(+.28π) .48π (c) | i − − 0 Rx(+.20π) Rz( .48π) .32π | i − −

0 Rx( .23π) Rz( .15π) .32π | i − − −

FIG. 8. Circuit diagrams for the analytical solution and for the best circuits found by DDQCL under three different qubit- to-qubit connectivity topologies. (a) the all-to-all circuit with L = 2 layers can achieve zero KL divergence for BAS(2, 2) by setting α = π−1 arctan(2−1/2). All single-qubit rotations can be set to zero. DDQCL found an almost optimal solution with α = 0.2 and two non-zero Rz rotations. These Rz gates act as the identity on the |0000i state. (b) the star circuit with L = 4. (c) the star circuit with L = 2. The qBAS scores for (a-c) are shown in Figure2 in the Main Text.

Entanglement entropy of BAS(2,2)

The measure of entanglement entropy used in this work is the average von Neumann entropy over all 2-qubit subsets [59]. Consider a 4-qubit pure state ρ = ψ ψ and label the four qubits as A, B, C, and D. Then, the entropy can be computed as | i h | 1h i S = Tr(ρ log ρ ) + Tr(ρ log ρ ) + Tr(ρ log ρ ) , (6) ψ −3 AB 2 AB AC 2 AC AD 2 AD 16 where ρXY is the reduced for the subset XY . As an example, the 4-qubit cat state has an entanglement entropy of SGHZ = 1. Now consider a pure state encoding the uniform probability distribution over the BAS(2, 2) data set in the computational basis 1 BAS(2, 2) = eiu1 0000 + eiu2 0011 + eiu3 0101 + eiu4 1010 + eiu5 1100 + 1111  . (7) | i √6 | i | i | i | i | i | i A direct computation shows that the entropy of this state is h q q  S = 1 2 cos(u2−u3−u4+u5)+1 tanh−1 cos(u2−u3−u4+u5)+1 BAS(2,2) − 9 ln(2) 2 2

u1−u3−u4   2 2 u1−u3−u4  u1−u2−u5   2 2 u1−u2−u5  + cos 2 + 1 log2 3 cos 4 + cos 2 + 1 log2 3 cos 4  p   p  + log 4 + 2√2 cos (u u u + u ) + 1 + log 4 2√2 cos (u u u + u ) + 1 2 2 − 3 − 4 5 2 − 2 − 3 − 4 5 u1−u3−u4   2 2 u1−u3−u4  u1−u2−u5   2 2 u1−u2−u5  cos 2 1 log2 3 sin 4 cos 2 1 log2 3 sin 4 − i − − − log (31104) . − 2 (8) Defining new variables v = u u u + u and v = u u u , the expression above reduces to 1 2 − 3 − 4 5 2 1 − 3 − 4 h q q  S = 1 2 cos2 v1  tanh−1 cos2 v1  BAS(2,2) − 9 ln(2) 2 2

2 v2  2 2 v2  2 v2−v1  2 2 v2−v1  + 2 cos 4 log2 3 cos 4 + 2 cos 4 log2 3 cos 4  p   p  + log 4 + 2√2 cos (v ) + 1 + log 4 2√2 cos (v ) + 1 (9) 2 1 2 − 1 2 v2  2 2 v2  2 v2−v1  2 2 v2−v1  + 2 sin 4 log2 3 sin 4 + 2 sin 4 log2 3 sin 4 i log (31104) . − 2 In Figure9 we graphically show the entropy SBAS(2,2) as a function of the new variables v1 and v2. Such a function has extrema 1 27 min S = log 1.25163, BAS(2,2) 3 2 2 ≈ (10) 1 max S = log (12) 1.79248. BAS(2,2) 2 2 ≈ For the minimum value, v1 = v2 = 0, which can be obtained setting u1 = = u5 = 0. For the maximum value, v = 4π/3 and v = 2π/3, which can be obtained setting u = u = u = 0 and··· u = u = 2π/3. Interestingly, the 1 2 1 2 3 4 − 5 maximum of SBAS(2,2) happens to coincide with the maximum entanglement entropy known for any 4-qubit state [59].

FIG. 9. Entanglement entropy SBAS(2,2) as a function of variables v1 and v2. Points in the domain represent states that encode the BAS(2, 2) data set in the computational basis. The maximum entropy (black dots) attained is 1.79248 which coincides with the maximum entanglement entropy known for any 4-qubit state [59].