Deep autoregressive models for the efficient variational simulation of many-body quantum systems

1, 1, 1, 2, 1, Or Sharir, ∗ Yoav Levine, † Noam Wies, ‡ Giuseppe Carleo, § and Amnon Shashua ¶ 1The Hebrew University of Jerusalem, Jerusalem, 9190401, Israel 2Center for Computational Quantum Physics, Flatiron Institute, 162 5th Avenue, New York, NY 10010, USA Artificial Neural Networks were recently shown to be an efficient representation of highly-entangled many-body quantum states. In practical applications, neural-network states inherit numerical schemes used in Variational Monte Carlo, most notably the use of Markov-Chain Monte-Carlo (MCMC) sampling to estimate quantum expectations. The local stochastic sampling in MCMC caps the potential advantages of neural networks in two ways: (i) Its intrinsic computational cost sets stringent practical limits on the width and depth of the networks, and therefore limits their expressive capacity; (ii) Its difficulty in generating precise and uncorrelated samples can result in estimations of observables that are very far from their true value. Inspired by the state-of-the-art generative models used in machine learning, we propose a specialized Neural Network architecture that supports efficient and exact sampling, completely circumventing the need for Markov Chain sampling. We demonstrate our approach for two-dimensional interacting spin models, showcas- ing the ability to obtain accurate results on larger system sizes than those currently accessible to neural-network quantum states.

Introduction.– The theoretical understanding and been limited to relatively shallow architectures, far from modeling of interacting many-body quantum matter rep- the very deep networks used in modern machine learn- resents an outstanding challenge since the early days of ing applications. This practical limitation is mostly due quantum mechanics. At the heart of several problems in to two main factors. First, it is computationally expen- condensed matter, chemistry, nuclear matter, and more sive to obtain quantum expectation values over ConvNet lies the intrinsic difficulty of fully representing the many- states using stochastic sampling based on Markov Chain body wave-function, in principle needed to exactly solve Monte Carlo (MCMC), as is it is customary in VMC Schrodinger’s equation. These mainly fall into two cat- applications. Second, there is an intrinsic optimization egories: on one hand, there are states traditionally used bottleneck to be faced when dealing with a large number in stochastic Variational Monte Carlo (VMC) calcula- of parameters. However, both limitations are routinely tions [1]. Chief example are Jastrow wave-functions [2], faced when learning deep autoregressive-models, recently carrying high entanglement, but also with a limited varia- introduced machine-learning techniques that have en- tional freedom. On the other hand, more recently, tensor- abled previously intractable applications. network approaches have been put forward, based on In this paper, we propose a pivotal shift in the use of non-stochastic variational optimization, and most chiefly Neural-network Quantum States (NQS) for many-body on entanglement-limited variational wave-functions [3–6]. quantum systems, that markedly sets a discontinuity In an attempt to circumvent the limitations of the ap- with traditionally adopted VMC methods. Inspired by proaches above, architectures based on Artificial Neural the latest advances in generative machine learning mod- Networks (ANN) were proposed as variational wave func- els, we introduce variational states for which both the tions [7]. Restricted Boltzmann machines (RBM), which sampling and the optimization issues are substantially represent relatively veteran machine learning constructs, alleviated. Our model is composed of a ConvNet that were shown to be capable of representing volume-law en- allows direct, efficient, and i.i.d. sampling from the tanglement scaling in 2D [8–11]. Recently, other neural- highly entangled it represents. The net- network architectures have been explored. Most notably, work architecture draws upon successful autoregressive convolutional neural networks (ConvNets) – leading deep models for representing and sampling from probability learning architectures that stand at the forefront of em- distributions. Those are widely employed in the machine pirical successes in various Artificial Intelligence domains learning literature [15], and have been recently used for – have been applied to both bosonic [12] and frustrated arXiv:1902.04057v3 [cond-mat.dis-nn] 19 Jan 2020 statistical mechanics applications [16], as well as den- spin systems [13]. sity matrix reconstructions from experimental quantum Despite the provable theoretical advantage of ConvNet systems [17]. We generalize these autoregressive models architectures [14], however, early numerical studies have to treat complex-valued wave-functions, obtaining highly expressive architectures parametrizing an automatically normalized many-body quantum wave-function. Neural Autoregressive Quantum States.– We consider ∗ [email protected][email protected] in the following a pure quantum system, constituted by [email protected] N discrete degrees of freedom s (s , . . . , s )(e.g. spins, ‡ ≡ 1 N § gcarleo@flatironinstitute.org occupation numbers, etc.) such that the wave-function ¶ [email protected] amplitudes Ψ(s) fully specify its state. Here we follow 2 the approach introduced in [7], and represent ln(Ψ(s)) probabilities for si to take one of the M possible dis- as a feed-forward ANN, parametrized by a possibly large crete values sj, conditioned on given s1, . . . , si 1. It is − number of network connections. Given an arbitrary set crucial that each output vector vi does not depend on of quantum numbers, s, the output value computation of the value of si or any of the variables appearing with the corresponding NQS, known as its forward pass, can a larger index, si+1, . . . , sN , for a pre-chosen ordering. generally be described as a sequence of K matrix-vector To ensure that each network outputs a valid conditional multiplications separated by the applications of a non- distribution, it is then sufficient to take the exponent linear element-wise activation function σ:C C. More of each entry and normalizing it according to the l1 formally, the unnormalized log amplitudes are→ given by norm, i.e. p (s s , . . . , s )=exp(v )/ exp(v ) , i i i 1 1 i,si s0 i,s0 also known as a|Softmax− operation. | | P ln(Ψ(s)) = WK σ (Wk 1σ ( σ (W1s))) , (1) Even though it is possible to use N separate networks − ··· for each of the N conditional probabilities, and each ac- K ri ri 1 cepting a variable number of inputs, in practice it is where Wi C × − i=1, r0=N, rK =1, r1, . . . , rK 1 are knownW≡ { as the∈ widths}of the network, and K as the− more common to use a single ANN that accepts N in- depth. In practice, specialized variants of eq.1 are com- puts and outputs N probability vectors. In this case, monly used, e.g. early applications have focused on shal- the autoregressive property is enforced by masking the low architectures (k=1) such as Restricted Boltzmann inputs si, . . . , sN for the i’th output vector, i.e. ensur- Machines, for which the activation function is typically ing that the contributions of higher-ordered spins to the taken to be σ(z)= ln cosh(z). Other, deeper, choices are output of the network vanish. PixelCNN [18] is such an often advantageous, such as convolutional networks, in architecture, and is built as a sequence of masked convo- which most of the matrices are restricted to act on a lutional layers, whose filters are restricted to having zeros subset of the quantum numbers, computing convolutions at positions “ahead”. For example, in a one dimensional with small filters. system, a filter of width R, where R is odd, would be con- strained to have (w , . . . , w(R 1) , 0,..., 0), and thus the Given a NQS representation of a many-body quan- 1 − /2 tum state, estimating physical observables Ψ Ψ ith output of each layer depends uniquely on the indices h |O| i at s1, . . . , si 1. of a local operator , is in general analytically in- − tractable, but can beO realized numerically through a A chief advantage of networks with the autoregressive stochastic procedure, as done in VMC. Specifically, property, is that directly drawing samples according to Ψ Ψ = Oloc , where ... denote statistical ex- P (s) is conceptually straightforward. One can sample h |O| i h iP h iP pectation values over the Born probability density each si in sequence, according to its given conditional 2 loc (s) Ψ(s) , and O s s0 Ψ(s0)/Ψ(s) is the probability that depends just on the previously sampled s0 P ≡ | | ≡ h |O| i (s1, . . . , si 1). Carefully exploiting the intrinsic sparse- corresponding statistical estimator. In the vast majority − of VMC applications, includingP NQS so-far, a MCMC al- ness of the network weights, further leads to a very effi- gorithm is typically used to generate samples from (s). cient algorithm for sampling [19]. Remarkably, the com- P While MCMC is a rather flexible technique, it comes with plexity of sampling a full string s1 . . . sN in a PixelCNN a large computational cost, especially for deep ANNs. architecture can be reduced to the complexity of just a Additionally, though MCMC asymptotically generates single forward pass. samples that are correctly distributed, in practice it can Our NAQS model for representing wave-functions is be plagued by very large autocorrelation times, and lack based on the same NADE principles so-far described. of ergodicity, that can severely affect the quality of the Specifically, just as probability functions can be factor- samples being generated. ized into a product of conditional probabilities, we repre- In light of these limitations, we propose here a sent a normalized wave-function as a product of normal- specialized network architecture that instead supports ized conditional wave-functions, such that efficient and exact sampling. Our approach is an extension of Neural Autoregressive Density Estima- N Ψ(s1, . . . , sN )= ψi(si si 1, . . . , s1), (2) tors (NADE) [15] to quantum applications, result- | − i=1 ing in what we dub Neural Autoregressive Quantum Y States (NAQS). To start with, first consider the task where ψi(si si 1, . . . , s1) are such that, for any fixed of representing a with NADE | − i 1 models. These models build on the so-called autore- (s1, . . . , si 1) 1,...,M − , they satisfy the normal- − ∈ { } 2 gressive property, which entails a decomposition of the ization condition ψi(s0 si 1, . . . , s1) =1. If this s0 | | − | full probability distribution as a product of condition- condition holds, then a strong normalization condition N for the full wave-functionP follows (see app.A for proof): als, i.e. P (s1, . . . , sN )= i=1 pi(si si 1, . . . , s1). The power of these models comes from the| − observation that, Q N for every i, the conditional probabilities pi can be Claim 1 Let Ψ:[M] C such that N → N individually represented as an ANN receiving as in- Ψ(s1, . . . , sN )= ψi(si si 1, . . . , s1), where ψi i=1 | − { }i=1 put the variables s1, . . . , si 1 and outputting a vector are normalized conditional wave-functions. Then, Ψ is − Q 2 vi (vi,s , vi,s , . . ., vi,s ) representing the unnormalized normalized, i.e., Ψ(s1, . . . , sN ) =1. ≡ 1 2 M s1,...,sN | | P 3

2 (a) Neural Autoregressive (b) Sampling (s1, . . . , s4) ∼ |Ψ(s1, . . . , s4)| l -normalization Spins Masked 2 Convolutions (in log-space) Step 1: Step 2: Step 3: Step 4: Normalized s1 lnΨ1 s s s Wave-function ∼ s1 1 1 1 s2 s2 s2 lnΨ2 ∼ s2 P ln Ψ (s | ...) i i i ∼ s s3 s3 lnΨ3 3 ∼ s4 s4 lnΨ4

FIG. 1. Neural Autoregressive Quantum States are neural networks that represent a normalized wave-function, Ψ(s1, . . . , sN ), by factoring it to a sequence of normalized conditional wave-functions, denoted by Ψi(si si 1, . . . , s1) for the i’th particle, in − a manner similar to that of Neural Autoregressive Density Estimator (see eq.3). (a) Illustration| of a deep 1D-convolutional NAQS model following the PixelCNN [18] architecture. Each column of nodes represent a layer in the network, starting with the input layer representing the N-particle configuration (s1, . . . , sN ). Each internal node in the graph is a complex vector computed according to its layer type. Namely, masked convolutions are limited to having local connectivity, where a node at the j’th row is only connected to nodes with connections to si where i

As in the NADE case, we represent conditional wave- Optimization.– The NAQS representation of many- function with an ANN accepting (s1, . . . , si 1) and out- body wave functions can be used in practice for several − M putting a complex vector vi (vi,s1 , vi,s2 , . . . , vi,sM ) C applications. These include for example ground-state for each of the M possible≡ values taken by the∈ lo- search [7], quantum-state tomography [20], dynam- cal quantum numbers si. To obtain a normalized ics [7], and quantum circuits simulation [21]. Here conditional wave-function, we take its exponent we more specifically focus on the task of finding and normalize it according to the l -norm, i.e., the ground state of a given Hamiltonian . In this 2 H 2 context, we denote by Ψ the wave-function rep- ψi(si si 1, . . . , s1) =v ˆi,s exp(vi,s )/ exp(vi,s ) . W | − i ≡ i s | 0 | resented by a NAQS of a fixed architecture that is Given this parametrization, the full wave-function0 qP parameterized by , and we wish to find values log-amplitude ln Ψ(s1 . . . sN ) is easily obtained, once all W W that minimize the energy, i.e., ∗= argmin E( ), the vectors v1,..., vN have been computed, as given by: W W W where E( ) Ψ Ψ = Es Ψ 2 [Eloc(s; )], W ≡h W |H| W i ∼| W | W Ψ (s0) N E (s; ) W , and is usually a highly loc s0 s,s0 Ψ (s) 1 2 W ≡ H W H ln Ψ(s)= v ln exp(v ) . (3) sparse matrix, and so computing E for a given sample i,si − 2 | i,s0 | P loc i=1 s0 ! takes at most O(N) forward passes. X X The common approach for solving the optimization As in the probabilistic autoregressive model, we can rep- problem above with an NQS is to estimate the gradient of resent the entire NAQS by a single neural network out- E( ) with respect to , and use variants of stochastic putting N complex vectors, as illustrated in Fig.1a. gradientW descent (SGD)W to find the minimizer of E( ). Though our proposed architecture can work with either Estimating the gradient can be done by first employingW complex- or real-parameters, we have found that using a variant of the log-derivative trick, i.e., the latter work better, where we represent each com- plex conditional log-amplitude using two real values, log- ∂E ∂ ln Ψ = E 2 Re (Eloc(s)∗ E∗) W . (4) magnitude and phase. ∂ s Ψ 2 − ∂ Moreover, there is a special relationship between W ∼| W |   W  a NAQS and its induced Born probability, since Now, while we can efficiently compute the log deriva- 2 N 2 Ψ(s1, . . . , sN ) = i=1 ψi(si si 1, . . . , s1) , implying tive of Ψ , exactly computing the expected value is in- | 2 | | | − | W that ψi(s) is a valid conditional probability. Thus, tractable, but we can still approximate it by computing | | Q B the induced Born probability of a NAQS has the ex- its value over a finite batch of samples s(i) i=1. The act same structure of a NADE model. Specifically, quality of this approximation depends on{ the} batch size, taking the squared magnitude of its output vectors, B, but also on the degree of correlations between the in- 2 i.e., i, s0, v¯i,s0 = vˆi,s0 , transform NAQS into a standard dividual samples. The advantages of our direct sampling NADE∀ representation| | of this distribution, which impor- method supported by NAQS over MCMC are twofold in tantly includes its efficient and exact sampling method. this context: (i) Faster sampling: each individual sample In contrast to standard MCMC sampling employed for can be generated with fewer network passes, and gener- correlated wave-functions, NAQS thus allows for direct, ating a batch of samples is embarrassingly parallel, as efficient sampling with the computational complexity of opposed to the sequential nature of MCMC; (ii) Faster a single forward pass, as depicted in Fig.1b. convergence: because the generated samples are exact 4

Γ NAQS Energy QMC Energy NAQS σz QMC σz h| |i h| |i Direct Sampling MCMC k = 100 MCMC k = 10 2.0J -2.4096022(2) -2.40960(3) 0.78326(2) 0.78277(38) MCMC k = 300 MCMC k = 50 3 10− 2.5J -2.7476550(5) -2.74760(3) 0.57572(3) 0.57566(63) 2 10− 3.0J -3.1739005(5) -3.17388(4) 0.16179(4) 0.16207(54)

4 3.5J -3.6424799(3) -3.64243(4) 0.11094(3) 0.11011(30) 10− 4.0J -4.1217979(2) -4.12178(4) 0.09725(2) 0.09728(24)

3 Relative Error (solid lines) 10− 5 TABLE I. Estimates of the ground state energies of the Energy Variance (dashed lines) 10− 0 2500 5000 7500 10000 12500 15000 17500 20000 transverse-field Ising model for different values of Γ on a Iteration 12 12 lattice, and the corresponding estimates of σz , as × h i obtained by either NAQS or QMC. FIG. 3. Comparing the effects of the sampling method, either MCMC or direct sampling, on the training procedure for the transverse-field Ising model with Γ=3J, close to the critical value, on a large (21 21) lattice. When using MCMC, sam- ples are taken every×k 10, 50, 100, 300 steps in the chain, where increasing k reduces∈ { the correlation} between samples at Direct Sampling the expense of increased computational cost. The solid lines MCMC 1 shows the relative error to the minimal energy found for this MCMC 2 system in our experiments, and dashed lines shows the energy MCMC 3 MCMC 4 variance. Since MCMC takes a considerable time to complete just a single iteration, we have restricted the training to max- imum of 100 hours. FIG. 2. An illustration of the two modes of the ground state, by taking the first two principal components of the generated samples. The green points correspond to our di- phase transition. rect sampling method, and the other colors represent different In order to quantify the behavior of our model in a MCMC chains. The plot was generated by training a NAQS region of broken symmetry, we consider the case of a on the transverse-field Ising model with Γ=2J, below the crit- ical value, on a 12 12 lattice until convergence to the ground transverse-field deep in the ferromagnetic region, namely state, and then sampling× from the trained NAQS using either Γ=2J. The PCA visualization in Fig.2 shows that for our direct sampling method, or 4 separate MCMC samplers. this value of Γ the MCMC chains initialized at one of the oriented states composing the ground state are stuck at that specific orientation and cannot come around to sam- and i.i.d., and so result in more accurate estimates of pling spin configurations that correspond to the opposite the gradient at each step. orientation. In contrast, spin configurations sampled di- Experiments.– As a first benchmark for our approach, rectly from the distribution by using our proposed tech- we consider a case where MCMC sampling can be nique include equally probable configurations from both strongly biased. A paradigmatic quantum system ex- orientations. The ergodicity breaking in local MCMC hibiting this issue is found in the ferromagnetic phase is also directly quantifiable by the expectation value of of the transverse field Ising model. The Hamiltonian the total magnetization m σz , for which we ex- ≡ h i i i for this model is given by H= J σi σj Γ σi , pect m=0 on any finite lattice. Indeed, the i.i.d. sam- z z i x P where the summation runs over− pairs of lattice− edges. pling enabled by our model correctly explores the two Here we study the case of a 2D squareP lattice withP open relevant ferromagnetic states (in agreement with the vi- boundary conditions, and for varying strengths of the sualization of Fig.2) and reaches a value close to a total transverse field. The system is in a ferromagnetic phase zero magnetization, in stark contrast with MCMC esti- when the transverse magnetic field Γ is weak with re- mation that effectively computes σz 0.78 rather than h| |i ≈ spect to the coupling constant, and specifically in 2D m. As expected, directly estimating σz with our sam- h| |i when Γ<Γc 3.044J [22]. pling method correctly recovers it to a high precision, see In order to' verify the correctness of the model proposed TableI. in section 2, we begin by comparing the ground state en- The limitation of the MCMC procedure in providing ergy and system magnetization obtained for a 12 12 sys- independent samples is not only conceptually relevant, tem with those obtained by an unbiased quantum× Monte but it can also have consequences on the quality of the re- Carlo (QMC) simulation. Using our open-source library, sulting ground-state approximations. In Fig.3, we show FlowKet [23], we employ a NAQS model following the the training procedure for the transverse-field Γ=3J, PixelCNN architecture, using the ADAM [24] SGD vari- close to the critical value on a larger system (21 21). ant with the gradient estimator of eq.4. Additional The same NAQS architecture was trained once with× the technical details are listed in app.B. TableI shows that i.i.d. sampling procedure and once with MCMC chains our model achieves very high accuracy for both magneti- of varying lengths. The optimization advantage obtained zation and energy densities for different transverse field when relying on independent samples clearly emerges values across the phase diagram: when the system is in from those figures – this procedure is much quicker and the ferromagnetic phase, the normal phase, and near the results in a significantly more accurate ground state en- 5

Lattice PEPS NAQS QMC ACKNOWLEDGMENTS 10 10 -0.628601(2) -0.628627(1) -0.628656(2) 16×16 -0.643391(3) -0.643448(1) -0.643531(2) × This work is supported by ISF Center grant 1790/12 TABLE II. Ground state energies for the antiferromagnetic and by the European Research Council (TheoryDL Heisenberg model with open boundary conditions, as obtained project). Yoav Levine is supported by the Adams Fel- by a state-of-the-art PEPS model [25], our NAQS model, and lowship Program of the Israel Academy of Sciences and the exact QMC estimation, as reported in Liu et al. [25]. Humanities. QMC simulations for the 2D Transverse- Field Ising Model have been performed using the open- source ALPS Library [29]. ergy and lower energy variance H2 H 2. h i−h i As a further benchmark, we also apply our method Appendix A: Proof of Claim1 to a more complex system, the two-dimensional an- tiferromagnetic Heisenberg model with open bound- The proof follows an induction argument. For N = 1, ary conditions, whose Hamiltonian is given by it holds that Ψ(s ) Ψ (s ), and so Ψ is normalized H= σi σj +σi σj +σi σj . We evaluate our approach 1 ≡ 1 1 i,j x x y y z z because Ψ1 is normalized with respect to s1. Assume the by comparingh i the ground state energy obtained for P claim holds for N = k, then for N = k + 1 we first define 10 10 and 16 16 systems with those obtained by ˜ k × × Ψ(s1, . . . , sk) i=1 Ψi(si si 1, . . . , s1), and so QMC simulations, as well as other variational meth- ≡ | − ods. We find that NAQS meaningfully improve upon Q 2 Ψ(s , . . . , s ) the accuracy of the best known variational methods for | 1 k+1 | s ,...,s this problem. Namely, for 10 10, a relative error of 1 Xk+1 5 5 × k+1 8.7 10− 0.6 10− was reported in Liu et al. [25] us- 2 ing× a PEPS± model,× whereas with our approach we were = Ψi(si si 1, . . . , s1) | | − | 5 5 s ,...,s i=1 able to obtain 3.5 10− 0.4 10− . See tableII for ex- 1 Xk+1 Y act results. While× the PEPS± × results can be, in principle, =1 k ∗ further improved, increasing the accuracy comes with a 2 2 = Ψi(si si 1, . . . , s1) Ψi(sk+1 sk, . . . , s1) very significant computational requirements [26] due to | | − | | | | s ,...,s i=1 !s the unfavorable computational scaling w.r.t. the bond- 1X k Y zXk+1 }| { dimension. Moreover, though not directly comparable, it = Ψ(˜ s1, . . . , sk) ∗∗= 1, is noteworthy that the relative error of the ground state s1,...,sk X energy with periodic boundary conditions obtained by NQS with MCMC sampling is significantly less accurate where ( ) is because Ψk+1 is a normalized conditional ∗ than ours (Carleo and Troyer [27], Choo et al. [28] report wave function, and ( ) because of the induction assump- 4 ∗∗ relative error greater than 2 10− ). tion.  × Discussion.– In this work, we have shown a scheme to facilitate the practical employment of contemporary Appendix B: Technical Details deep learning architectures to the modeling of many- body quantum systems. This constitutes a striking im- provement over currently used RBM methods that are In this section we cover the essential technical details limited to only hundreds of parameters, and very shallow of our models and how they are optimized. networks. A further practical advantage we gain is the ability to make use of the substantial body of knowledge regarding optimization of these architectures that is accu- 1. Architecture mulating in the deep learning literature. We empirically demonstrate that by employing common deep learning Our chosen architecture for our implementation of optimization methods such as SGD, our direct sampling Neural Autoregressive Quantum State is loosely inspired approach allows us to train very large convolutional net- by that of PixelCNN [18], which uses a row-wise order- works (20 layers, 21 21 lattice, 1M parameters). Our ing of the particles for the conditional wave-functions. All presented experiments× demonstrate∼ that even for rela- parameters and operations in the network are real, where tively simple systems MCMC sampling can fail, while the the complex log-amplitudes of the conditional wave func- i.i.d. sampling enabled by our model succeeds. Relying tions are represented as two real numbers, as discussed in on the theoretically promising results regarding convolu- the body. More specifically, it is composed of two inter- tional networks’ capabilities in representing highly entan- acting branches: (i) a “vertical” branch for representing gled systems [14], namely, systems satisfying volume-law, conditional dependencies between a given particle and all we view the enabling of their optimization as an integral particles above it in the 2D lattice it resides on, and (ii) a step in reaching currently unattainable insight on a vast “horizontal” branch for representing conditional depen- variety of quantum many body phenomena. dencies between a given particle and all particles to its 6

2. Handling Symmetries

While our general NAQS architecture can already rep- 3x3 Conv with resent wave-functions quite well, we have found that zeroing at (3,3) leveraging the inherent symmetries of a given problem, e.g., invariance to rotations and flips, can dramatically Vertical and improve the accuracy of our model. Specifically, we use Horizontal a self-ensemble scheme to symmetrize our model, where Zero-Padding for a model f(s) and a given input spin-configuration, we transform it according to its symmetries, denoted by the set , run each of them through our model, and 3x3 Conv Down Shift Concat aggregateT the resulting log-amplitude outputs using the following equations for our symmetrize model Sym(f):

Vertical Zero Padding 1 1 2 Re(f(T s)) Re(Sym(f)(s))= ln e · , (B1) 2 T ! X∈T |T | i Im(f(T s)) Vertical Branch Horizontal Branch Im(Sym(f)(s))=Im ln e · . (B2) T !! X∈T FIG. 4. An illustration of a single block that our architecture It is important to emphasize that while there are many is composed of. ways to symmetrize a model, we cannot use any aggre- gation operation – it must also preserve its probabilistic meaning for we to be able to sample from it efficiently. We propose to incorporate the possible symmetries into our generative model, assuming we first sample a trans- left. Each branch comprises a sequence of convolutional formation T from with equal probability 1/ , and then T layers. The “vertical” containing convolutional layers draw a sample fromT our model as described in the main with 3 3 filters and 32 channels, where we add two × text, followed by transforming it with T . This translates rows of zero-padding to the top lattice before applying to a mixture model over the squared magnitudes of the the convolution, to ensure a particle is not dependent network’s predicted amplitude, and eq. B1 realizes it in on rows below it. For the “horizontal” branch, we use log-space, where the real part of the output represent convolutional layers with 3 3 filters and 32 channels, × the log-magnitude. For the imaginary part, we have less where we add two rows and two columns of zero-padding restrictions and most symmetric operators would work, to the top and left of the lattice before applying the con- but we found that the mean of circular quantities of the volution, as well as setting the parameters at the (3, 3) phases, as expressed in eq. B2, worked best in our exper- indices to be zero, all to ensure that a particle only de- iments. pends on particles not “ahead” of itself according to the row-wise ordering we enforce. To combined them, the two branches are connected using the following scheme: af- ter every convolutional layer in the “vertical” branch, but 3. Optimization before the convolutional layer of the “horizontal” branch, we take the intermediate result of the “vertical” branch, In our experiments we employ the following general op- shift every entry “down“ along the vertical axis of the timization strategy. We begin using the Adam [24] SGD lattice, and concatenate it along the channels axis of the variant, using a small batch of 100 samples for estimat- horizontal branch. The final convolutional layer of the ing the gradient and using a learning rate in the order 3 “horizontal” branch serves as the output of the network, of 10− . After about 10K gradient update steps, we in- and hence we only use 2 output channels in that final crease the batch size to 1000, using the same learning layer, the two coordinates serving as the real and imagi- rate and optimizer as the first stage, for an additional nary parts of the log-amplitude of the conditional wave- 10K update steps. In the final stage of the optimiza- functions. The complete network can be depicted as a tion, we increase the batch size again, and also switch to sequence of blocks, each as illustrated in fig.4. We typ- standard SGD with a momentum term, for an additional ically use between 10 and 40 such blocks, depending on 5K update steps. For each experiment, we test multiple the specific experiment. This separation of “vertical” and variations around the above default values of batch size, “horizontal” branches is a technique to overcome what is learning rate, and number of update steps in each stage, known as the “blind spot” problem of the original Pixel- and report the results for the best performing models. CNN architecture (see van den Oord et al. [18] for more [1] W. L. McMillan, “Ground State of Liquid He4,” Physical details). Review 138, A442–A451 (1965). 7

[2] Robert Jastrow, “Many-Body Problem with Strong Thomas S Huang, “Fast generation for convolutional au- Forces,” Physical Review 98, 1479–1484 (1955). toregressive models,” arXiv preprint arXiv:1704.06001 [3] Mark Fannes, Bruno Nachtergaele, and Reinhard F (2017). Werner, “Finitely correlated states on quantum spin [20] Giacomo Torlai, Guglielmo Mazzola, Juan Carrasquilla, chains,” Communications in mathematical physics 144, Matthias Troyer, Roger Melko, and Giuseppe Carleo, 443–490 (1992). “Neural-network quantum state tomography,” Nature [4] David Perez-Garc´ıa,Frank Verstraete, Michael M Wolf, Physics 14, 447 (2018). and J Ignacio Cirac, “Matrix product state representa- [21] Bjarni Jnsson, Bela Bauer, and Giuseppe Car- tions,” Quantum Information and Computation 7, 401– leo, “Neural-network states for the classical simula- 430 (2007). tion of quantum computing,” arXiv:1808.05232 [cond- [5] Frank Verstraete and J Ignacio Cirac, “Renormalization mat, physics:physics, physics:quant-ph] (2018), arXiv: algorithms for quantum-many body systems in two and 1808.05232. higher dimensions,” arXiv preprint cond-mat/0407066 [22] Henk W. J. Blte and Youjin Deng, “Cluster Monte Carlo (2004). simulation of the transverse Ising model,” Physical Re- [6] Guifr´eVidal, “Class of quantum many-body states that view E 66 (2002), 10.1103/PhysRevE.66.066110. can be efficiently simulated,” Physical review letters 101, [23] FlowKet: an open-source library based on Tensorflow for 110501 (2008). running Variational Monte-Carlo simulations on GPUs, [7] Giuseppe Carleo and Matthias Troyer, “Solving the https://github.com/HUJI-Deep/FlowKet. quantum many-body problem with artificial neural net- [24] Diederik Kingma and Jimmy Ba, “Adam: A method for works,” Science 355, 602–606 (2017). stochastic optimization,” 3rd International Conference [8] Dong-Ling Deng, Xiaopeng Li, and S Das Sarma, “Quan- on Learning Representations (ICLR) (2014). tum entanglement in neural network states,” Physical [25] Wen-Yuan Liu, Shao-Jun Dong, Yong-Jian Han, Guang- Review X 7, 021021 (2017). Can Guo, and Lixin He, “Gradient optimization of finite [9] Jing Chen, Song Cheng, Haidong Xie, Lei Wang, and projected entangled pair states,” Physical Review B 95, Tao Xiang, “Equivalence of restricted boltzmann ma- 195154 (2017). chines and tensor network states,” Phys. Rev. B 97, [26] L. He, H. An, C. Yang, F. Wang, J. Chen, C. Wang, 085104 (2018). W. Liang, S. Dong, Q. Sun, W. Han, W. Liu, Y. Han, [10] Ivan Glasser, Nicola Pancotti, Moritz August, Ivan D. and W. Yao, “Peps++: Towards extreme-scale simula- Rodriguez, and J. Ignacio Cirac, “Neural-network quan- tions of strongly correlated quantum many-particle mod- tum states, string-bond states, and chiral topological els on sunway taihulight,” IEEE Transactions on Parallel states,” Phys. Rev. X 8, 011006 (2018). and Distributed Systems 29, 2838–2848 (2018). [11] Raphael Kaubruegger, Lorenzo Pastori, and Jan Carl [27] Giuseppe Carleo and Matthias Troyer, “Solving the Budich, “Chiral topological phases from artificial neural quantum many-body problem with artificial neural net- networks,” Physical Review B 97, 195136 (2018). works,” Science 355, 602–606 (2017). [12] Hiroki Saito and Masaya Kato, “Machine Learning Tech- [28] Kenny Choo, Titus Neupert, and Giuseppe Carleo, nique to Find Quantum Many-Body Ground States of “Study of the two-dimensional frustrated j1-j2 model Bosons on a Lattice,” Journal of the Physical Society of with neural network quantum states,” arXiv preprint Japan 87, 014001 (2017). arXiv:1903.06713 (2019). [13] Kenny Choo, Giuseppe Carleo, Nicolas Regnault, and [29] B. Bauer, L. D. Carr, H. G. Evertz, A. Feiguin, Titus Neupert, “Symmetries and Many-Body Excitations J. Freire, S. Fuchs, L. Gamper, J. Gukelberger, E. Gull, with Neural-Network Quantum States,” Physical Review S. Guertler, A Hehn, R. Igarashi, S. V. Isakov, D. Koop, Letters 121, 167204 (2018). P. N. Ma, P. Mates, H. Matsuo, O. Parcollet, G. Pa- [14] Yoav Levine, Or Sharir, Nadav Cohen, and Amnon wowski, J. D. Picon, L Pollet, E. Santos, V. W. Scarola, Shashua, “Quantum entanglement in deep learning ar- U. Schollwck, C. Silva, B. Surer, S. Todo, S. Trebst, chitectures,” Physical review letters (2019). M. Troyer, M. L. Wall, P Werner, and S. Wessel, [15] Benigno Uria, Marc-Alexandre Cote, Karol Gregor, Iain “The ALPS project release 2.0: open source software for Murray, and Hugo Larochelle, “Neural Autoregressive strongly correlated systems,” Journal of Statistical Me- Distribution Estimation,” Journal of Machine Learning chanics: Theory and Experiment 2011, P05001 (2011). Research () 17, 1–37 (2016). [16] Dian Wu, Lei Wang, and Pan Zhang, “Solving Statistical Mechanics using Variational Autoregressive Networks,” (2018). [17] Juan Carrasquilla, Giacomo Torlai, Roger G Melko, and Leandro Aolita, “Reconstructing quantum states with generative models,” Nature Machine Intelligence 1, 155– 161 (2019). [18] Aaron van den Oord, Nal Kalchbrenner, Lasse Espe- holt, Oriol Vinyals, Alex Graves, et al., “Conditional im- age generation with pixelcnn decoders,” in Advances in Neural Information Processing Systems (2016) pp. 4790– 4798. [19] Prajit Ramachandran, Tom Le Paine, Pooya Khorrami, Mohammad Babaeizadeh, Shiyu Chang, Yang Zhang, Mark A Hasegawa-Johnson, Roy H Campbell, and Deep autoregressive models for the efficient variational simulation of many-body quantum systems – Supplementary material

1, 1, 1, 2, 1, Or Sharir, ∗ Yoav Levine, † Noam Wies, ‡ Giuseppe Carleo, § and Amnon Shashua ¶ 1The Hebrew University of Jerusalem, 9190401, Israel 2Center for Computational Quantum Physics, Flatiron Institute, 162 5th Avenue, New York, NY 10010, USA

Appendix A: Proof of Claim ?? interacting branches: (i) a “vertical” branch for repre- senting conditional dependencies between a given parti- The proof follows an induction argument. For N = 1, cle and all particles above it in the 2D lattice it resides on, and (ii) a “horizontal” branch for representing con- it holds that Ψ(s1) Ψ1(s1), and so Ψ is normalized ≡ ditional dependencies between a given particle and all because Ψ1 is normalized with respect to s1. Assume the claim holds for N = k, then for N = k + 1 we first define particles to its left. Each branch comprises a sequence k of convolutional layers. The “vertical” containing convo- Ψ(˜ s1, . . . , sk) Ψi(si si 1, . . . , s1), and so ≡ i=1 | − lutional layers with 3 3 filters and 32 channels, where × Q 2 we add two rows of zero-padding to the top lattice before Ψ(s , . . . , s ) | 1 k+1 | applying the convolution, to ensure a particle is not de- s ,...,s 1 Xk+1 pendent on rows below it. For the “horizontal” branch, k+1 2 we use convolutional layers with 3 3 filters and 32 chan- = Ψi(si si 1, . . . , s1) nels, where we add two rows and× two columns of zero- | | − | s ,...,s i=1 1 Xk+1 Y padding to the top and left of the lattice before apply- =1 ing the convolution, as well as setting the parameters at k ∗ 2 2 the (3, 3) indices to be zero, all to ensure that a particle = Ψi(si si 1, . . . , s1) Ψi(sk+1 sk, . . . , s1) | | − | | | | only depends on particles not “ahead” of itself accord- s ,...,s i=1 !s 1X k Y zXk+1 }| { ing to the row-wise ordering we enforce. To combined = Ψ(˜ s1, . . . , sk) ∗∗= 1, them, the two branches are connected using the follow-

s1,...,sk ing scheme: after every convolutional layer in the “ver- X tical” branch, but before the convolutional layer of the where ( ) is because Ψk+1 is a normalized conditional “horizontal” branch, we take the intermediate result of ∗ wave function, and ( ) because of the induction assump- the “vertical” branch, shift every entry “down“ along the ∗∗ tion.  vertical axis of the lattice, and concatenate it along the channels axis of the horizontal branch. The final con- volutional layer of the “horizontal” branch serves as the Appendix B: Technical Details output of the network, and hence we only use 2 output channels in that final layer, the two coordinates serving as In this section we cover the essential technical details the real and imaginary parts of the log-amplitude of the of our models and how they are optimized. conditional wave-functions. The complete network can be depicted as a sequence of blocks, each as illustrated in fig.1. We typically use between 10 and 40 such blocks, 1. Architecture depending on the specific experiment. This separation of “vertical” and “horizontal” branches is a technique to Our chosen architecture for our implementation of overcome what is known as the “blind spot” problem of Neural Autoregressive Quantum State is loosely inspired the original PixelCNN architecture (see ? ] for more de- by that of PixelCNN [? ], which uses a row-wise order- tails). ing of the particles for the conditional wave-functions. All parameters and operations in the network are real, where 2. Handling Symmetries arXiv:1902.04057v3 [cond-mat.dis-nn] 19 Jan 2020 the complex log-amplitudes of the conditional wave func- tions are represented as two real numbers, as discussed in the body. More specifically, it is composed of two While our general NAQS architecture can already rep- resent wave-functions quite well, we have found that leveraging the inherent symmetries of a given problem, e.g., invariance to rotations and flips, can dramatically improve the accuracy of our model. Specifically, we use ∗ [email protected][email protected] a self-ensemble scheme to symmetrize our model, where ‡ [email protected] for a model f(s) and a given input spin-configuration, § gcarleo@flatironinstitute.org we transform it according to its symmetries, denoted by ¶ [email protected] the set , run each of them through our model, and T 2

3 of 10− . After about 10K gradient update steps, we in- crease the batch size to 1000, using the same learning rate and optimizer as the first stage, for an additional 3x3 Conv with 10K update steps. In the final stage of the optimiza- zeroing at (3,3) tion, we increase the batch size again, and also switch to standard SGD with a momentum term, for an additional Vertical and 5K update steps. For each experiment, we test multiple Horizontal variations around the above default values of batch size, Zero-Padding learning rate, and number of update steps in each stage, and report the results for the best performing models.

3x3 Conv Down Shift Concat

Vertical Zero Padding

Vertical Branch Horizontal Branch

FIG. 1. An illustration of a single block that our architecture is composed of. aggregate the resulting log-amplitude outputs using the following equations for our symmetrize model Sym(f):

1 1 2 Re(f(T s)) Re(Sym(f)(s))= ln e · , (B1) 2 T ! X∈T |T | i Im(f(T s)) Im(Sym(f)(s))=Im ln e · . (B2) T !! X∈T It is important to emphasize that while there are many ways to symmetrize a model, we cannot use any aggre- gation operation – it must also preserve its probabilistic meaning for we to be able to sample from it efficiently. We propose to incorporate the possible symmetries into our generative model, assuming we first sample a trans- formation T from with equal probability 1/ , and then T draw a sample fromT our model as described in the main text, followed by transforming it with T . This translates to a mixture model over the squared magnitudes of the network’s predicted amplitude, and eq. B1 realizes it in log-space, where the real part of the output represent the log-magnitude. For the imaginary part, we have less restrictions and most symmetric operators would work, but we found that the mean of circular quantities of the phases, as expressed in eq. B2, worked best in our exper- iments.

3. Optimization

In our experiments we employ the following general op- timization strategy. We begin using the Adam [? ] SGD variant, using a small batch of 100 samples for estimat- ing the gradient and using a learning rate in the order