
Variational Neural Annealing

Mohamed Hibat-Allah,1, 2, 3, ∗ Estelle M. Inack,1 Roeland Wiersema,1, 2 Roger G. Melko,2, 3 and Juan Carrasquilla1, 2

1 Vector Institute, MaRS Centre, Toronto, Ontario, M5G 1M1, Canada
2 Department of Physics and Astronomy, University of Waterloo, Ontario, N2L 3G1, Canada
3 Perimeter Institute for Theoretical Physics, Waterloo, ON N2L 2Y5, Canada

(Dated: January 26, 2021)

Many important challenges in science and technology can be cast as optimization problems. When viewed in a statistical physics framework, these can be tackled by simulated annealing, where a gradual cooling procedure helps search for groundstate solutions of a target Hamiltonian. While powerful, simulated annealing is known to have prohibitively slow sampling dynamics when the optimization landscape is rough or glassy. Here we show that by generalizing the target distribution with a parameterized model, an analogous annealing framework based on the variational principle can be used to search for groundstate solutions. Modern autoregressive models such as recurrent neural networks provide ideal parameterizations since they can be exactly sampled without slow dynamics even when the model encodes a rough landscape. We implement this procedure in the classical and quantum settings on several prototypical spin glass Hamiltonians, and find that it significantly outperforms traditional simulated annealing in the asymptotic limit, illustrating the potential power of this yet unexplored route to optimization.

I. INTRODUCTION

A wide array of complex combinatorial optimization problems can be reformulated as finding the lowest energy configuration of an Ising Hamiltonian of the form [1]:

$$H_{\rm target} = -\sum_{i<j} J_{ij}\,\sigma_i\sigma_j - \sum_{i=1}^{N} h_i\,\sigma_i, \tag{1}$$

where the couplings $J_{ij}$ and fields $h_i$ specify the optimization problem, and its solutions correspond to spin configurations $\{\sigma_i\}$ that minimize $H_{\rm target}$. While the lowest energy states of certain families of Ising Hamiltonians can be found with modest computational resources, most of these problems are hard to solve and belong to the non-deterministic polynomial time (NP)-hard complexity class [2].

Figure 1. Schematic illustration of the space of distributions visited during simulated annealing. An arbitrarily slow SA visits a series of Boltzmann distributions starting at the high temperature (e.g. $T = \infty$) and ending in the $T = 0$ Boltzmann distribution (continuous yellow line), where a perfect solution to an optimization problem is reached. These solutions are found either at the edge or a corner (for non-degenerate problems) of the standard probabilistic simplex (colored triangle plane). A practical, finite-time SA trajectory (red dotted line), as well as a variational classical annealing trajectory (green dashed line), deviate from the trajectory of exact Boltzmann distributions.

Various heuristics have been used over the years to find approximate solutions to these NP-hard problems. A notable example is simulated annealing (SA) [3], which mirrors the analogous annealing process in materials science and metallurgy where a crystalline solid is heated and then slowly cooled down to its lowest energy and most structurally stable crystal arrangement. In addition to providing a fundamental connection between the thermodynamic behavior of real physical systems and complex optimization problems, simulated annealing has enabled scientific and technological advances with far-reaching implications in areas as diverse as operations research [4], artificial intelligence [5], biology [6], graph theory [7], power systems [8], quantum control [9], and circuit design [10], among many others [5]. The paradigm of annealing has been so successful that it has inspired intense research into its quantum extension, which requires quantum hardware to anneal the tunneling amplitude, and can be simulated in an analogous way to SA [11, 12].

The SA algorithm explores an optimization problem’s energy landscape via a gradual decrease in thermal fluctuations generated by the Metropolis-Hastings algorithm. The procedure stops when all thermal kinetics are removed from the system, at which point the solution to the optimization problem is expected to be found. While an exact solution to the optimization problem is always attained if the decrease in temperature is arbitrarily slow, a practical implementation of the algorithm must necessarily run on a finite time scale [13]. As a consequence, the annealing algorithm samples a series of effective, quasi-equilibrium distributions close but not exactly equal to the stationary Boltzmann distributions targeted during the annealing [14] (see Fig. 1 for a schematic illustration). This naturally leads to approximate solutions to the optimization problem, whose quality generally depends on the interplay between the problem complexity and the rate at which the temperature is decreased.

In this paper, we offer an alternative route to solving optimization problems of the form of Eq. (1), called variational neural annealing. Here, the conventional simulated annealing formulation is substituted with the annealing of a parameterized model. Namely, instead of annealing and approximately sampling the exact Boltzmann distribution, this approach anneals a quasi-equilibrium model, which must be sufficiently expressive and capable of tractable sampling. Fortunately, suitable models have recently been provided by machine learning technology [15–17]. In particular, neural autoregressive models combined with variational principles have been shown to accurately describe the equilibrium properties of classical and quantum systems [18–21]. Here, we implement variational neural annealing using autoregressive recurrent neural networks, and show that they offer a powerful alternative to conventional SA and its analogous quantum extension, i.e., simulated quantum annealing (SQA) [11]. This powerful and unexplored route to optimization is schematically illustrated in Fig. 1, where a variational neural annealing trajectory (dashed green arrow) is shown to provide a more accurate approximation to the ideal trajectory (continuous yellow line) than a conventional SA run (dotted red line).

∗ [email protected]

arXiv:2101.10154v1 [cond-mat.dis-nn] 25 Jan 2021

II. VARIATIONAL CLASSICAL AND QUANTUM ANNEALING

We first consider the variational approach to statistical mechanics [18, 22], where a distribution $p_\lambda(\sigma)$ defined by a set of variational parameters $\lambda$ is optimized to closely reproduce the equilibrium properties of a system at temperature $T$. Following the spirit of SA, we dub our first variational neural annealing algorithm variational classical annealing (VCA).

The VCA algorithm searches for the ground state of an optimization problem, encoded in a target Hamiltonian $H_{\rm target}$, by slowly annealing the model’s variational free energy

$$F_\lambda(t) = \langle H_{\rm target}\rangle_\lambda - T(t)\,S_{\rm classical}(p_\lambda), \tag{2}$$

from a high temperature to a low temperature. The quantity $F_\lambda(t)$ provides an upper bound to the true instantaneous free energy and can be used at each annealing stage to update $\lambda$ through gradient-descent techniques. The brackets $\langle\ldots\rangle_\lambda$ denote ensemble averages taken over the probability $p_\lambda(\sigma)$. The von Neumann entropy is given by

$$S_{\rm classical}(p_\lambda) = -\sum_{\sigma} p_\lambda(\sigma)\,\log\left(p_\lambda(\sigma)\right), \tag{3}$$

where the sum runs over all the elements of the state space $\{\sigma\}$. In our setting, the temperature is decreased from an initial value $T_0$ to 0 using a linear schedule function $T(t) = T_0(1 - t)$, where $t \in [0, 1]$, which follows closely the traditional implementation of SA.

In order for VCA to succeed, we require parameterized models that enable the estimation of entropy, Eq. (3), without incurring expensive calculations of the partition function. In addition, we anticipate that hard optimization problems will induce a complex energy landscape into the parameterized models and an ensuing slowdown of their sampling via Markov chain Monte Carlo. These issues preclude un-normalized models such as restricted Boltzmann machines, where sampling relies on Markov chains and whose partition function is intractable to evaluate [23]. Instead, we implement VCA using recurrent neural networks (RNNs) [20, 21], whose autoregressive nature enables statistical averages over exact samples $\sigma$ drawn from $p_\lambda(\sigma)$. Since RNNs are normalized by construction, these samples naturally allow the estimation of the entropy in Eq. (3). We provide a detailed description of the RNN in Methods Sec. V A.

The VCA algorithm, summarized in Fig. 2(a), performs a warm-up step which brings a randomly initialized distribution $p_\lambda(\sigma)$ to an approximate equilibrium state with free energy $F_\lambda(t = 0)$ via $N_{\rm warmup}$ gradient descent steps. At each step $t$, we reduce the temperature of the system from $T(t)$ to $T(t + \delta t)$ and apply $N_{\rm train}$ gradient descent steps to re-equilibrate the model. A critical ingredient to the success of VCA is that the variational parameters optimized at temperature $T(t)$ are reused at temperature $T(t + \delta t)$ to ensure that the model’s distribution is always near its instantaneous equilibrium state. Repeating the last two steps $N_{\rm annealing}$ times, we reach temperature $T(1) = 0$, which is the end of the annealing protocol. Here the distribution $p_\lambda(\sigma)$ is expected to assign high probability to configurations $\sigma$ that solve the optimization problem. Likewise, the residual entropy Eq. (3) at $T(1) = 0$ provides a heuristic approach to count the number of solutions to the problem Hamiltonian [18]. Further algorithmic details are provided in Methods Sec. V B.

Simulated annealing provides a powerful heuristic for the solution of hard optimization problems by harnessing thermal fluctuations. Inspired by the latter, the advent of commercially available quantum devices [24] has enabled the analogous concept of quantum annealing [25], where the solution to an optimization problem is performed by harnessing quantum fluctuations. In quantum annealing, the search for the ground state of Eq. (1) is performed at $T = 0$, by supplementing the target Hamiltonian with a quantum mechanical kinetic (or “driving”) term,

$$\hat{H}(t) = \hat{H}_{\rm target} + f(t)\,\hat{H}_D \tag{4}$$

Figure 2. Variational neural annealing protocols. (a) The variational classical annealing (VCA) algorithm steps. A warm-up step brings the initialized variational state (green dot) close to the minimum of the free energy (cyan dot) at a given value of the order parameter $M$. This step is followed by an annealing and a training step that brings the variational state back to the new free energy minimum. Repeating the last two steps until $T(t = 1) = 0$ (red dots) produces approximate solutions to $H_{\rm target}$ if the protocol is conducted slowly enough. This schematic illustration corresponds to annealing through a continuous phase transition with an order parameter $M$. (b) Variational quantum annealing (VQA). VQA includes a warm-up step, followed by an annealing and a training step, which brings the variational energy (green dot) closer to the new ground state energy (cyan dot). We loop over the previous two steps until reaching the target ground state of $\hat{H}_{\rm target}$ (red dot) if annealing is performed slowly enough.
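The VCA steps summarized in Fig. 2(a) (warm-up, temperature reduction, re-training with reused parameters) can be pictured with a short NumPy sketch. This is an illustration under stated assumptions, not the implementation used in this work: the RNN is replaced by a fully factorized Bernoulli model (the simplest normalized model that can be sampled exactly), the gradient of the free energy of Eq. (2) is estimated with a score-function estimator and a mean baseline, and the function names, learning rate, and sample counts are hypothetical choices.

```python
import numpy as np

def chain_energy(spins, J):
    """Energy -sum_i J_i s_i s_{i+1} of each row of +/-1 spins (open chain)."""
    return -np.sum(J * spins[:, :-1] * spins[:, 1:], axis=1)

def sample(theta, n_samples, rng):
    """Exact samples from p(s) = prod_i sigmoid(theta_i * s_i), with log p(s)."""
    p_up = np.exp(-np.logaddexp(0.0, -theta))        # sigmoid(theta), overflow-safe
    up = rng.random((n_samples, theta.size)) < p_up
    spins = np.where(up, 1.0, -1.0)
    log_p = -np.logaddexp(0.0, -theta * spins).sum(axis=1)
    return spins, log_p

def free_energy_and_grad(theta, J, T, n_samples, rng):
    """Monte Carlo estimate of F = <H> - T*S (Eqs. 2 and 3) and of dF/dtheta."""
    spins, log_p = sample(theta, n_samples, rng)
    f_loc = chain_energy(spins, J) + T * log_p       # its mean estimates <H> - T*S
    p_up = np.exp(-np.logaddexp(0.0, -theta))
    score = 0.5 * (spins + 1.0) - p_up               # d log p(s) / d theta
    grad = np.mean((f_loc - f_loc.mean())[:, None] * score, axis=0)
    return f_loc.mean(), grad

def vca(J, T0=1.0, n_warmup=100, n_annealing=200, n_train=5,
        n_samples=100, lr=0.05, seed=0):
    """Toy VCA loop: warm-up at T0, then anneal with T(t) = T0 * (1 - t)."""
    rng = np.random.default_rng(seed)
    theta = 0.01 * rng.standard_normal(J.size + 1)   # near-uniform initial model
    for _ in range(n_warmup):                        # warm-up at T(0) = T0
        _, grad = free_energy_and_grad(theta, J, T0, n_samples, rng)
        theta -= lr * grad
    for k in range(1, n_annealing + 1):
        T = T0 * (1.0 - k / n_annealing)             # reduce the temperature
        for _ in range(n_train):                     # re-equilibrate, reusing theta
            _, grad = free_energy_and_grad(theta, J, T, n_samples, rng)
            theta -= lr * grad
    spins, _ = sample(theta, n_samples, rng)         # exact samples at T(1) = 0
    return theta, chain_energy(spins, J).mean()
```

Calling, for example, `vca(np.random.default_rng(1).random(15))` returns the final parameters and the sample-averaged energy of a 16-spin chain with couplings drawn from [0, 1). Because every sample is drawn independently from a normalized model, the entropy term needs no partition function, which is the property the text requires of the variational model. A factorized model cannot represent correlations between spins, so it should not be expected to reproduce the accuracy of the RNN.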

Here, $H_{\rm target}$ in Eq. (1) is promoted to a quantum mechanical Hamiltonian $\hat{H}_{\rm target}$. Quantum annealing algorithms typically start with a dominant driving term $\hat{H}_D \gg \hat{H}_{\rm target}$ chosen so that the ground state of $\hat{H}(0)$ is easy to prepare. When the strength of the driving term is subsequently reduced (typically adiabatically) using a schedule function $f(t)$, the system is annealed to the ground state of $\hat{H}_{\rm target}$. In analogy to its thermal counterpart, SQA emulates this process on classical computers using quantum Monte Carlo methods [11].

Here, we leverage the variational principle of quantum mechanics and devise a strategy that emulates quantum annealing variationally. We dub our second variational neural annealing algorithm variational quantum annealing (VQA). The latter is based on the variational Monte Carlo (VMC) algorithm, whose goal is to simulate the equilibrium properties of quantum systems at zero temperature (see Methods Sec. V C). In VMC, the ground state of a Hamiltonian $\hat{H}$ is modeled through an ansatz $|\Psi_\lambda\rangle$ endowed with parameters $\lambda$. The variational principle guarantees that the energy $\langle\Psi_\lambda|\hat{H}|\Psi_\lambda\rangle$ is an upper bound to the ground state energy of $\hat{H}$, which we use to define a time-dependent objective function $E(\lambda, t) \equiv \langle\hat{H}(t)\rangle_\lambda = \langle\Psi_\lambda|\hat{H}(t)|\Psi_\lambda\rangle$ to optimize the parameters $\lambda$.

The VQA setup, graphically summarized in Fig. 2(b), applies $N_{\rm warmup}$ gradient descent steps to minimize $E(\lambda, t = 0)$, which brings $|\Psi_\lambda\rangle$ close to the ground state of $\hat{H}(0)$. Setting $t = \delta t$ while keeping the parameters $\lambda_0$ fixed results in a variational energy $E(\lambda_0, t = \delta t)$. A set of $N_{\rm train}$ gradient descent steps brings the ansatz closer to the new instantaneous ground state, which results in a variational energy $E(\lambda_1, t = \delta t)$. The variational parameters optimized at time step $t$ are reused at time $t + \delta t$, which promotes the computational adiabaticity of the protocol (see Appendix A). We repeat the annealing and training steps $N_{\rm annealing}$ times on a linear schedule ($f(t) = 1 - t$ with $t \in [0, 1]$) until $t = 1$, at which point the system should solve the optimization problem (red dot in Fig. 2(b)). We note that in our simulations, no training steps are taken at $t = 1$. Finally, similarly to VCA, we choose normalized RNN wave functions [20, 21] as ansätze, giving the VQA algorithm access to exact Monte Carlo samples.

To gain theoretical insight on the principles behind a successful VQA simulation, we derive a variational version of the adiabatic theorem [26]. Starting from a set of assumptions, such as the convexity of the energy landscape in the warm-up phase and close to convergence during annealing, as well as the absence of noise in the energy gradients, we provide a bound on the total number of gradient descent steps $N_{\rm steps}$ that guarantees the adiabaticity of the VQA algorithm as well as a success probability of solving the optimization problem $P_{\rm success} > 1 - \epsilon$. Here, $\epsilon$ is an upper bound on the overlap between the variational wave function and the excited states of the Hamiltonian $\hat{H}(t)$, i.e., $|\langle\Psi_\perp(t)|\Psi_\lambda\rangle|^2 < \epsilon$. We show that $N_{\rm steps}$ can be bounded as (see Appendix B):

$$\mathcal{O}\!\left(\frac{{\rm poly}(N)}{\epsilon\,\min_{\{t_n\}}\left(g(t_n)\right)}\right) \leq N_{\rm steps} \leq \mathcal{O}\!\left(\frac{{\rm poly}(N)}{\epsilon^2\,\min_{\{t_n\}}\left(g(t_n)\right)^2}\right). \tag{5}$$

The function $g(t)$ is the energy gap between the first excited state and the ground state of the instantaneous Hamiltonian $\hat{H}(t)$, $N$ is the system size, and the set of times $\{t_n\}$ is defined in Appendix B. As expected for hard optimization problems, the minimum gap typically decreases exponentially with system size $N$, which dominates the computational complexity of a VQA simulation. In cases where the minimum gap scales as the inverse of a polynomial in $N$, the number of steps $N_{\rm steps}$ is also polynomial in $N$.

[Figure 3 plot: $\epsilon_{\rm res}/N$ versus $N_{\rm annealing}$ on log-log axes. Legend fits, panel (a): VQA $\propto 1/t^{0.99\pm0.01}$ ($N = 32$), $1/t^{1.02\pm0.02}$ ($N = 64$), $1/t^{1.08\pm0.06}$ ($N = 128$); VCA $\propto 1/t^{1.53\pm0.01}$ ($N = 32$), $1/t^{1.66\pm0.02}$ ($N = 64$), $1/t^{1.85\pm0.04}$ ($N = 128$). Panel (b): VQA $\propto 1/t^{0.96\pm0.03}$ ($N = 32$), $1/t^{1.01\pm0.05}$ ($N = 64$), $1/t^{1.05\pm0.04}$ ($N = 128$); VCA $\propto 1/t^{1.32\pm0.05}$ ($N = 32$), $1/t^{1.28\pm0.05}$ ($N = 64$), $1/t^{1.51\pm0.06}$ ($N = 128$).]

Figure 3. Variational neural annealing on a random Ising chain. Here we represent the residual energy per site $\epsilon_{\rm res}/N$ vs the number of annealing steps $N_{\rm annealing}$ for both VQA and VCA. The system sizes are $N = 32, 64, 128$. We use random positive couplings $J_{i,i+1} \in [0, 1)$ (see text for more details). The error bars represent the one s.d. statistical uncertainty calculated over different disorder realizations [28].

III. RESULTS

A. Annealing on random Ising chains

We now proceed to evaluate the power of VCA and VQA. As a first benchmark, we consider the task of solving for the ground state of the one-dimensional (1D) Ising Hamiltonian with random couplings $J_{i,i+1}$,

$$H_{\rm target} = -\sum_{i=1}^{N-1} J_{i,i+1}\,\sigma_i\sigma_{i+1}. \tag{6}$$

First, we examine $J_{i,i+1}$ sampled from a uniform distribution in the interval $[0, 1)$. Here, the ground state configuration is given either by all spins up or down, and the ground state energy is known exactly, i.e., $E_G = -\sum_{i=1}^{N-1} J_{i,i+1}$ [27].

We use a tensorized RNN ansatz without weight sharing for both VCA and VQA (see Methods Sec. V A). We consider system sizes $N = 32, 64, 128$ and $N_{\rm train} = 5$, which suffices to achieve accurate solutions. For VQA, we use a one-body driving term $\hat{H}_D = -\Gamma_0 \sum_{i=1}^{N} \hat{\sigma}^x_i$, where $\hat{\sigma}^{x,y,z}_i$ are Pauli matrices acting on site $i$. To quantify the performance of the algorithms, we use the residual energy [11],

$$\epsilon_{\rm res} = \left\langle \langle H_{\rm target}\rangle_{\rm av} - E_G \right\rangle_{\rm dis}, \tag{7}$$

where $E_G$ is the exact ground state energy of $H_{\rm target}$. We use the arithmetic mean for statistical averages $\langle\ldots\rangle_{\rm av}$ over samples from the models. For VCA it means that $\langle H_{\rm target}\rangle_{\rm av} \approx \langle H_{\rm target}\rangle_\lambda$, while for VQA the target Hamiltonian is promoted to $\hat{H}_{\rm target} = -\sum_{i=1}^{N-1} J_{i,i+1}\,\hat{\sigma}^z_i\hat{\sigma}^z_{i+1}$ and $\langle H_{\rm target}\rangle_{\rm av} \approx \langle\hat{H}_{\rm target}\rangle_\lambda$. We consider the typical (geometric) mean for averaging over instances of the target Hamiltonian, i.e., $\langle\ldots\rangle_{\rm dis} = \exp\left(\langle\ln(\ldots)\rangle_{\rm av}\right)$. The average in the argument of the exponential stands for the arithmetic mean over different realizations of the couplings. We take advantage of the autoregressive nature of the RNN and sample $10^6$ configurations at the end of the annealing, which allows us to accurately estimate the model’s arithmetic mean. The typical mean is taken over 25 instances of $H_{\rm target}$.

In Fig. 3 we report the residual energies per site against the number of annealing steps $N_{\rm annealing}$. As expected, the residual energy is a decreasing function of $N_{\rm annealing}$, which underlines the importance of adiabaticity and annealing in our setting. In our examples, we observe that the decrease of the residual energy of VCA and VQA is consistent with a power-law decay for a large number of annealing steps. Whereas VCA’s decay exponent is in the interval 1.5–1.9, the VQA exponent is about 0.9–1.1. These exponents suggest an asymptotic speed-up compared to SA and coherent quantum annealing, where the residual energies follow a logarithmic law [29]. Contrary to the observations in Ref. [29] where quantum annealing was found superior to SA, VCA finds an average residual energy an order of magnitude more accurate than VQA for a large number of annealing steps.

Finally, we note that the exponents provided above are not expected to be universal and are a priori sensitive to the hyperparameters of the algorithms, e.g., learning rate, model choice, number of training steps, optimizer, etc. Appendix C provides a summary of the hyperparameters used in our work. Additional illustrations of the adiabaticity of VCA and VQA, as well as of the annealing results for a chain with $J_{i,i+1}$ uniformly sampled from the discrete set $\{-1, +1\}$, are provided in Appendix A.

B. Edwards-Anderson model

We now consider the two-dimensional (2D) Edwards-Anderson (EA) model, which is a prototypical spin glass arranged on a square lattice with nearest neighbor random interactions. The problem of finding ground states of the model has been studied experimentally [12] and numerically [11] from the annealing perspective, as well as theoretically [2] from the computational complexity perspective. The EA model with open boundary conditions is given by

$$H_{\rm target} = -\sum_{\langle i,j\rangle} J_{ij}\,\sigma_i\sigma_j, \tag{8}$$

where $\langle i,j\rangle$ denote nearest neighbors. The couplings $J_{ij}$ are drawn from a uniform distribution in the interval $[-1, 1)$. In the absence of a longitudinal field, for which solving the EA model is NP-hard, the ground state can be found in polynomial time [2]. To find the exact ground state of each random realization, we use the spin-glass server [30].

We use a 2D tensorized RNN ansatz without weight sharing for the variational protocols (see Methods Sec. V A). For VQA, we use a one-body driving term $\hat{H}_D = -\Gamma_0 \sum_{i=1}^{N} \hat{\sigma}^x_i$. Fig. 4(a) shows the annealing results obtained on a system size $N = 10 \times 10$ spins. VCA outperforms VQA and in the adiabatic, long-time annealing regime, it produces solutions three orders of magnitude more accurate on average than VQA. In addition, we investigate the performance of VQA supplemented with a fictitious Shannon information entropy [21] term that mimics thermal relaxation effects observed in quantum annealing hardware [31]. This form of regularized VQA, here labelled RVQA, is described by a pseudo free energy cost function $\tilde{F}_\lambda(t) = \langle\hat{H}(t)\rangle_\lambda - T(t)\,S_{\rm classical}(|\Psi_\lambda|^2)$. As in VCA, the pseudo entropy term $S_{\rm classical}(|\Psi_\lambda|^2)$ at $f(1) = 0$ provides a heuristic approach to count the number of solutions to $H_{\rm target}$ for VQA and RVQA. The results in Fig. 4(a) do show an amelioration of the VQA performance, including changing a saturating dynamics at large $N_{\rm annealing}$ to a power-law like behavior. However, it appears to be insufficient to compete with the VCA scaling (see exponents in Fig. 4(a)). This observation suggests the superiority of a thermally driven variational emulation of annealing over a purely quantum one for this example.

To further scrutinize the relevance of the annealing effects in VCA, we also consider VCA with zero thermal fluctuations, i.e., setting $T_0 = 0$. Because of its intimate relation to the classical-quantum optimization (CQO) methods of Refs. [32–34], we refer to this setting as CQO. Fig. 4(a) shows that CQO takes about $10^3$ training steps to reach accuracies nearing 1%. The accuracy does not further improve upon additional training up to $10^5$ gradient steps, which indicates that CQO is prone to getting stuck in local minima. In comparison, VCA and VQA offer solutions orders of magnitude more accurate on average for a large number of annealing steps, highlighting the importance of annealing in tackling optimization problems.

[Figure 4 plot: $\epsilon_{\rm res}/N$ versus $N_{\rm annealing}$ on log-log axes, with a second horizontal axis $N_{\rm steps}$ for CQO in panel (a). Legend, panel (a): CQO, VQA, RVQA $\propto 1/t^{1.2\pm0.2}$, VCA $\propto 1/t^{2.0\pm0.2}$. Legend, panel (b): SA, SQA, VCA.]

Figure 4. Benchmarking the two-dimensional Edwards-Anderson spin glass. (a) A comparison between VCA, VQA, RVQA, and CQO on a $10 \times 10$ lattice by plotting the residual energy per site vs $N_{\rm annealing}$. For CQO, we report the residual energy per site vs the number of optimization steps $N_{\rm steps}$. (b) Comparison between SA, SQA with $P = 20$ trotter slices, and VCA using a 2D tensorized RNN ansatz on a $40 \times 40$ lattice. The annealing speed is the same for SA, SQA and VCA.

Since VCA displays the best performance in the previous benchmarks, we use it to demonstrate its capabilities on a $40 \times 40$ spin system. For comparison, we use SA as well as SQA. The SQA simulation uses the path-integral Monte Carlo method [11] with $P = 20$ trotter slices, and we report averages over energies across all trotter slices, for each realization of randomness (see Methods Sec. V D). In addition, we average the energy obtained after 25 annealing runs on every instance of randomness for SA and SQA. To average over Hamiltonian instances, we use the typical mean over 25 different realizations for the three annealing methods. The results are shown in Fig. 4(b), where we present the residual energies per site against the number of annealing steps $N_{\rm annealing}$, which is set so that the speed of annealing is the same for SA, SQA and VCA. We first note that our results confirm the qualitative behavior of SA and SQA in Refs. [55, 68]. While at short annealing times SA and SQA produce lower residual energy solutions than VCA, we observe that VCA achieves residual energies for large annealing time about three orders of magnitude smaller than SQA and SA. Notably, the rate at which the residual energy improves with increasing the annealing time is significantly higher in VCA than SQA and SA even at relatively short annealing time. These observations highlight the advantages of solving hard optimization problems in a variational space compared to SA and SQA paradigms.

C. Fully-connected spin glasses

We now focus our attention on fully-connected spin glasses [2, 81]. We first focus on the Sherrington-Kirkpatrick (SK) model [82], which provides a conceptual framework for the understanding of the role of disorder and frustration in widely diverse systems ranging from materials to combinatorial optimization and machine learning. The combined effect of disorder and long-range interactions in the SK model results in an energy landscape characterized by a hierarchy of valleys with a number of local minima growing exponentially in the system size [81]. Together with the fact that many combinatorial NP-hard problems can be thought of as the task of [...] methods [5]. The SK Hamiltonian $\hat{H}_{\rm SK}$ is given by

$$\hat{H}_{\rm SK} = -\frac{1}{2}\sum_{i\neq j} \frac{J_{ij}}{\sqrt{N}}\,\hat{\sigma}^z_i\hat{\sigma}^z_j, \tag{10}$$

where $\{J_{ij}\}$ is a symmetric matrix such that each matrix element $J_{ij}$ is sampled from a gaussian distribution with mean 0 and variance 1.

Since VCA performed best in our previous examples, we use it to find ground states of the SK model for $N = 100$ spins. Here, exact ground state energies of the SK model are calculated using the spin-glass server [77] on a total of 25 instances of disorder. To account for long-distance dependencies between spins in the SK model, we use a dilated RNN that has $\lceil\log_2(N)\rceil = 7$ layers (see Methods Sec. V B) and we start the annealing at an initial temperature $T_0 = 2$. We compare our results with SA and SQA. For SQA, we start with an initial magnetic field $\Gamma_0 = 2$, while for SA we use $T_0 = 2$.

To effectively compare the three methods (i.e., SA, SQA, and VCA), we first plot the residual energy per site as a function of $N_{\rm annealing}$ for VCA, SA and SQA (with $P = 100$ trotter slices). Here, the SA and SQA residual energies are obtained by averaging the outcome of 50 independent annealing runs, while for VCA we average the outcome of $10^6$ exact samples from the annealed RNN. For all methods, we take the typical average over 25 disorder instances. The results are shown in Fig. 5(a). As observed in the EA model in Fig. 4, we note that for fast annealing runs SA and SQA produce lower residual energy solutions than VCA, but we emphasize that VCA delivers a lower residual energy compared to SQA and SA as the total annealing time increases past $N_{\rm annealing} \sim 10^3$. Likewise, we observe that the rate at which the residual energy improves with increasing the total annealing time is significantly higher in VCA than SQA and SA.

A more detailed look at the statistical behaviour of the methods at long annealing times can be obtained from the residual energy histograms separately produced by each method, as shown in Fig. 5(e). For each instance $\{J_{ij}\}$ after the end of annealing, we represent the obtained residual energies in a histogram form. For the three methods, we extract $10^3$ residual energies for each disorder realization. Here, we observe that VCA is superior to SA and SQA, as it produces a higher density of low residual energies. This indicates that, even though VCA typically takes more annealing steps, it ultimately results in a higher chance of getting more accurate solutions to optimization problems than its SA and SQA counterparts.

We now focus on the Wishart planted ensemble (WPE), which is a class of zero-field Ising models with a first-order phase transition and tunable algorithmic hardness [83].
energies per site against the number of annealing steps $N_{\text{annealing}}$, which is set so that the speed of annealing is the same for SA, SQA and VCA. We first note that our results confirm the qualitative behavior of SA and SQA in Refs. [11, 35]. While SA and SQA produce lower residual energy solutions than VCA for small $N_{\text{annealing}}$, we observe that VCA achieves residual energies about three orders of magnitude smaller than SQA and SA for a large number of annealing steps. Notably, the rate at which the residual energy improves with increasing $N_{\text{annealing}}$ is significantly higher for VCA compared to SQA and SA even at a relatively small number of annealing steps.

C. Fully-connected spin glasses

We now focus our attention on fully-connected spin glasses [2, 36]. We first focus on the Sherrington-Kirkpatrick (SK) model [37], which provides a conceptual framework for the understanding of the role of disorder and frustration in widely diverse systems ranging from materials to combinatorial optimization and machine learning. The SK Hamiltonian is given by

$$H_{\text{target}} = -\frac{1}{2} \sum_{i \neq j} \frac{J_{ij}}{\sqrt{N}}\, \sigma_i \sigma_j, \tag{9}$$

where $\{J_{ij}\}$ is a symmetric matrix such that each matrix element $J_{ij}$ is sampled from a Gaussian distribution with mean 0 and variance 1.

Since VCA performed best in our previous examples, we use it to find ground states of the SK model for $N = 100$ spins. Here, exact ground-state energies of the SK model are calculated using the spin-glass server [30] on a total of 25 instances of disorder. To account for long-distance dependencies between spins in the SK model, we use a dilated RNN ansatz that has $\lceil \log_2(N) \rceil = 7$ layers (see Methods Sec. V A) and set the initial temperature $T_0 = 2$. We compare our results with SA and SQA. For SQA, we start with an initial magnetic field $\Gamma_0 = 2$, while for SA we use $T_0 = 2$.

For an effective comparison, we first plot the residual energy per site as a function of $N_{\text{annealing}}$ for VCA, SA and SQA (with $P = 100$ trotter slices). Here, the SA and SQA residual energies are obtained by averaging the outcome of 50 independent annealing runs, while for VCA we average the outcome of $10^6$ exact samples from the annealed RNN. For all methods, we take the typical average over 25 disorder instances. The results are shown in Fig. 5(a). As observed in the EA model, we note that SA and SQA produce lower residual energy solutions than VCA for small $N_{\text{annealing}}$, but we emphasize that VCA delivers a lower residual energy compared to SQA and SA as the total number of annealing steps increases past $N_{\text{annealing}} \sim 10^3$. Likewise, we observe that the rate at which the residual energy improves with increasing $N_{\text{annealing}}$ is significantly higher for VCA in comparison to SQA and SA.

A more detailed look at the statistical behaviour of the methods at large $N_{\text{annealing}}$ can be obtained from the residual energy histograms separately produced by each method, as shown in Fig. 5(d). The histograms contain 1000 residual energies for each of the same 25 disorder realizations. For each instance, we plot results for 1000 SA runs, 1000 samples obtained from the RNN at the end of annealing for VCA, and 10 SQA runs including the contribution from each of the $P = 100$ Trotter slices. We observe that VCA is superior to SA and SQA, as it produces a higher density of low energy configurations. This indicates that, even though VCA typically takes more annealing steps, it ultimately results in a higher chance of getting more accurate solutions to optimization problems than SA and SQA. Note that for the SK model, the SQA histogram remains quantitatively the same for 200 runs, and we report data of 10 runs only for fairness purposes compared to both SA and VCA.

We now focus on the Wishart planted ensemble (WPE), which is a class of zero-field Ising models with a first-order phase transition and tunable algorithmic hardness [38]. These problems belong to a special class of hard problem ensembles whose solutions are known a priori, which, together with the tunability of the hardness, makes the WPE model an ideal tool to benchmark heuristic algorithms for optimization problems. The Hamiltonian of the WPE model is defined as

$$H_{\text{target}} = -\frac{1}{2} \sum_{i \neq j} J^{\alpha}_{ij}\, \sigma_i \sigma_j. \tag{10}$$

Here $J^{\alpha}_{ij}$ is a symmetric matrix satisfying

$$J^{\alpha} = \tilde{J}^{\alpha} - \mathrm{diag}(\tilde{J})$$

and

$$\tilde{J}^{\alpha} = -\frac{1}{N} W_{\alpha} W_{\alpha}^{T}.$$

The term $W_{\alpha}$ is an $N \times \lfloor \alpha N \rfloor$ random matrix satisfying $W_{\alpha} t_{\text{ferro}} = 0$ where $t_{\text{ferro}} = (+1, +1, \ldots, +1)$ is the ferromagnetic state (see Ref. [38] for details about the generation of $W_{\alpha}$). The ground state of the WPE model is known (i.e., it is planted) and corresponds to the ferromagnetic states $\pm t_{\text{ferro}}$. Interestingly, $\alpha$ is a tunable parameter of hardness, where for $\alpha < 1$ this model displays a first-order transition, such that near zero temperature the paramagnetic states are meta-stable solutions [38]. This feature makes this model hard to solve with any annealing method, as the paramagnetic states are numerous compared to the two ferromagnetic states and hence act as a trap for a typical annealing method. We benchmark the three methods (SA, SQA and VCA) for $N = 32$ and $\alpha \in \{0.25, 0.5\}$.

We consider 25 instances of the couplings $\{J^{\alpha}_{ij}\}$ and attempt to solve the model with VCA implemented using a dilated RNN ansatz with $\lceil \log_2(N) \rceil = 5$ layers and an initial temperature $T_0 = 1$. For SQA ($P = 100$ trotter

Figure 5. Benchmarking SA, SQA ($P = 100$ trotter slices) and VCA on the Sherrington-Kirkpatrick (SK) model and the Wishart planted ensemble (WPE). Panels (a), (b), and (c) display the residual energy per site as a function of $N_{\text{annealing}}$. (a) The SK model with $N = 100$ spins. (b) WPE with $N = 32$ spins and $\alpha = 0.5$. (c) WPE with $N = 32$ spins and $\alpha = 0.25$. Panels (d), (e) and (f) display the residual energy histogram for each of the different techniques and models in panels (a), (b), and (c), respectively. The histograms use 25000 data points for each method. Note that we choose a minimum threshold of $10^{-10}$ for $\epsilon_{\text{res}}/N$, which is within our numerical accuracy.

slices), we use an initial magnetic field $\Gamma_0 = 1$, and for SA we start with $T_0 = 1$.

We first plot the scaling of residual energies per site $\epsilon_{\text{res}}/N$ as shown in Figs. 5(b) and (c). Here we note that VCA is superior to SA and SQA for $\alpha = 0.5$ as demonstrated in Fig. 5(b). More specifically, VCA is about three orders of magnitude more accurate than SQA and SA for a large number of annealing steps. In the case of $\alpha = 0.25$ in Fig. 5(c), VCA is competitive, achieving a performance similar to SA and SQA on average for a large number of annealing steps. We also represent the residual energies in a histogram form. We observe that for $\alpha = 0.5$ in Fig. 5(e), VCA achieves a higher density toward low residual energies $\epsilon_{\text{res}}/N \sim 10^{-9}$ to $10^{-10}$ compared to SA and SQA. For $\alpha = 0.25$ in Fig. 5(f), VCA leads to a non-negligible density at very low residual energies as opposed to SA and SQA, whose solutions display residual energies orders of magnitude higher. Finally, our WPE simulations support the observation that VCA tends to improve the quality of solutions faster than SQA and SA for a large number of annealing steps.

IV. CONCLUSIONS AND OUTLOOK

In conclusion, we have introduced a strategy to combat the slow sampling dynamics encountered by simulated annealing when an optimization landscape is rough or glassy. Based on annealing the variational parameters of a generalized target distribution, our scheme, which we dub variational neural annealing, takes advantage of the power of modern autoregressive models, which can be exactly sampled without slow dynamics even when a rough landscape is encountered. We implement variational neural annealing parameterized by a recurrent neural network, and compare its performance to conventional simulated annealing on prototypical spin glass Hamiltonians known to have landscapes of varying roughness. We find that variational neural annealing produces accurate solutions to all of the optimization problems considered, including spin glass Hamiltonians where our techniques typically reach solutions orders of magnitude more accurate on average than conventional simulated annealing in the limit of a large number of annealing steps.

We emphasize that several hyperparameters, model, hardware, and variational objective function choices can be explored and may improve our methodologies. We have utilized a simple annealing schedule in our protocols and highlight that can be used to improve it [39]. A critical insight gleaned from our experiments is that certain neural network architectures were more efficient on specific Hamiltonians. Thus, a natural direction is to study the intimate relation between the model architecture and the problem Hamiltonian, where we envision that symmetries and domain knowledge would guide the design of models and algorithms.

As we witness the unfolding of a new age for optimization powered by deep learning [40], we anticipate a rapid adoption of machine learning techniques in the space of combinatorial optimization, as well as anticipate domain-specific applications of our ideas in diverse

A. Ansätze

Recurrent neural networks model complex probability distributions $p$ by taking advantage of the chain rule

$$p(\boldsymbol{\sigma}) = p(\sigma_1)\, p(\sigma_2 | \sigma_1) \cdots p(\sigma_N | \sigma_{N-1}, \ldots, \sigma_2, \sigma_1), \tag{11}$$

where specifying every conditional probability p(σ_i | σ

The joint probability distribution $p_\lambda(\boldsymbol{\sigma})$ is given by

p_λ(σ) = p_λ(σ_1) p_λ(σ_2 | σ_1) ··· p_λ(σ_N | σ

For disordered systems, it is natural to forgo the common practice of weight sharing [41] of $W$, $U$, $b$ and $c$ in Eqs. (12), (13) and use an extended set of site-dependent variational parameters $\lambda$ comprised of $\{W_n\}_{n=1}^{N}$ and $\{U_n\}_{n=1}^{N}$ and biases $\{b_n\}_{n=1}^{N}$, $\{c_n\}_{n=1}^{N}$. The recursion relation and the Softmax layer are modified to

$$h_n = F(W_n [h_{n-1}; \sigma_{n-1}] + b_n), \tag{15}$$

and

p_λ(σ_n | σ

$$h_n = F\left(\sigma_{n-1}^{\intercal} T_n h_{n-1} + b_n\right), \tag{17}$$

where $\sigma^{\intercal}$ is the transpose of $\sigma$, and the variational parameters $\lambda$ are $\{T_n\}_{n=1}^{N}$, $\{U_n\}_{n=1}^{N}$, $\{b_n\}_{n=1}^{N}$ and $\{c_n\}_{n=1}^{N}$. This form of tensorized RNN increases the expressiveness of our ansatz as illustrated in Appendix D.

For two-dimensional systems, we make use of a two-dimensional extension of the recursion relation in vanilla

$$h_n^{(l)} = F\left(W_n^{(l)} \left[h^{(l)}_{\max(0,\, n - 2^{l-1})};\, h_n^{(l-1)}\right] + b_n^{(l)}\right).$$

of spins and $\lceil \ldots \rceil$ is the ceiling function. This means that two spins are connected with a path whose length is bounded by $\mathcal{O}(\log_2(N))$, which follows the spirit of the multi-scale renormalization ansatz [46]. For more details on the advantage of dilated RNNs over tensorized RNNs see Appendix D.

We finally note that for all the RNN architectures in our work, we found accurate results using the exponential linear unit (ELU) activation function, defined as:

$$\mathrm{ELU}(x) = \begin{cases} x, & \text{if } x \geq 0, \\ \exp(x) - 1, & \text{if } x < 0. \end{cases}$$
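As a rough illustration of the dilated recursion and the ELU activation above, the following NumPy sketch computes the hidden states of a dilated RNN for a single spin configuration. It is only a sketch under stated assumptions: the function names, the hidden size, the random weights, the use of one weight matrix per layer shared across sites, and the choice of the one-hot encoding of the previous spin as layer-0 input are all illustrative and not taken from the paper's implementation.

```python
import math
import numpy as np

def elu(x):
    """ELU(x) = x for x >= 0 and exp(x) - 1 for x < 0."""
    x = np.asarray(x, dtype=float)
    # np.minimum keeps expm1 from overflowing on large positive inputs
    return np.where(x >= 0, x, np.expm1(np.minimum(x, 0.0)))

def dilated_hidden_states(inputs, weights, biases):
    """Top-layer hidden states of a dilated RNN (hypothetical helper).

    inputs     : (N, d_in) array, row n-1 is the layer-0 input at site n
                 (assumed: one-hot encoding of the previous spin).
    weights[l] : matrix of layer l+1, shared across sites in this sketch.
    biases[l]  : bias vector of layer l+1.
    """
    n_sites = inputs.shape[0]
    below = inputs
    for layer, (W, b) in enumerate(zip(weights, biases), start=1):
        dilation = 2 ** (layer - 1)
        h = np.zeros((n_sites + 1, b.size))  # h[0] is the null initial state
        for n in range(1, n_sites + 1):
            skip = h[max(0, n - dilation)]   # h^{(l)}_{max(0, n - 2^{l-1})}
            h[n] = elu(W @ np.concatenate([skip, below[n - 1]]) + b)
        below = h[1:]
    return below

# Illustrative sizes only
n_sites, d_in, d_h = 8, 2, 4
n_layers = math.ceil(math.log2(n_sites))  # depth L = ceil(log2(N))
rng = np.random.default_rng(0)
spins = rng.integers(0, 2, size=n_sites)
inputs = np.zeros((n_sites, d_in))
inputs[1:] = np.eye(d_in)[spins[:-1]]     # first input is a null vector
weights = [rng.normal(scale=0.1, size=(d_h, d_h + (d_in if l == 0 else d_h)))
           for l in range(n_layers)]
biases = [np.zeros(d_h) for _ in range(n_layers)]
top = dilated_hidden_states(inputs, weights, biases)
```

With the depth rule $L = \lceil \log_2(N) \rceil$, this gives 7 layers for $N = 100$ and 5 layers for $N = 32$, matching the values quoted for the SK and WPE benchmarks.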

Figure 6. (a) An illustration of a 1D RNN: at each site $n$, the RNN cell, denoted by the green box, receives a hidden state $h_{n-1}$ and the one-hot spin vector $\sigma_{n-1}$, to generate a new hidden state $h_n$ that is fed into a Softmax layer (denoted by a magenta circle). (b) A graphical illustration of a 2D RNN. Each RNN cell receives two hidden states $h_{i,j-1}$ and $h_{i-1,j}$, as well as two input vectors $\sigma_{i,j-1}$ and $\sigma_{i-1,j}$ (not shown) as illustrated by the black arrows. The red arrows correspond to the zigzag path we use for 2D autoregressive sampling. The initial memory state $h_0$ of the RNN and the initial inputs $\sigma_0$ (not shown) are null vectors. (c) An illustration of a dilated RNN, where the distance between each two RNN cells grows exponentially with depth to account for long-term dependencies. We choose depth $L = \lceil \log_2(N) \rceil$ where $N$ is the number of spins.

[Plot residue from an uncaptioned figure: panels (b) and (c), x-axis "Training step", legend entries "RNN with weight sharing", "RNN with no weight sharing", "Tensorized RNN", "Dilated RNN".]

B. Minimizing the variational free energy

To implement the variational classical annealing algorithm, we use the variational free energy

$$F_\lambda(T) = \langle H_{\text{target}} \rangle_\lambda - T\, S_{\text{classical}}(p_\lambda), \tag{20}$$

where the target Hamiltonian $H_{\text{target}}$ encodes the optimization problem and $T$ is the temperature. Moreover, $S_{\text{classical}}$ is the entropy of the distribution $p_\lambda$. To estimate $F_\lambda(T)$ we take $N_s$ exact samples $\boldsymbol{\sigma}^{(i)} \sim p_\lambda$ ($i = 1, \ldots, N_s$) drawn from the RNN and evaluate

$$F_\lambda(T) \approx \frac{1}{N_s} \sum_{i=1}^{N_s} F_{\text{loc}}(\boldsymbol{\sigma}^{(i)}),$$

where the local free energy is $F_{\text{loc}}(\boldsymbol{\sigma}) = H_{\text{target}}(\boldsymbol{\sigma}) + T \log(p_\lambda(\boldsymbol{\sigma}))$ [18]. Similarly, the gradients are given by

$$\partial_\lambda F_\lambda(T) \approx \frac{1}{N_s} \sum_{i=1}^{N_s} \partial_\lambda \log p_\lambda\big(\boldsymbol{\sigma}^{(i)}\big) \times \Big(F_{\text{loc}}(\boldsymbol{\sigma}^{(i)}) - F_\lambda(T)\Big),$$

where we subtract $F_\lambda(T)$ in order to reduce noise in the gradients [18, 20]. We note that this variational scheme exhibits a zero-variance principle, namely that the local free energy variance per spin

$$\sigma_F^2 \equiv \frac{\mathrm{var}(\{F_{\text{loc}}(\boldsymbol{\sigma})\})}{N}, \tag{21}$$

becomes zero when $p_\lambda$ matches the Boltzmann distribution, provided that mode collapse is avoided [18].

The gradient updates are implemented using the Adam optimizer [47]. Furthermore, the computational complexity of VCA for one gradient descent step is $\mathcal{O}(N_s \times N \times d_h^2)$ for 1D RNNs and 2D RNNs (both vanilla and tensorized versions) and $\mathcal{O}(N_s \times N \log(N) \times d_h^2)$ for dilated RNNs. Consequently, VCA has lower computational cost than VQA, which is implemented using VMC (see Methods Sec. V C).
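The sample-based estimators of $F_\lambda(T)$, its gradient, and the variance per spin of Eq. (21) can be sketched in a few lines. To keep the example self-contained, the RNN is replaced here by a much simpler stand-in, a product of independent Bernoulli distributions with logits `theta`, and the target is an open Ising chain; both choices and all function names are assumptions made for illustration, not the paper's implementation.

```python
import numpy as np

def sample_and_logp(theta, n_samples, rng):
    """Exact samples from a product distribution (stand-in for the RNN)."""
    p_up = 1.0 / (1.0 + np.exp(-theta))
    x = (rng.random((n_samples, theta.size)) < p_up).astype(float)  # 1 = up
    logp = np.sum(x * np.log(p_up) + (1.0 - x) * np.log1p(-p_up), axis=1)
    grad_logp = x - p_up            # d log p / d theta for Bernoulli logits
    return 2.0 * x - 1.0, logp, grad_logp

def chain_energy(spins, J):
    """H_target = -sum_i J_i s_i s_{i+1} on an open chain (assumed target)."""
    return -np.sum(J * spins[:, :-1] * spins[:, 1:], axis=1)

def free_energy_and_grad(theta, J, T, n_samples, rng):
    spins, logp, grad_logp = sample_and_logp(theta, n_samples, rng)
    f_loc = chain_energy(spins, J) + T * logp          # local free energy
    f_mean = f_loc.mean()                              # estimate of F(T)
    # baseline-subtracted gradient estimator
    grad = (grad_logp * (f_loc - f_mean)[:, None]).mean(axis=0)
    var_per_spin = f_loc.var() / theta.size            # Eq. (21)
    return f_mean, grad, var_per_spin

# Zero couplings: the uniform distribution (theta = 0) is the Boltzmann
# distribution, so F = -T N log 2 and the local free energy has no variance.
rng = np.random.default_rng(1)
theta = np.zeros(6)
f_mean, grad, var_per_spin = free_energy_and_grad(theta, np.zeros(5), 1.5, 1000, rng)
```

In this trivial case every sample has the same local free energy, which is a small instance of the zero-variance principle stated above: the estimate equals $-T N \log 2$ and the gradient estimate vanishes.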

Finally, we note that in our implementations no training steps are performed at the end of annealing for both VCA and VQA.

C. Variational Monte Carlo

The main goal of Variational Monte Carlo is to approximate the ground state of a Hamiltonian $\hat{H}$ through the iterative optimization of an ansatz wave function $|\Psi_\lambda\rangle$. The VMC objective function is given by

$$E \equiv \frac{\langle \Psi_\lambda | \hat{H} | \Psi_\lambda \rangle}{\langle \Psi_\lambda | \Psi_\lambda \rangle}.$$

We note that an important class of stoquastic many-body Hamiltonians has ground states $|\Psi\rangle$ with strictly real and positive amplitudes in the standard product spin basis [48]. These ground states can be written down in terms of probability distributions,

$$|\Psi\rangle = \sum_{\boldsymbol{\sigma}} \Psi(\boldsymbol{\sigma}) |\boldsymbol{\sigma}\rangle = \sum_{\boldsymbol{\sigma}} \sqrt{P(\boldsymbol{\sigma})}\, |\boldsymbol{\sigma}\rangle. \tag{22}$$

To approximate this family of states, we use an RNN wave function, namely $\Psi_\lambda(\boldsymbol{\sigma}) = \sqrt{p_\lambda(\boldsymbol{\sigma})}$. Extensions to complex-valued RNN wave functions are defined in Ref. [20], and results on their ability to simulate variational quantum annealing of non-stoquastic Hamiltonians [49] will be reported elsewhere [50]. These families of RNN states are normalized by construction (i.e., $\langle \Psi_\lambda | \Psi_\lambda \rangle = 1$) and allow for accurate estimates of the energy expectation value. By taking $N_s$ exact samples $\boldsymbol{\sigma}^{(i)} \sim p_\lambda$ ($i = 1, \ldots, N_s$), it follows that

$$E \approx \frac{1}{N_s} \sum_{i=1}^{N_s} E_{\text{loc}}(\boldsymbol{\sigma}^{(i)}).$$

The local energy is given by

$$E_{\text{loc}}(\boldsymbol{\sigma}) = \sum_{\boldsymbol{\sigma}'} H_{\boldsymbol{\sigma}\boldsymbol{\sigma}'} \frac{\Psi_\lambda(\boldsymbol{\sigma}')}{\Psi_\lambda(\boldsymbol{\sigma})}, \tag{23}$$

where the sum over $\boldsymbol{\sigma}'$ is tractable when the Hamiltonian $\hat{H}$ is local. Similarly, we can also estimate the energy gradients as

$$\partial_\lambda E = \frac{2}{N_s} \sum_{i=1}^{N_s} \partial_\lambda \log \Psi_\lambda\big(\boldsymbol{\sigma}^{(i)}\big) \Big(E_{\text{loc}}\big(\boldsymbol{\sigma}^{(i)}\big) - E\Big).$$

Here, we can subtract the term $E$ in order to reduce noise in the estimation of our gradients without introducing a bias [20, 51]. In fact, when the ansatz is close to an eigenstate of $\hat{H}$, then $E_{\text{loc}}(\boldsymbol{\sigma}) \approx E$, which means that the variance of gradients $\mathrm{Var}(\partial_{\lambda_j} E) \approx 0$ for each variational parameter $\lambda_j$. We note that this is similar in spirit to the control variate methods in Monte Carlo and to the baseline methods in reinforcement learning [51].

Similarly to the minimization scheme of the variational free energy in Methods Sec. V B, VMC also exhibits a zero-variance principle, where the energy variance per spin

$$\sigma^2 \equiv \frac{\mathrm{var}(\{E_{\text{loc}}(\boldsymbol{\sigma})\})}{N}, \tag{24}$$

becomes zero when $|\Psi_\lambda\rangle$ matches an eigenstate of $\hat{H}$, which thanks to the minimization of the variational energy $E$ is likely to be the ground state $|\Psi_G\rangle$.

The gradients $\partial_\lambda \log(\Psi_\lambda(\boldsymbol{\sigma}))$ are numerically computed using automatic differentiation [52]. We use the Adam optimizer to perform gradient descent updates, with a learning rate $\eta$, to optimize the variational parameters $\lambda$ of the RNN wave function. We note that in the presence of $\mathcal{O}(N)$ non-diagonal elements in a Hamiltonian $\hat{H}$, the local energies $E_{\text{loc}}(\boldsymbol{\sigma})$ have $\mathcal{O}(N)$ terms (see Eq. (23)). Thus, the computational complexity of one gradient descent step is $\mathcal{O}(N_s \times N^2 \times d_h^2)$ for 1D RNNs and 2D RNNs (both vanilla and tensorized versions).

D. Simulated Quantum Annealing and Simulated Annealing

Simulated Quantum Annealing is a standard quantum-inspired classical technique that has traditionally been used to benchmark the behavior of quantum annealers [24]. It is usually implemented via the path-integral Monte Carlo method [11], a QMC method that simulates equilibrium properties of quantum systems at finite temperature. To illustrate this method, consider a $D$-dimensional time-dependent quantum Hamiltonian

$$\hat{H}(t) = -\sum_{i,j} J_{ij}\, \hat{\sigma}^z_i \hat{\sigma}^z_j - \Gamma(t) \sum_{i=1}^{N} \hat{\sigma}^x_i,$$

where $\Gamma(t) = \Gamma_0 (1 - t)$ controls the strength of the quantum annealing dynamics at a time $t \in [0, 1]$. By applying the Suzuki-Trotter formula to the partition function of the quantum system,

$$Z = \mathrm{Tr} \exp\{-\beta \hat{H}(t)\}, \tag{25}$$

with the inverse temperature $\beta = \frac{1}{T}$, we can map the $D$-dimensional quantum Hamiltonian onto a $(D + 1)$-dimensional classical system consisting of $P$ coupled replicas (Trotter slices) of the original system

$$H_{D+1}(t) = -\sum_{k=1}^{P} \left( \sum_{i,j} J_{ij}\, \sigma^k_i \sigma^k_j + J_\perp(t) \sum_{i=1}^{N} \sigma^k_i \sigma^{k+1}_i \right), \tag{26}$$

where $\sigma^k_i$ is the classical spin at site $i$ and replica $k$. The term $J_\perp(t)$ corresponds to uniform coupling between $\sigma^k_i$ and $\sigma^{k+1}_i$ for each site $i$, such that

$$J_\perp(t) = -\frac{PT}{2} \ln\left( \tanh\left( \frac{\Gamma(t)}{PT} \right) \right).$$
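To make the replica mapping concrete, the sketch below evaluates $J_\perp$ and the classical energy $H_{D+1}$ of Eq. (26) for a given set of replicas. The function names are hypothetical, the sum over $i,j$ is taken over the entries of the coupling matrix exactly as supplied (here an upper-triangular matrix, so each bond is counted once), and the replica index is treated periodically; these are assumptions of the example rather than details confirmed at this point in the text.

```python
import math
import numpy as np

def j_perp(gamma, P, T):
    """Inter-replica coupling J_perp = -(P*T/2) * ln tanh(gamma / (P*T))."""
    return -0.5 * P * T * np.log(np.tanh(gamma / (P * T)))

def replica_energy(spins, J, gamma, T):
    """Classical energy H_{D+1} of P replicas.

    spins : (P, N) array of +/-1 values, row k is Trotter slice k.
    J     : (N, N) coupling matrix; summed over all entries as given.
    """
    P = spins.shape[0]
    intra = np.einsum("ki,ij,kj->", spins, J, spins)    # sum_k sum_ij J_ij s_i^k s_j^k
    inter = np.sum(spins * np.roll(spins, -1, axis=0))  # sum_k sum_i s_i^k s_i^{k+1}, periodic in k
    return -(intra + j_perp(gamma, P, T) * inter)

# Illustrative values: a 4-site open chain with unit couplings, P*T = 1
P, N, T, gamma = 10, 4, 0.1, 1.0
J = np.eye(N, k=1)               # ones on the superdiagonal
spins = np.ones((P, N))          # all replicas fully polarized
energy = replica_energy(spins, J, gamma, T)
```

Because $\tanh(x) < 1$, the coupling $J_\perp$ is positive (ferromagnetic between slices) and grows as $\Gamma$ decreases, which is how the replicas are progressively locked together toward the end of the schedule.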

We note that periodic boundary conditions $\sigma^{P+1} \equiv \sigma^{1}$ arise because of the trace in Eq. (25).

Interestingly, we can approximate $Z$ with an effective partition function $Z_p$ at temperature $PT$ given by [35]:

$$Z_p \propto \mathrm{Tr} \exp\left\{ -\frac{H_{D+1}(t)}{PT} \right\},$$

which can now be simulated with a standard Metropolis-Hastings Monte Carlo algorithm. A key element to this algorithm is the energy difference induced by a single spin flip at site $\sigma^k_i$, which is equal to

$$\Delta_i E_{\text{local}} = 2 \sum_j J_{ij}\, \sigma^k_i \sigma^k_j + 2 J_\perp(t) \left( \sigma^{k-1}_i \sigma^k_i + \sigma^k_i \sigma^{k+1}_i \right).$$

Here, the second term encodes the quantum dynamics. In our simulations we consider single spin flip (local) moves applied to all sites in all slices. We can also perform a global move [35], which means flipping a spin at location $i$ in every slice $k$. Clearly this has no impact on the term dependent on $J_\perp$, because it contains only terms quadratic in the flipped spin, so that

$$\Delta_i E_{\text{global}} = 2 \sum_{k=1}^{P} \sum_j J_{ij}\, \sigma^k_i \sigma^k_j.$$

In summary, a single Monte Carlo step (MCS) consists of first performing a single local move on all sites in each $k$-th slice and on all slices, followed by a global move for all sites. For the SK model and the WPE model studied in this paper, we use $P = 100$, whereas for the EA model we use $P = 20$ similarly to Ref. [11]. Before starting the quantum annealing schedule, we first thermalize the system by performing SA [35] from a temperature $T_0 = 3$ to a final temperature $1/P$ (so that $PT = 1$). This is done in 60 steps, where at each temperature we perform 100 Metropolis moves on each site. We then perform SQA using a linear schedule that decreases the field from $\Gamma_0$ to a final value close to zero, $\Gamma(t = 1) = 10^{-8}$, where five local and global moves are performed for each value of the magnetic field $\Gamma(t)$, so that it is consistent with the choice of $N_{\text{train}} = 5$ for VCA (see Sec. II and III A). Thus, the number of MCS is equal to five times the number of annealing steps.

For the standalone SA, we decrease the temperature from $T_0$ to $T(t = 1) = 10^{-8}$. Here, a single MCS consists of a Monte Carlo sweep, i.e., attempting a spin-flip for all sites. For each thermal annealing step, we perform five MCS, and hence similar to SQA, the number of MCS is equal to five times the number of annealing steps. Furthermore, we do a warm-up step for SA, by performing $N_{\text{warmup}}$ MCS to equilibrate the Markov Chain at the initial temperature $T_0$ and to provide a consistent choice with VCA (see Sec. II).

ACKNOWLEDGMENTS

We acknowledge Jack Raymond for suggesting to use the Wishart Planted Ensemble as a benchmark for our variational annealing setup. We also thank Christopher Roth, Cunlu Zhou, Martin Ganahl and Giuseppe Santoro for fruitful discussions. We are also grateful to Lauren Hayward for providing her plotting code to produce our figures using the Matplotlib library. Our RNN implementation is based on Tensorflow and NumPy. We acknowledge support from the Natural Sciences and Engineering Research Council (NSERC), a Canada Research Chair, the Shared Hierarchical Academic Research Computing Network (SHARCNET), Compute Canada, Google Quantum Research Award, and the Canadian Institute for Advanced Research (CIFAR) AI chair program. Resources used in preparing this research were provided, in part, by the Province of Ontario, the Government of Canada through CIFAR, and companies sponsoring the Vector Institute www.vectorinstitute.ai/#partners. Research at Perimeter Institute is supported in part by the Government of Canada through the Department of Innovation, Science and Economic Development Canada and by the Province of Ontario through the Ministry of Economic Development, Job Creation and Trade.

Appendix A: Numerical proof of principle of adiabaticity

As demonstrated in Sec. III, we have shown that both VQA and VCA are effective at finding the classical ground state of disordered spin chains. Here, we further illustrate the adiabaticity of both VQA and VCA. First, we perform VQA on the uniform ferromagnetic Ising chain (i.e., $J_{i,i+1} = 1$) with $N = 20$ spins and open boundary conditions with an initial transverse field $\Gamma_0 = 2$. Here, we use a tensorized RNN wave function with weight sharing across sites of the chain. We also choose $N_{\text{annealing}} = 1024$. In Fig. 7(a), we show that the variational energy tracks the exact ground energy throughout the annealing process with high accuracy. We also observe that optimizing an RNN wave function from scratch, i.e., randomly reinitializing the parameters of the model at each new value of the transverse magnetic field, is not optimal. This observation underlines the importance of transferring the parameters of our wave function ansatz after each annealing step. Furthermore, in Fig. 7(b) we illustrate that the RNN wave function's residual energy is much lower compared to the gap throughout the annealing process, which shows that VQA remains adiabatic for a large number of annealing steps.

Similarly, in Fig. 7(c) we perform VCA with an initial temperature $T_0 = 2$ on the same model, the same system size, the same ansatz, and the same number of annealing steps. We see an excellent agreement between the RNN wave function free energy and the exact free energy, highlighting once again the adiabaticity of our emulation of classical annealing, as well as the importance of transferring the parameters of our ansatz after each annealing step. Taken all together, the results in Fig. 7 support the notion that VQA and VCA evolutions can be adiabatic.

Figure 8. Variational annealing on random Ising chains, where we represent the residual energy per site $\epsilon_{\text{res}}/N$ vs $N_{\text{annealing}}$ for both VQA and VCA. The system sizes are $N = 32, 64, 128$ and we use random discrete couplings $J_{i,i+1} \in \{-1, 1\}$.

[Figure 8 legend, fitted decays $\epsilon_{\text{res}}/N \propto 1/t^{\kappa}$ for $N = 32, 64, 128$. Panel (a): VQA $\kappa = 0.99 \pm 0.01$, $1.02 \pm 0.02$, $1.08 \pm 0.06$; VCA $\kappa = 1.53 \pm 0.01$, $1.66 \pm 0.02$, $1.85 \pm 0.04$. Panel (b): VQA $\kappa = 0.96 \pm 0.03$, $1.01 \pm 0.05$, $1.05 \pm 0.04$; VCA $\kappa = 1.32 \pm 0.05$, $1.28 \pm 0.05$, $1.51 \pm 0.06$.]

In Fig. 8 we report the residual energies per site against the number of annealing steps $N_{\text{annealing}}$. Here, we consider $J_{i,i+1}$ uniformly sampled from the discrete set $\{-1, +1\}$, where the ground state configuration is disordered and the ground state energy is given by $E_G = -\sum_{i=1}^{N-1} |J_{i,i+1}| = -(N - 1)$. The decay exponents for VCA are in the interval 1.3 to 1.6 and the VQA exponents are approximately 1. These exponents also suggest an asymptotic speed-up compared to SA and coherent quantum annealing, where the residual energies follow a logarithmic law [29, 53–55]. The latter confirms the robustness of the observations in Fig. 3.
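For comparison with the ground state energy $E_G = -(N-1)$ quoted above, the standalone SA protocol from the Methods can be sketched on a random $\pm 1$ chain as follows. This is a minimal illustration, not the authors' code: the linear temperature schedule, the omission of the warm-up step, the system size, the seed and all function names are assumptions.

```python
import numpy as np

def chain_energy(s, J):
    """Energy of an open Ising chain, H = -sum_i J_i s_i s_{i+1}."""
    return -np.sum(J * s[:-1] * s[1:])

def simulated_annealing(J, n_annealing, T0=2.0, T_final=1e-8, mcs_per_step=5, seed=0):
    """Metropolis SA with five sweeps per annealing step (linear schedule assumed)."""
    rng = np.random.default_rng(seed)
    N = J.size + 1
    s = rng.choice([-1, 1], size=N)
    for step in range(n_annealing):
        T = T0 + (T_final - T0) * step / max(n_annealing - 1, 1)
        for _ in range(mcs_per_step):
            for i in range(N):                 # one sweep: try to flip every site
                field = 0.0
                if i > 0:
                    field += J[i - 1] * s[i - 1]
                if i < N - 1:
                    field += J[i] * s[i + 1]
                dE = 2.0 * s[i] * field        # energy change if s[i] is flipped
                if dE <= 0 or rng.random() < np.exp(-dE / T):
                    s[i] = -s[i]
    return s

rng = np.random.default_rng(42)
N = 16
J = rng.choice([-1.0, 1.0], size=N - 1)
E_G = -np.sum(np.abs(J))                       # equals -(N - 1) for +/-1 couplings
ground_state = np.cumprod(np.concatenate([[1.0], J]))  # s_{i+1} = J_i * s_i satisfies every bond
s = simulated_annealing(J, n_annealing=200)
residual_per_site = (chain_energy(s, J) - E_G) / N
```

Repeating the run for increasing `n_annealing` and averaging `residual_per_site` over disorder realizations would give the kind of residual-energy curve that Fig. 8 reports for VQA and VCA.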

Figure 7. Numerical evidence of adiabaticity on the uniform Ising chain with $N = 20$ spins for VQA in panels (a) and (b) and VCA in panel (c). (a) Variational energy of the RNN wave function against the transverse magnetic field $\Gamma$, with $\lambda$ initialized using the parameters optimized in the previous annealing step (transferred parameters, green curve) and with random parameter reinitialization (random parameters, purple curve). These strategies are compared with the exact energy obtained from exact diagonalization (dashed black line). (b) Residual energy of the RNN wave function vs the transverse field $\Gamma$. Throughout annealing with VQA, the residual energy is always much smaller than the gap within error bars. (c) Variational free energy vs temperature $T$ for a VCA run with $\lambda$ initialized using the parameters optimized in the previous annealing step (transferred parameters, purple line) and with random reinitialization (random parameters, orange curve).

Appendix B: The variational adiabatic theorem

In this section, we derive a sufficient condition for the number of gradient descent steps needed to maintain the variational ansatz close to the instantaneous ground state throughout the VQA simulation. First, consider a variational wave function $|\Psi_\lambda\rangle$ and the following time-dependent Hamiltonian:

$$\hat{H}(t) = \hat{H}_{\text{target}} + f(t)\, \hat{H}_D.$$

The goal is to find the ground state of the target Hamiltonian $\hat{H}_{\text{target}}$ by introducing quantum fluctuations through a driving Hamiltonian $\hat{H}_D$, where $\hat{H}_D \gg \hat{H}_{\text{target}}$. Here $f(t)$ is a decreasing schedule function such that $f(0) = 1$, $f(1) = 0$ and $t \in [0, 1]$.
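For a handful of spins, the interpolating Hamiltonian $\hat{H}(t)$ can be built explicitly and diagonalized, which gives direct access to its lowest instantaneous energies along the schedule. The sketch below assumes a ferromagnetic open Ising chain as $\hat{H}_{\text{target}}$, a transverse-field driver $\hat{H}_D = -\Gamma_0 \sum_i \hat{\sigma}^x_i$ and the linear schedule $f(t) = 1 - t$; these specific choices and the helper names are illustrative assumptions, in line with the quantum Hamiltonian used in the Methods.

```python
from functools import reduce
import numpy as np

SX = np.array([[0.0, 1.0], [1.0, 0.0]])
SZ = np.array([[1.0, 0.0], [0.0, -1.0]])
ID = np.eye(2)

def site_op(op, i, n_sites):
    """Operator `op` acting on site i of an n_sites-spin system."""
    return reduce(np.kron, [op if k == i else ID for k in range(n_sites)])

def hamiltonians(J, gamma0):
    """Dense H_target (Ising chain) and H_D (transverse field), assumed forms."""
    n_sites = J.size + 1
    H_target = -sum(J[i] * site_op(SZ, i, n_sites) @ site_op(SZ, i + 1, n_sites)
                    for i in range(n_sites - 1))
    H_D = -gamma0 * sum(site_op(SX, i, n_sites) for i in range(n_sites))
    return H_target, H_D

def lowest_two(H_target, H_D, t):
    """Two lowest eigenvalues of H(t) = H_target + f(t) H_D with f(t) = 1 - t."""
    f = 1.0 - t
    return np.linalg.eigvalsh(H_target + f * H_D)[:2]

n_sites = 4
H_target, H_D = hamiltonians(np.ones(n_sites - 1), gamma0=2.0)
e_start = lowest_two(H_target, H_D, t=0.0)
e_end = lowest_two(H_target, H_D, t=1.0)
gap_start = e_start[1] - e_start[0]
```

At $t = 1$ the lowest energy equals $-(N-1)$, and for this zero-field ferromagnet the two lowest levels coincide because of the global spin-flip symmetry, so the example only illustrates the construction; the derivation in this appendix assumes a non-degenerate ground state.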

Let $E(\lambda, t) = \langle \Psi_\lambda | \hat{H}(t) | \Psi_\lambda \rangle$, and $E_G(t)$, $E_E(t)$ the instantaneous ground/excited state energy of the Hamiltonian $\hat{H}(t)$, respectively. The instantaneous energy gap $g(t)$ is defined as $g(t) \equiv E_E(t) - E_G(t)$.

To simplify our discussion, we consider the case of a target Hamiltonian that has a non-degenerate ground state. Here, we decompose the variational wave function as:

$$|\Psi_\lambda\rangle = (1 - a(t))^{\frac{1}{2}}\, |\Psi_G(t)\rangle + a(t)^{\frac{1}{2}}\, |\Psi_\perp(t)\rangle, \tag{B1}$$

where $|\Psi_G(t)\rangle$ is the instantaneous ground state and $|\Psi_\perp(t)\rangle$ is a superposition of all the instantaneous excited states. From this decomposition, one can show that [56]:

$$a(t) \leq \frac{E(\lambda, t) - E_G(t)}{g(t)}. \tag{B2}$$

As a consequence, in order to satisfy adiabaticity, i.e., $|\langle \Psi_\perp(t) | \Psi_\lambda \rangle|^2 \ll 1$ for all times $t$, then one should have $a(t) < \epsilon \ll 1$ where $\epsilon$ is a small upper bound on the overlap between the variational wave function and the excited states. This means that the success probability $P_{\text{success}}$ of obtaining the ground state at $t = 1$ is bounded from below by $1 - \epsilon$. From Eq. (B2), to satisfy $a(t) < \epsilon$, it is sufficient to have:

$$\epsilon_{\text{res}}(\lambda, t) \equiv E(\lambda, t) - E_G(t) < \epsilon\, g(t). \tag{B3}$$

To satisfy the latter condition, we require a slightly stronger condition as follows:

$$\epsilon_{\text{res}}(\lambda, t) < \frac{\epsilon\, g(t)}{2}. \tag{B4}$$

In our derivation of a sufficient condition on the number of gradient descent steps to satisfy the previous requirement, we use the following set of assumptions:

• (A1) $|\partial_t^k E_G(t)|, |\partial_t^k g(t)|, |\partial_t^k f(t)| \leq \mathcal{O}(\mathrm{poly}(N))$, for all $0 \leq t \leq 1$ and for $k \in \{1, 2\}$.

• (A2) $|\langle \Psi_\lambda | \hat{H}_D | \Psi_\lambda \rangle| \leq \mathcal{O}(\mathrm{poly}(N))$ for all possible parameters $\lambda$ of the variational wave function.

• (A3) No anti-crossing during annealing, i.e., $g(t) \neq 0$, for all $0 \leq t \leq 1$.

• (A4) The gradients $\partial_\lambda E(\lambda, t)$ can be calculated exactly, are $L(t)$-Lipschitz with respect to $\lambda$ and $L(t) \leq \mathcal{O}(\mathrm{poly}(N))$ for all $0 \leq t \leq 1$.

• (A5) Local convexity, i.e., close to convergence when $\epsilon_{\text{res}}(\lambda, t) < \epsilon\, g(t)$, the energy landscape of $E(\lambda, t)$ is convex with respect to $\lambda$, for all $0 < t \leq 1$. Note that this assumption is $\epsilon$-dependent.

• (A6) The parameters vector $\lambda$ is bounded by a polynomial in $N$, i.e., $||\lambda|| \leq \mathcal{O}(\mathrm{poly}(N))$, where we define "$||.||$" as the euclidean $L_2$ norm.

• (A7) The variational wave function $|\Psi_\lambda\rangle$ is expressive enough, i.e.,

$$\min_\lambda \epsilon_{\text{res}}(\lambda, t) < \frac{\epsilon\, g(t)}{4}, \quad \forall t \in [0, 1].$$

Note that this assumption is also $\epsilon$-dependent.

• (A8) At $t = 0$, the energy landscape of $E(\lambda, t = 0)$ is globally convex with respect to $\lambda$.

Theorem. Given the assumptions (A1) to (A8), a sufficient (but not necessary) number of gradient descent steps $N_{\text{steps}}$ to satisfy the condition (B4) during the VQA protocol is bounded as:

$$\mathcal{O}\left( \frac{\mathrm{poly}(N)}{\epsilon \min_{\{t_n\}}(g(t_n))} \right) \leq N_{\text{steps}} \leq \mathcal{O}\left( \frac{\mathrm{poly}(N)}{\epsilon^2 \min_{\{t_n\}}(g(t_n))^2} \right),$$

where $(t_1, t_2, t_3, \ldots)$ is an increasing finite sequence of time steps, satisfying $t_1 = 0$ and $t_{n+1} = t_n + \delta t_n$, where

$$\delta t_n = \mathcal{O}\left( \frac{\epsilon\, g(t_n)}{\mathrm{poly}(N)} \right).$$

Proof: In order to satisfy the condition Eq. (B4) during the VQA protocol, we follow these steps:

• Step 1 (warm-up step): we prepare our variational wave function at the ground state at $t = 0$ such that Eq. (B4) is verified at time $t = 0$.

• Step 2 (annealing step): we change time $t$ by an infinitesimal amount $\delta t$, so that the condition (B3) is verified at time $t + \delta t$.

• Step 3 (training step): we tune the parameters of the variational wave function, using gradient descent, so that the condition (B4) is satisfied at time $t + \delta t$.

• Step 4: we loop over steps 2 and 3 until we arrive at $t = 1$, where we expect to obtain the ground state energy of the target Hamiltonian.

Let us first start with step 2 assuming that step 1 is verified. In order to satisfy the requirement of this step at time $t$, then $\delta t$ has to be chosen small enough so that

$$\epsilon_{\text{res}}(\lambda_t, t + \delta t) < \epsilon\, g(t + \delta t) \tag{B5}$$

is verified given that the condition (B4) is satisfied at time $t$. Here, $\lambda_t$ are the parameters of the variational wave function that satisfies the condition (B4) at time $t$. To get a sense of how small $\delta t$ should be, we do a Taylor expansion, while fixing the parameters $\lambda_t$, to get:

$$\begin{aligned} \epsilon_{\text{res}}(\lambda_t, t + \delta t) &= \epsilon_{\text{res}}(\lambda_t, t) + \partial_t \epsilon_{\text{res}}(\lambda_t, t)\, \delta t + \mathcal{O}((\delta t)^2), \\ &< \frac{\epsilon\, g(t)}{2} + \partial_t \epsilon_{\text{res}}(\lambda_t, t)\, \delta t + \mathcal{O}((\delta t)^2), \end{aligned}$$

where we used the condition (B4) to go from the second line to the third line. Here, $\partial_t \epsilon_{\text{res}}(\lambda_t, t) = \partial_t f(t) \langle \hat{H}_D \rangle - \partial_t E_G(t)$. To satisfy the condition (B3) at time $t + \delta t$, it is enough to have the right hand side of the previous inequality to be much smaller than the gap at $t + \delta t$, i.e.,

$$\frac{\epsilon\, g(t)}{2} + \partial_t \epsilon_{\text{res}}(\lambda_t, t)\, \delta t + \mathcal{O}((\delta t)^2) < \epsilon\, g(t + \delta t).$$

By Taylor expanding the gap, we get:

$$\partial_t \epsilon_{\text{res}}(\lambda_t, t)\, \delta t + \mathcal{O}((\delta t)^2) < \frac{\epsilon\, g(t)}{2} + \epsilon\, \partial_t g(t)\, \delta t + \mathcal{O}((\delta t)^2),$$

hence, it is enough to satisfy the following condition:

$$\big( \partial_t \epsilon_{\text{res}}(\lambda_t, t) - \epsilon\, \partial_t g(t) \big)\, \delta t + \mathcal{O}((\delta t)^2) < \frac{\epsilon\, g(t)}{2}. \tag{B6}$$

Using the Taylor-Laplace formula, one can express the Taylor remainder term $\mathcal{O}((\delta t)^2)$ as follows:

$$\mathcal{O}((\delta t)^2) = \int_t^{t + \delta t} (\tau - t)\, A(\tau)\, d\tau,$$

where $A(\tau) = \partial_\tau^2 \epsilon_{\text{res}}(\lambda_t, \tau) - \epsilon\, \partial_\tau^2 g(\tau) = \partial_\tau^2 f(\tau) \langle \hat{H}_D \rangle - \partial_\tau^2 E_G(\tau) - \epsilon\, \partial_\tau^2 g(\tau)$ and $\tau$ is between $t$ and $t + \delta t$. The last expression can be bounded as follows:

$$\mathcal{O}((\delta t)^2) \leq \int_t^{t + \delta t} (\tau - t)\, |A(\tau)|\, d\tau \leq \frac{(\delta t)^2}{2} \sup(|A|),$$

where "$\sup(|A|)$" is the supremum of $|A|$ over the interval $[0, 1]$. Given assumptions (A1) and (A2), then $\sup(|A|)$ is bounded from above by a polynomial in $N$, hence:

$$\mathcal{O}((\delta t)^2) \leq \mathcal{O}(\mathrm{poly}(N))\, (\delta t)^2 \leq \mathcal{O}(\mathrm{poly}(N))\, \delta t,$$

where the last inequality holds since $\delta t \leq 1$ as $t \in [0, 1]$, while we note that it is not necessarily tight. Furthermore, since $(\partial_t \epsilon_{\text{res}}(\lambda_t, t) - \epsilon\, \partial_t g(t))$ is also bounded from above by a polynomial in $N$ (according to assumptions (A1) and (A2)), then in order to satisfy Eq.

To estimate the scaling of the number of gradient descent steps $N_{\text{train}}(t)$ needed to satisfy (B8), we make use of assumptions (A4) and (A5). The assumption (A5) is reasonable providing that the variational energy $E(\lambda_t, t + \delta t)$ is very close to the ground state energy $E_G(t + \delta t)$, as given by Eq. (B5). Using the above assumptions and assuming that the learning rate $\eta(t) = 1/L(t)$, we can use a well-known result in [57] (see Sec. 2.1.5), which states the following inequality:

$$E(\tilde{\lambda}_t, t + \delta t) - \min_\lambda E(\lambda, t + \delta t) \leq \frac{2 L(t)\, ||\lambda_t - \lambda^*_{t + \delta t}||^2}{N_{\text{train}}(t) + 4}.$$

Here, $\tilde{\lambda}_t$ are the new variational parameters obtained after applying $N_{\text{train}}(t + \delta t)$ gradient descent steps starting from $\lambda_t$. Furthermore, $\lambda^*_{t + \delta t}$ are the optimal parameters such that:

$$E(\lambda^*_{t + \delta t}, t + \delta t) = \min_\lambda E(\lambda, t + \delta t).$$

Since the Lipschitz constant $L(t) \leq \mathcal{O}(\mathrm{poly}(N))$ (assumption (A4)) and $||\lambda_t - \lambda^*_{t + \delta t}||^2 \leq \mathcal{O}(\mathrm{poly}(N))$ (assumption (A6)), one can take

$$N_{\text{train}}(t + \delta t) = \mathcal{O}\left( \frac{\mathrm{poly}(N)}{\epsilon\, g(t + \delta t)} \right), \tag{B9}$$

with a suitable $\mathcal{O}(1)$ prefactor, so that:

$$E(\tilde{\lambda}_t, t + \delta t) - \min_\lambda E(\lambda, t + \delta t) < \frac{\epsilon\, g(t + \delta t)}{4}.$$

Moreover, by assuming that the variational wave function is expressive enough (assumption (A7)), i.e.,

$$\min_\lambda E(\lambda, t + \delta t) - E_G(t + \delta t) < \frac{\epsilon\, g(t + \delta t)}{4},$$

we can then deduce, by taking $\lambda_{t + \delta t} \equiv \tilde{\lambda}_t$ and summing the two previous inequalities, that:

$$E(\lambda_{t + \delta t}, t + \delta t) - E_G(t + \delta t) < \frac{\epsilon\, g(t + \delta t)}{2}.$$
(B6), it is t+δt − G 2 sufficient to require the following condition: Let us recall that in step 1, we have to initially pre- g(t) pare the variational ansatz to satisfy condition (B4) at (poly(N))δt < . O 2 t = 0. In fact, we can take advantage of the assump- tion (A4), where the gradients are L(0)-Lipschitz with Thus, it is sufficient to take: L(0) (poly(N)). We can also use the convexity as- ≤ O g(t) sumption (A8), and we can show that a sufficient num- δt = . (B7) O poly(N) ber of gradient descent steps to satisfy condition (B4) at   t = 0 is estimated as: By taking account of assumption (A3), δt can be taken poly(N) non-zero for all time steps t. As a consequence, assuming Nwarmup Ntrain(0) = . ≡ O g(0) the condition (B7) is verified for a non-zero δt and a   suitable (1) prefactor, then the condition (B5) is also The latter can be obtained in a similar way as in Eq. (B9). O verified. In conclusion, the total number of gradient steps Nsteps We can now move to step 3. Here, we apply a number to evolve the Hamiltonian Hˆ (0) to the target Hamilto- of gradient descent steps Ntrain(t) to find a new set of nian Hˆ (1), while verifying the condition (B4) is given parameters λt+δt such that: by:

g(t + δt) Nannealing+1 res(λt+δt, t+δt) = E(λt+δt, t+δt) EG(t+δt) < , − 2 Nsteps = Ntrain(tn), (B8) n=1 X 15 where each Ntrain(tn) satisfies the requirement (B9). The Appendix C: Default Hyperparameters annealing times t Nannealing+1 are defined such that { n}n=1 t1 0 and tn+1 tn + δtn. Here, δtn satisfies In this Appendix, we summarize the architectures and ≡ ≡ the hyperparameters of the simulations performed in this g(t ) δt = n . (B10) paper, as shown in Tab.I. The latter has shown to yield n O poly(N)   good performance, while we believe that a more advanced We also consider Nannealing the smallest integer such study of the hyperparameters can result in optimal re- sults. We also note that in this paper, VQA and VCA that tNannealing + δtNannealing 1, in this case, we define t 1, indicating the≥ end of annealing. Thus, were run using a single GPU workstation for each simula- Nannealing+1 ≡ Nannealing is the total number of annealing steps. Taking tion, while SQA and SA were performed on a multi-core this definition into account, then one can show that CPU. 1 Nannealing + 1. ≤ min(δtn) Appendix D: Benchmarking Recurrent neural t { n} network cells Using Eqs. (B7) and (B9) and the previous inequality, Nsteps can be bounded from above as: To show the advantage of tensorized RNNs over vanilla RNNs, we benchmark these architectures on the task of Nsteps (Nannealing + 1) max (Ntrain(tn)) ≤ tn finding the ground state of the uniform ferromagnetic { } Ising chain (i.e., Ji,i+1 = 1) with N = 100 spins at the 1 critical point (i.e., no annealing is employed). Since the + 2 max (Ntrain(tn)) ≤ min(δtn)  tn couplings in this model are site-independent, we choose t { } { n} the parameters of the model to be also site-independent.   2 poly(N) In Fig.9(a), we plot the energy variance per site σ (see , Eq. (24)) against the number of gradient descent steps. 
2 2 ≤ O  min(g(tn))  2 tn Here σ is a good indicator of the quality of the optimized  { }  wave function [59, 61, 62]. The results show that the where the transition from line 2 to line 3 is valid for tensorized RNN wave function can achieve both a lower a sufficiently small  and min t (g(tn)). Furthermore, { n} estimate of the energy variance and a faster convergence. Nsteps can also be bounded from below as: For the disordered systems studied in this paper, we set the weights Tn,Un and the biases bn, cn (in Eqs. (16) poly(N) and (17)) to be site-dependent. To demonstrate the ben- Nsteps max(Ntrain(tn)) = . (B11) ≥ tn O  min(g(tn)) efit of using site-dependent over site-independent param- { } t { n} eters when dealing with disordered systems, we bench-   Note that the minimum in the previous two bounds are mark both architectures on the task of finding the ground taken over all the annealing times t where 1 n state of the disordered Ising chain with random discrete n ≤ ≤ N + 1. couplings Ji,i+1 = 1 at the critical point, i.e., with a annealing ± In this derivation of the bound on Nsteps, we have as- transverse field Γ = 1. We show the results in Fig.9(b) and find that site-dependent parameters lead to a better sumed that the ground state of Hˆtarget is non-degenerate, so that the gap does not vanish at the end of annealing performance in terms of the energy variance per spin. (i.e., t = 1). In the case of degeneracy of the target Furthermore, we equally show the advantage of a di- ground state, we can define the gap g(t) by considering lated RNN ansatz compared to a tensorized RNN ansatz. the lowest energy level that does not lead to the degen- We train both of them for the task of finding the min- erate ground state. imum of the free energy of the Sherrington-Kirkpatrick It is also worth noting that the assumptions of this model with N = 20 spins and at temperature T = 1, derivation can be further expanded and improved. In as explained in Methods Sec.VB. 
Both RNNs have a particular, the gradients of E(λ, t) are computed stochas- comparable number of parameters (66400 parameters for tically (see Methods Sec.VC), as opposed to our as- the tensorized RNN and 59240 parameters for the dilated sumption (A4) where the gradients are assumed to be RNN). Interestingly, in Fig.9(c), we find that the dilated known exactly. To account for noisy gradients, it is RNN supersedes the tensorized RNN with almost an or- possible to use convergence bounds of stochastic gradi- der of magnitude difference in term of the free energy ent descent [47, 58] to estimate a bound on the num- variance per spin defined in Eq. (21). Indeed, this result ber of gradient descent steps. Second-order optimization suggests that the mechanism of skip connections allows methods such as stochastic reconfiguration/natural gra- dilated RNNs to capture long-term dependencies more dient [59, 60] can potentially show a significant advantage efficiently compared to tensorized RNNs. over first-order optimization methods, in terms of scaling with the minimum gap of the time-dependent Hamilto- nian Hˆ (t). 16
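To make the role of the free energy variance as a quality indicator concrete, the following minimal sketch evaluates it exactly on a small Sherrington-Kirkpatrick instance by enumerating all configurations. It assumes that Eq. (21) has the form $\sigma_F^2 = \mathrm{var}(F_{\rm loc})/N$ with $F_{\rm loc}(\sigma) = H(\sigma) + T \log p(\sigma)$; the system size ($N = 8$ rather than 20), the coupling normalization, the sign convention of $H$, and all function names are illustrative choices and are not taken from the paper. No RNN is involved: the exact Boltzmann distribution stands in for a perfectly trained model and the uniform distribution for an untrained one.

```python
import itertools
import numpy as np


def sk_couplings(n, rng):
    """Symmetric Gaussian couplings with variance 1/n and zero diagonal (assumed normalization)."""
    upper = np.triu(rng.normal(0.0, 1.0 / np.sqrt(n), size=(n, n)), k=1)
    return upper + upper.T


def energies(configs, couplings):
    """H(sigma) = -sum_{i<j} J_ij sigma_i sigma_j for every row of `configs` (assumed sign convention)."""
    return -0.5 * np.einsum("ki,ij,kj->k", configs, couplings, configs)


def free_energy_variance_per_spin(energy, log_prob, temperature, n):
    """Exact variance of F_loc = H + T log p under p, divided by the number of spins."""
    prob = np.exp(log_prob)
    f_loc = energy + temperature * log_prob
    mean = np.sum(prob * f_loc)
    return np.sum(prob * (f_loc - mean) ** 2) / n


n, T = 8, 1.0
rng = np.random.default_rng(0)
J = sk_couplings(n, rng)
configs = np.array(list(itertools.product([-1.0, 1.0], repeat=n)))
E = energies(configs, J)

# Exact Boltzmann distribution: F_loc equals -T log Z for every configuration.
log_w = -E / T
log_Z = log_w.max() + np.log(np.sum(np.exp(log_w - log_w.max())))
log_p_boltzmann = log_w - log_Z
var_boltz = free_energy_variance_per_spin(E, log_p_boltzmann, T, n)

# Uniform distribution: F_loc varies with the energy, so the variance is positive.
log_p_uniform = np.full(len(configs), -n * np.log(2.0))
var_unif = free_energy_variance_per_spin(E, log_p_uniform, T, n)

print(f"Boltzmann: {var_boltz:.3e}, uniform: {var_unif:.3e}")
```

Under these assumptions the variance vanishes (up to rounding) only when the model reproduces the equilibrium distribution, which is why a lower curve in Fig. 9(c) indicates a better ansatz.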

| Figures | Parameter | Value |
|---|---|---|
| Figs. 3 and 8 | Architecture | Tensorized RNN wave function with no weight sharing |
| | Number of memory units | $d_h = 40$ |
| | Number of samples | $N_s = 50$ |
| | Initial magnetic field for VQA | $\Gamma_0 = 2$ |
| | Initial temperature for VCA | $T_0 = 1$ |
| | Learning rate | $\eta = 5 \times 10^{-4}$ |
| | Warmup steps | $N_{\rm warmup} = 1000$ |
| | Number of random instances | $N_{\rm instances} = 25$ |
| Fig. 4 | Architecture | 2D tensorized RNN wave function with no weight sharing |
| | Number of memory units | $d_h = 40$ |
| | Number of samples | $N_s = 25$ |
| | Initial magnetic field | $\Gamma_0 = 1$ (for SQA, VQA and RVQA) |
| | Initial temperature | $T_0 = 1$ (for SA, VCA and RVQA) |
| | Learning rate | $\eta = 10^{-4}$ |
| | Number of warmup steps | $N_{\rm warmup} = 1000$ for $10 \times 10$ and $N_{\rm warmup} = 2000$ for $40 \times 40$ |
| | Number of random instances | $N_{\rm instances} = 25$ |
| Figs. 5(a) and (d) | Architecture | Dilated RNN wave function with no weight sharing |
| | Number of memory units | $d_h = 40$ |
| | Number of samples | $N_s = 50$ |
| | Initial temperature | $T_0 = 2$ (for SA and VCA) |
| | Initial magnetic field | $\Gamma_0 = 2$ (for SQA) |
| | Learning rate | $\eta = 10^{-4}$ |
| | Number of warmup steps | $N_{\rm warmup} = 2000$ |
| | Number of random instances | $N_{\rm instances} = 25$ |
| Figs. 5(b), (c), (e) and (f) | Architecture | Dilated RNN wave function with no weight sharing |
| | Number of memory units | $d_h = 20$ |
| | Number of samples | $N_s = 50$ |
| | Initial temperature | $T_0 = 1$ (for SA and VCA) |
| | Initial magnetic field | $\Gamma_0 = 1$ (for SQA) |
| | Learning rate | $\eta = 10^{-4}$ |
| | Number of warmup steps | $N_{\rm warmup} = 1000$ |
| | Number of random instances | $N_{\rm instances} = 25$ |
| Fig. 7 | Architecture | Tensorized RNN wave function with weight sharing |
| | Number of memory units | $d_h = 20$ |
| | Number of samples | $N_s = 50$ |
| | Initial temperature | $T_0 = 2$ |
| | Initial magnetic field | $\Gamma_0 = 2$ |
| | Learning rate | $\eta = 10^{-3}$ |
| | Number of warmup steps | $N_{\rm warmup} = 1000$ |
| Figs. 9(a) and (b) | Architecture | RNN wave function |
| | Number of memory units | $d_h = 50$ |
| | Number of samples | $N_s = 50$ |
| | Learning rate | $\eta = 10^{-3}$ for Fig. 9(a) and $\eta = 5 \times 10^{-4}$ for Fig. 9(b) |
| Fig. 9(c) | Architecture | RNN wave function with no weight sharing |
| | Number of memory units of dilated RNN | $d_h = 20$ |
| | Number of memory units of tensorized RNN | $d_h = 40$ |
| | Number of samples | $N_s = 100$ |
| | Learning rate | $\eta = 10^{-4}$ |

Table I. Hyperparameters used to obtain the results reported in this paper. Note that the number of samples stands for the batch size used to train the RNN.

3 10

0 200 400 600 800 1000 Training step b (b) b 100

1 10

RNN with weight sharing 2

2 10 RNN with no weight sharing

3 10

4 10 0 200 400 600 800 1000 Training step c (c) c

1 10 2 F

2 Tensorized RNN 10 Dilated RNN

0 200 400 600 800 1000 Training step

Figure 9. Energy (or free energy) variance per spin $\sigma^2$ vs the number of training steps. (a) We compare tensorized and vanilla RNN ansatzes, both with weight sharing across sites, on the uniform ferromagnetic Ising chain at the critical point with $N = 100$ spins. (b) Comparison between a tensorized RNN with and without weight sharing, trained to find the ground state of the random Ising chain with discrete disorder ($J_{i,i+1} = \pm 1$) at criticality with $N = 20$ spins. (c) Comparison between a tensorized RNN and a dilated RNN ansatz, both with no weight sharing, trained to find the Sherrington-Kirkpatrick model's equilibrium distribution with $N = 20$ spins at temperature $T = 1$.
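As a small illustration of why the energy variance per spin tracks the quality of a wave function, the sketch below evaluates it by exact diagonalization on a short transverse-field Ising chain at $J = \Gamma = 1$. It assumes that Eq. (24) has the form $\sigma^2 = (\langle \hat{H}^2 \rangle - \langle \hat{H} \rangle^2)/N$; the chain length ($N = 6$ instead of 100), the open boundary conditions, the sign convention of the Hamiltonian, and all function names are illustrative assumptions rather than the setup used for Fig. 9. An exact eigenstate stands in for a converged ansatz and a product state for a poor one.

```python
from functools import reduce

import numpy as np

I2 = np.eye(2)
SX = np.array([[0.0, 1.0], [1.0, 0.0]])
SZ = np.array([[1.0, 0.0], [0.0, -1.0]])


def site_operator(op, site, n):
    """Embed a single-site operator at position `site` of an n-spin chain."""
    return reduce(np.kron, [op if k == site else I2 for k in range(n)])


def tfim_hamiltonian(n, coupling=1.0, field=1.0):
    """Dense open-chain H = -J sum_i sz_i sz_{i+1} - Gamma sum_i sx_i (assumed convention)."""
    h = np.zeros((2 ** n, 2 ** n))
    for i in range(n - 1):
        h -= coupling * site_operator(SZ, i, n) @ site_operator(SZ, i + 1, n)
    for i in range(n):
        h -= field * site_operator(SX, i, n)
    return h


def energy_variance_per_spin(h, psi, n):
    """(<H^2> - <H>^2) / n for a real state vector psi, normalized internally."""
    psi = psi / np.linalg.norm(psi)
    h_psi = h @ psi
    return (h_psi @ h_psi - (psi @ h_psi) ** 2) / n


n = 6
H = tfim_hamiltonian(n)

# Exact ground state: an eigenstate has zero energy variance up to rounding.
eigvals, eigvecs = np.linalg.eigh(H)
var_ground = energy_variance_per_spin(H, eigvecs[:, 0], n)

# Product state with all spins along +x: only the coupling terms contribute,
# giving a variance per spin of (n - 1) J^2 / n.
plus_state = np.ones(2 ** n)
var_product = energy_variance_per_spin(H, plus_state, n)

print(f"ground state: {var_ground:.3e}, product state: {var_product:.6f}")
```

In this toy setting the product state has a variance per spin of $(N - 1)J^2/N = 5/6$, while the eigenstate gives zero up to floating-point error, mirroring the interpretation of the curves in panels (a) and (b).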

[1] Andrew Lucas, “Ising formulations of many NP problems,” Front. Phys. 2, 5 (2014).
[2] F Barahona, “On the computational complexity of Ising spin glass models,” Journal of Physics A: Mathematical and General 15, 3241–3253 (1982).
[3] S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi, “Optimization by simulated annealing,” Science 220, 671–680 (1983).
[4] C Koulamas, SR Antony, and R Jaen, “A survey of simulated annealing applications to operations research problems,” Omega 22, 41–56 (1994).
[5] Bruce Hajek, “A tutorial survey of theory and applications of simulated annealing,” in 1985 24th IEEE Conference on Decision and Control (1985) pp. 755–760.
[6] D.I. Svergun, “Restoring low resolution structure of biological macromolecules from solution scattering using simulated annealing,” Biophysical Journal 76, 2879–2886 (1999).
[7] David S. Johnson, Cecilia R. Aragon, Lyle A. McGeoch, and Catherine Schevon, “Optimization by simulated annealing: An experimental evaluation; part II, graph coloring and number partitioning,” Operations Research 39, 378–406 (1991).
[8] M. A. Abido, “Robust design of multimachine power system stabilizers using simulated annealing,” IEEE Transactions on Energy Conversion 15, 297–304 (2000).
[9] Torsten Karzig, Armin Rahmani, Felix von Oppen, and Gil Refael, “Optimal control of Majorana zero modes,” Phys. Rev. B 91, 201404 (2015).
[10] Georges Gielen, Herman Walscharts, and Willy Sansen, “Analog circuit design optimization based on symbolic simulation and simulated annealing,” in ESSCIRC ’89: Proceedings of the 15th European Solid-State Circuits Conference (1989) pp. 252–255.
[11] Giuseppe E. Santoro, Roman Martoňák, Erio Tosatti, and Roberto Car, “Theory of quantum annealing of an Ising spin glass,” Science 295, 2427–2430 (2002).
[12] J. Brooke, D. Bitko, T. F. Rosenbaum, and G. Aeppli, “Quantum annealing of a disordered magnet,” Science 284, 779–781 (1999).
[13] Debasis Mitra, Fabio Romeo, and Alberto Sangiovanni-Vincentelli, “Convergence and finite-time behavior of simulated annealing,” Advances in Applied Probability 18, 747–771 (1986).
[14] Daniel Delahaye, Supatcha Chaimatanan, and Marcel Mongeau, “Simulated annealing: From basics to applications,” in Handbook of Metaheuristics, edited by Michel Gendreau and Jean-Yves Potvin (Springer International Publishing, Cham, 2019) pp. 1–35.
[15] Ilya Sutskever, James Martens, and Geoffrey Hinton, “Generating text with recurrent neural networks,” in Proceedings of the 28th International Conference on International Conference on Machine Learning, ICML’11 (Omnipress, Madison, WI, USA, 2011) p. 1017–1024.
[16] Hugo Larochelle and Iain Murray, “The neural autoregressive distribution estimator,” in Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, Proceedings of Machine Learning Research, Vol. 15, edited by Geoffrey Gordon, David Dunson, and Miroslav Dudík (JMLR Workshop and Conference Proceedings, Fort Lauderdale, FL, USA, 2011) pp. 29–37.
[17] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin, “Attention is all you need,” (2017), arXiv:1706.03762 [cs.CL].
[18] Dian Wu, Lei Wang, and Pan Zhang, “Solving statistical mechanics using variational autoregressive networks,” Physical Review Letters 122 (2019), 10.1103/physrevlett.122.080602.
[19] Or Sharir, Yoav Levine, Noam Wies, Giuseppe Carleo, and Amnon Shashua, “Deep autoregressive models for the efficient variational simulation of many-body quantum systems,” Physical Review Letters 124 (2020), 10.1103/physrevlett.124.020503.
[20] Mohamed Hibat-Allah, Martin Ganahl, Lauren E. Hayward, Roger G. Melko, and Juan Carrasquilla, “Recurrent neural network wave functions,” Physical Review Research 2 (2020), 10.1103/physrevresearch.2.023358.
[21] Christopher Roth, “Iterative retraining of quantum spin models using recurrent neural networks,” (2020), arXiv:2003.06228 [physics.comp-ph].
[22] R.P. Feynman, Statistical Mechanics: A Set of Lectures, Advanced Books Classics (Avalon Publishing, 1998).
[23] Philip M. Long and Rocco A. Servedio, “Restricted Boltzmann machines are hard to approximately evaluate or simulate,” in Proceedings of the 27th International Conference on International Conference on Machine Learning, ICML’10 (Omnipress, Madison, WI, USA, 2010) p. 703–710.
[24] Sergio Boixo, Troels F Rønnow, Sergei V Isakov, Zhihui Wang, David Wecker, Daniel A Lidar, John M Martinis, and Matthias Troyer, “Evidence for quantum annealing with more than one hundred qubits,” Nat. Phys. 10, 218–224 (2014).
[25] Tadashi Kadowaki and Hidetoshi Nishimori, “Quantum annealing in the transverse Ising model,” Physical Review E 58, 5355–5363 (1998).
[26] M. Born and V. Fock, “Beweis des Adiabatensatzes,” Zeitschrift für Physik 51, 165–180 (1928).
[27] Glen Bigan Mbeng, Lorenzo Privitera, Luca Arceci, and Giuseppe E. Santoro, “Dynamics of simulated quantum annealing in random Ising chains,” Phys. Rev. B 99, 064201 (2019).
[28] Nilan Norris, “The standard errors of the geometric and harmonic means and their application to index numbers,” The Annals of Mathematical Statistics 11, 445–448 (1940).
[29] Tommaso Zanca and Giuseppe E. Santoro, “Quantum annealing speedup over simulated annealing on random Ising chains,” Phys. Rev. B 93, 224431 (2016).
[30] “https://software.cs.uni-koeln.de/spinglass/,”.
[31] Neil G Dickson, MW Johnson, MH Amin, R Harris, F Altomare, AJ Berkley, P Bunyk, J Cai, EM Chapple, P Chavez, et al., “Thermally assisted quantum annealing of a 16-qubit problem,” Nature communications 4, 1–6 (2013).
[32] Joseph Gomes, Keri A. McKiernan, Peter Eastman, and Vijay S. Pande, “Classical quantum optimization with neural network quantum states,” (2019), arXiv:1910.10675 [cond-mat.dis-nn].
[33] Semyon Sinchenko and Dmitry Bazhanov, “The deep learning and statistical physics applications to the problems of combinatorial optimization,” (2019), arXiv:1911.10680 [cond-mat.dis-nn].
[34] Tianchen Zhao, Giuseppe Carleo, James Stokes, and Shravan Veerapaneni, “Natural evolution strategies and quantum approximate optimization,” (2020), arXiv:2005.04447 [quant-ph].
[35] Roman Martoňák, Giuseppe E. Santoro, and Erio Tosatti, “Quantum annealing by the path-integral Monte Carlo method: The two-dimensional random Ising model,” Phys. Rev. B 66, 094203 (2002).
[36] M Mezard, G Parisi, and M Virasoro, Spin Glass Theory and Beyond (World Scientific, 1986) https://www.worldscientific.com/doi/pdf/10.1142/0271.
[37] David Sherrington and Scott Kirkpatrick, “Solvable model of a spin-glass,” Phys. Rev. Lett. 35, 1792–1796 (1975).
[38] Firas Hamze, Jack Raymond, Christopher A. Pattison, Katja Biswas, and Helmut G. Katzgraber, “Wishart planted ensemble: A tunably rugged pairwise Ising model with a first-order phase transition,” Physical Review E 101 (2020), 10.1103/physreve.101.052102.
[39] Kyle Mills, Pooya Ronagh, and Isaac Tamblyn, “Controlled online optimization learning (COOL): Finding the ground state of spin Hamiltonians with reinforcement learning,” (2020), arXiv:2003.00011 [physics.comp-ph].
[40] Yoshua Bengio, Andrea Lodi, and Antoine Prouvost, “Machine learning for combinatorial optimization: A methodological tour d’horizon,” European Journal of Operational Research (2020), https://doi.org/10.1016/j.ejor.2020.07.063.
[41] Ian Goodfellow, Yoshua Bengio, and Aaron Courville, Deep Learning (MIT Press, 2016) http://www.deeplearningbook.org.
[42] Richard Kelley, “Sequence modeling with recurrent tensor networks,” (2016).
[43] Shiyu Chang, Yang Zhang, Wei Han, Mo Yu, Xiaoxiao Guo, Wei Tan, Xiaodong Cui, Michael Witbrock, Mark Hasegawa-Johnson, and Thomas S. Huang, “Dilated recurrent neural networks,” (2017), arXiv:1710.02224 [cs.AI].
[44] Y. Bengio, P. Simard, and P. Frasconi, “Learning long-term dependencies with gradient descent is difficult,” IEEE Transactions on Neural Networks 5, 157–166 (1994).
[45] Salah El Hihi and Yoshua Bengio, “Hierarchical recurrent neural networks for long-term dependencies,” in Advances in Neural Information Processing Systems 8, edited by D. S. Touretzky, M. C. Mozer, and M. E. Hasselmo (MIT Press, 1996) pp. 493–499.
[46] G. Vidal, “Class of quantum many-body states that can be efficiently simulated,” Physical Review Letters 101 (2008), 10.1103/physrevlett.101.110501.
[47] Diederik P. Kingma and Jimmy Ba, “Adam: A method for stochastic optimization,” (2014), arXiv:1412.6980 [cs.LG].
[48] Sergey Bravyi, David P. Divincenzo, Roberto Oliveira, and Barbara M. Terhal, “The complexity of stoquastic local Hamiltonian problems,” Quantum Info. Comput. 8, 361–385 (2008).
[49] I. Ozfidan, C. Deng, A.Y. Smirnov, T. Lanting, R. Harris, L. Swenson, J. Whittaker, F. Altomare, M. Babcock, C. Baron, A.J. Berkley, K. Boothby, H. Christiani, P. Bunyk, C. Enderud, B. Evert, M. Hager, A. Hajda, J. Hilton, S. Huang, E. Hoskinson, M.W. Johnson, K. Jooya, E. Ladizinsky, N. Ladizinsky, R. Li, A. MacDonald, D. Marsden, G. Marsden, T. Medina, R. Molavi, R. Neufeld, M. Nissen, M. Norouzpour, T. Oh, I. Pavlov, I. Perminov, G. Poulin-Lamarre, M. Reis, T. Prescott, C. Rich, Y. Sato, G. Sterling, N. Tsai, M. Volkmann, W. Wilkinson, J. Yao, and M.H. Amin, “Demonstration of a nonstoquastic Hamiltonian in coupled superconducting flux qubits,” Phys. Rev. Applied 13, 034037 (2020).
[50] Mohamed Hibat-Allah, Estelle M. Inack, Roger G. Melko, and Juan Carrasquilla, (Manuscript in preparation).
[51] Shakir Mohamed, Mihaela Rosca, Michael Figurnov, and Andriy Mnih, “Monte Carlo gradient estimation in machine learning,” (2019), arXiv:1906.10652 [stat.ML].
[52] Shi-Xin Zhang, Zhou-Quan Wan, and Hong Yao, “Automatic differentiable Monte Carlo: Theory and application,” (2019), arXiv:1911.09117 [physics.comp-ph].
[53] Sei Suzuki, “Cooling dynamics of pure and random Ising chains,” Journal of Statistical Mechanics: Theory and Experiment 2009, P03032 (2009).
[54] Jacek Dziarmaga, “Dynamics of a quantum phase transition in the random Ising model: Logarithmic dependence of the defect density on the transition rate,” Phys. Rev. B 74, 064416 (2006).
[55] Tommaso Caneva, Rosario Fazio, and Giuseppe E. Santoro, “Adiabatic quantum dynamics of a random Ising chain across its quantum critical point,” Phys. Rev. B 76, 144427 (2007).
[56] Sandro Sorella and Federico Becca, SISSA Lecture notes on Numerical methods for strongly correlated electrons (Sec. 1.3) (2016).
[57] Yurii Nesterov, “Smooth convex optimization,” in Lectures on Convex Optimization (Springer International Publishing, Cham, 2018) pp. 59–137.
[58] Mark Schmidt, Nicolas Le Roux, and Francis Bach, “Minimizing finite sums with the stochastic average gradient,” (2013), arXiv:1309.2388 [math.OC].
[59] F. Becca and S. Sorella, Quantum Monte Carlo Approaches for Correlated Systems (Cambridge University Press, 2017).
[60] Shun-ichi Amari, “Natural gradient works efficiently in learning,” Neural Computation 10, 251–276 (1998), https://doi.org/10.1162/089976698300017746.
[61] Claudius Gros, “Criterion for a good variational wave function,” Phys. Rev. B 42, 6835–6838 (1990).
[62] Roland Assaraf and Michel Caffarel, “Zero-variance zero-bias principle for observables in quantum Monte Carlo: Application to forces,” The Journal of Chemical Physics 119, 10536–10552 (2003).