Variational Neural Annealing
Mohamed Hibat-Allah,1,2,* Estelle M. Inack,3,1 Roeland Wiersema,1,2 Roger G. Melko,2,3 and Juan Carrasquilla1,2
1 Vector Institute, MaRS Centre, Toronto, Ontario, M5G 1M1, Canada
2 Department of Physics and Astronomy, University of Waterloo, Ontario, N2L 3G1, Canada
3 Perimeter Institute for Theoretical Physics, Waterloo, ON N2L 2Y5, Canada
(Dated: January 26, 2021)

Many important challenges in science and technology can be cast as optimization problems. When viewed in a statistical physics framework, these can be tackled by simulated annealing, where a gradual cooling procedure helps search for groundstate solutions of a target Hamiltonian. While powerful, simulated annealing is known to have prohibitively slow sampling dynamics when the optimization landscape is rough or glassy. Here we show that by generalizing the target distribution with a parameterized model, an analogous annealing framework based on the variational principle can be used to search for groundstate solutions. Modern autoregressive models such as recurrent neural networks provide ideal parameterizations since they can be exactly sampled without slow dynamics even when the model encodes a rough landscape. We implement this procedure in the classical and quantum settings on several prototypical spin glass Hamiltonians, and find that it significantly outperforms traditional simulated annealing in the asymptotic limit, illustrating the potential power of this yet unexplored route to optimization.
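I. INTRODUCTION

A wide array of complex combinatorial optimization problems can be reformulated as finding the lowest energy configuration of an Ising Hamiltonian of the form [1]:

$$H_{\mathrm{target}} = -\sum_{i<j} J_{ij}\,\sigma_i\sigma_j - \sum_{i=1}^{N} h_i\,\sigma_i, \qquad (1)$$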
Figure 2. Variational neural annealing protocols. (a) The variational classical annealing (VCA) algorithm steps. A warm-up step brings the initialized variational state (green dot) close to the minimum of the free energy (cyan dot) at a given value of the order parameter M. This step is followed by an annealing step and a training step, which brings the variational state back to the new free energy minimum. Repeating the last two steps until $T(t = 1) = 0$ (red dots) produces approximate solutions to $H_{\mathrm{target}}$ if the protocol is conducted slowly enough. This schematic illustration corresponds to annealing through a continuous phase transition with an order parameter M. (b) Variational quantum annealing (VQA). VQA includes a warm-up step, followed by an annealing step and a training step, which brings the variational energy (green dot) closer to the new ground state energy (cyan dot). We loop over the previous two steps until reaching the target ground state of $\hat{H}_{\mathrm{target}}$ (red dot) if annealing is performed slowly enough.
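The warm-up, annealing, and training steps of Fig. 2(a) can be sketched in a few lines. The example below is only an illustration under loud assumptions: it swaps the RNN ansatz for a mean-field (product) distribution with parameters `lam`, uses the exact mean-field free energy and its analytic gradient instead of sampled gradients, and picks hypothetical values for `T0`, `n_warmup`, `n_annealing`, `n_train` and `lr`. It is not the implementation used in this work; it only shows how parameters are reused from one temperature to the next on a linear schedule $T(t) = T_0(1 - t)$.

```python
import numpy as np

def free_energy_and_grad(lam, J, T):
    """Return (energy, free energy, dF/dlam) for a product ansatz on an open Ising chain."""
    m = np.tanh(lam)                          # <sigma_i> under the product distribution
    p_up, p_dn = 0.5 * (1.0 + m), 0.5 * (1.0 - m)
    tiny = 1e-300                             # guards log(0) when |m| saturates
    entropy = -np.sum(p_up * np.log(p_up + tiny) + p_dn * np.log(p_dn + tiny))
    energy = -np.sum(J * m[:-1] * m[1:])      # <H_target> factorizes bond by bond
    dE_dm = np.zeros_like(m)
    dE_dm[:-1] -= J * m[1:]
    dE_dm[1:] -= J * m[:-1]
    # dS/dm_i = -arctanh(m_i) = -lam_i and dm_i/dlam_i = 1 - m_i**2
    grad = (dE_dm + T * lam) * (1.0 - m**2)
    return energy, energy - T * entropy, grad

def mean_field_vca(J, T0=3.0, n_warmup=50, n_annealing=200, n_train=5, lr=0.1, seed=0):
    """Warm-up at T0, then alternate annealing (lower T) and training (gradient steps)."""
    rng = np.random.default_rng(seed)
    lam = 0.1 * rng.normal(size=J.size + 1)   # random start breaks the up/down symmetry

    def train(lam, T, n_steps):
        for _ in range(n_steps):
            lam = lam - lr * free_energy_and_grad(lam, J, T)[2]
        return lam

    lam = train(lam, T0, n_warmup)            # warm-up step at fixed T(0) = T0
    for n in range(1, n_annealing):           # t = n / n_annealing < 1
        T = T0 * (1.0 - n / n_annealing)      # annealing step: linear cooling
        lam = train(lam, T, n_train)          # training step, reusing the previous parameters
    return lam                                # no training is done at t = 1, where T = 0

J = np.ones(7)                                # hypothetical uniform ferromagnetic chain, N = 8
lam = mean_field_vca(J)
final_energy = free_energy_and_grad(lam, J, 0.0)[0]
ground_energy = -np.sum(J)
```

Comparing `final_energy` with `ground_energy` gives the mean-field analogue of a residual energy. With an RNN ansatz the exact gradient would be replaced by a Monte Carlo estimate built from exact samples.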
where $H_{\mathrm{target}}$ in Eq. (1) is promoted to a quantum mechanical Hamiltonian $\hat{H}_{\mathrm{target}}$.

Quantum annealing algorithms typically start with a dominant driving term $\hat{H}_D \gg \hat{H}_{\mathrm{target}}$ chosen so that the ground state of $\hat{H}(0)$ is easy to prepare. When the strength of the driving term is subsequently reduced (typically adiabatically) using a schedule function $f(t)$, the system is annealed to the ground state of $\hat{H}_{\mathrm{target}}$. In analogy to its thermal counterpart, SQA emulates this process on classical computers using quantum Monte Carlo methods [11].

Here, we leverage the variational principle of quantum mechanics and devise a strategy that emulates quantum annealing variationally. We dub our second variational neural annealing algorithm variational quantum annealing (VQA). The latter is based on the variational Monte Carlo (VMC) algorithm, whose goal is to simulate the equilibrium properties of quantum systems at zero temperature (see Methods Sec. V C). In VMC, the ground state of a Hamiltonian $\hat{H}$ is modeled through an ansatz $|\Psi_\lambda\rangle$ endowed with parameters $\lambda$. The variational principle guarantees that the energy $\langle\Psi_\lambda|\hat{H}|\Psi_\lambda\rangle$ is an upper bound to the ground state energy of $\hat{H}$, which we use to define a time-dependent objective function $E(\lambda, t) \equiv \langle\hat{H}(t)\rangle_\lambda = \langle\Psi_\lambda|\hat{H}(t)|\Psi_\lambda\rangle$ to optimize the parameters $\lambda$.

The VQA setup, graphically summarized in Fig. 2(b), applies $N_{\mathrm{warmup}}$ gradient descent steps to minimize $E(\lambda, t = 0)$, which brings $|\Psi_\lambda\rangle$ close to the ground state of $\hat{H}(0)$. Setting $t = \delta t$ while keeping the parameters $\lambda_0$ fixed results in a variational energy $E(\lambda_0, t = \delta t)$. A set of $N_{\mathrm{train}}$ gradient descent steps brings the ansatz closer to the new instantaneous ground state, which results in a variational energy $E(\lambda_1, t = \delta t)$. The variational parameters optimized at time step $t$ are reused at time $t + \delta t$, which promotes the computational adiabaticity of the protocol (see Appendix A). We repeat the annealing and training steps $N_{\mathrm{annealing}}$ times on a linear schedule ($f(t) = 1 - t$ with $t \in [0, 1]$) until $t = 1$, at which point the system should solve the optimization problem (red dot in Fig. 2(b)). We note that in our simulations, no training steps are taken at $t = 1$. Finally, similarly to VCA, we choose normalized RNN wave functions [20, 21] as ansätze, giving the VQA algorithm access to exact Monte Carlo samples.

To gain theoretical insight on the principles behind a successful VQA simulation, we derive a variational version of the adiabatic theorem [26]. Starting from a set of assumptions, such as the convexity of the energy landscape in the warm-up phase and close to convergence during annealing, as well as the absence of noise in the energy gradients, we provide a bound on the total number of gradient descent steps $N_{\mathrm{steps}}$ that guarantees the adiabaticity of the VQA algorithm as well as a success probability of solving the optimization problem $P_{\mathrm{success}} > 1 - \epsilon$. Here, $\epsilon$ is an upper bound on the overlap between the variational wave function and the excited states of the Hamiltonian $\hat{H}(t)$, i.e., $|\langle\Psi_\perp(t)|\Psi_\lambda\rangle|^2 < \epsilon$. We show that $N_{\mathrm{steps}}$ can be bounded as (see Appendix B):

$$\mathcal{O}\!\left(\frac{\mathrm{poly}(N)}{\epsilon\,\min_{\{t_n\}} g(t_n)}\right) \le N_{\mathrm{steps}} \le \mathcal{O}\!\left(\frac{\mathrm{poly}(N)}{\epsilon^{2}\,\min_{\{t_n\}} g(t_n)^{2}}\right). \qquad (5)$$

The function $g(t)$ is the energy gap between the first excited state and the ground state of the instantaneous Hamiltonian $\hat{H}(t)$, $N$ is the system size, and the set of times $\{t_n\}$ is defined in Appendix B. As expected for hard optimization problems, the minimum gap typically decreases exponentially with system size $N$, which dominates the computational complexity of a VQA simulation. In cases where the minimum gap scales as the inverse of a polynomial in $N$, the number of steps $N_{\mathrm{steps}}$ is also polynomial in $N$.

[Figure 3 plot: residual energy per site $\epsilon_{\mathrm{res}}/N$ versus $N_{\mathrm{annealing}}$ on logarithmic axes for VQA and VCA at N = 32, 64, 128, with power-law fits in the legend.]

Figure 3. Variational neural annealing on a random Ising chain. Here we represent the residual energy per site $\epsilon_{\mathrm{res}}/N$ vs the number of annealing steps $N_{\mathrm{annealing}}$ for both VQA and VCA. The system sizes are N = 32, 64, 128. We use random positive couplings $J_{i,i+1} \in [0, 1)$ (see text for more details). The error bars represent the one s.d. statistical uncertainty calculated over different disorder realizations [28].

III. RESULTS

A. Annealing on random Ising chains

We now proceed to evaluate the power of VCA and VQA. As a first benchmark, we consider the task of solving for the ground state of the one-dimensional (1D) Ising Hamiltonian with random couplings $J_{i,i+1}$,

$$H_{\mathrm{target}} = -\sum_{i=1}^{N-1} J_{i,i+1}\,\sigma_i\sigma_{i+1}. \qquad (6)$$

First, we examine $J_{i,i+1}$ sampled from a uniform distribution in the interval [0, 1). Here, the ground state configuration is given either by all spins up or down, and the ground state energy is known exactly, i.e., $E_G = -\sum_{i=1}^{N-1} J_{i,i+1}$ [27].

We use a tensorized RNN ansatz without weight sharing for both VCA and VQA (see Methods Sec. V A). We consider system sizes N = 32, 64, 128 and $N_{\mathrm{train}} = 5$, which suffices to achieve accurate solutions. For VQA, we use a one-body driving term $\hat{H}_D = -\Gamma_0 \sum_{i=1}^{N} \hat{\sigma}^x_i$, where $\hat{\sigma}^{x,y,z}_i$ are Pauli matrices acting on site $i$. To quantify the performance of the algorithms, we use the residual energy [11],

$$\epsilon_{\mathrm{res}} = \left[\langle H_{\mathrm{target}}\rangle_{\mathrm{av}} - E_G\right]_{\mathrm{dis}}, \qquad (7)$$

where $E_G$ is the exact ground state energy of $H_{\mathrm{target}}$. We use the arithmetic mean for statistical averages $\langle\ldots\rangle_{\mathrm{av}}$ over samples from the models. For VCA it means that $\langle H_{\mathrm{target}}\rangle_{\mathrm{av}} \approx \langle H_{\mathrm{target}}\rangle_\lambda$, while for VQA the target Hamiltonian is promoted to $\hat{H}_{\mathrm{target}} = -\sum_{i=1}^{N-1} J_{i,i+1}\,\hat{\sigma}^z_i\hat{\sigma}^z_{i+1}$ and $\langle H_{\mathrm{target}}\rangle_{\mathrm{av}} \approx \langle\hat{H}_{\mathrm{target}}\rangle_\lambda$. We consider the typical (geometric) mean for averaging over instances of the target Hamiltonian, i.e., $[\ldots]_{\mathrm{dis}} = \exp(\langle\ln(\ldots)\rangle_{\mathrm{av}})$. The average in the argument of the exponential stands for the arithmetic mean over different realizations of the couplings. We take advantage of the autoregressive nature of the RNN and sample $10^6$ configurations at the end of the annealing, which allows us to accurately estimate the model's arithmetic mean. The typical mean is taken over 25 instances of $H_{\mathrm{target}}$.

In Fig. 3 we report the residual energies per site against the number of annealing steps $N_{\mathrm{annealing}}$. As expected, the residual energy is a decreasing function of $N_{\mathrm{annealing}}$, which underlines the importance of adiabaticity and annealing in our setting. In our examples, we observe that the decrease of the residual energy of VCA and VQA is consistent with a power-law decay for a large number of annealing steps. Whereas VCA's decay exponent is in the interval 1.5-1.9, the VQA exponent is about 0.9-1.1. These exponents suggest an asymptotic speed-up compared to SA and coherent quantum annealing, where the residual energies follow a logarithmic law [29]. Contrary to the observations in Ref. [29], where quantum annealing was found superior to SA, VCA finds an average residual energy an order of magnitude more accurate than VQA for a large number of annealing steps.

Finally, we note that the exponents provided above are not expected to be universal and are a priori sensitive to the hyperparameters of the algorithms, e.g., learning rate, model choice, number of training steps, optimizer, etc. Appendix C provides a summary of the hyperparameters used in our work. Additional illustrations of the adiabaticity of VCA and VQA, as well as of the annealing results for a chain with $J_{i,i+1}$ uniformly sampled from the discrete set $\{-1, +1\}$, are provided in Appendix A.
B. Edwards-Anderson model

We now consider the two-dimensional (2D) Edwards-Anderson (EA) model, which is a prototypical spin glass arranged on a square lattice with nearest neighbor random interactions. The problem of finding ground states of the model has been studied experimentally [12] and numerically [11] from the annealing perspective, as well as theoretically [2] from the computational complexity perspective. The EA model with open boundary conditions is given by

$$H_{\mathrm{target}} = -\sum_{\langle i,j\rangle} J_{ij}\,\sigma_i\sigma_j, \qquad (8)$$

where $\langle i, j\rangle$ denote nearest neighbors. The couplings $J_{ij}$ are drawn from a uniform distribution in the interval [-1, 1). In the absence of a longitudinal field, for which solving the EA model is NP-hard, the ground state can be found in polynomial time [2]. To find the exact ground state of each random realization, we use the spin-glass server [30].

[Figure 4 plots: residual energy per site $\epsilon_{\mathrm{res}}/N$ versus $N_{\mathrm{annealing}}$ (and versus $N_{\mathrm{steps}}$ for CQO) on logarithmic axes; panel (a) shows CQO, VQA, RVQA and VCA, panel (b) shows SA, SQA and VCA.]

Figure 4. Benchmarking the two-dimensional Edwards-Anderson spin glass. (a) A comparison between VCA, VQA, RVQA, and CQO on a 10 × 10 lattice by plotting the residual energy per site vs $N_{\mathrm{annealing}}$. For CQO, we report the residual energy per site vs the number of optimization steps $N_{\mathrm{steps}}$. (b) Comparison between SA, SQA with P = 20 trotter slices, and VCA using a 2D tensorized RNN ansatz on a 40 × 40 lattice. The annealing speed is the same for SA, SQA and VCA.

We use a 2D tensorized RNN ansatz without weight sharing for the variational protocols (see Methods Sec. V A). For VQA, we use a one-body driving term $\hat{H}_D = -\Gamma_0 \sum_{i=1}^{N} \hat{\sigma}^x_i$. Fig. 4(a) shows the annealing results obtained on a system size N = 10 × 10 spins. VCA outperforms VQA, and in the adiabatic, long-time annealing regime it produces solutions three orders of magnitude more accurate on average than VQA. In addition, we investigate the performance of VQA supplemented with a fictitious Shannon information entropy [21] term that mimics thermal relaxation effects observed in quantum annealing hardware [31]. This form of regularized VQA, here labelled RVQA, is described by a pseudo free energy cost function $\tilde{F}_\lambda(t) = \langle\hat{H}(t)\rangle_\lambda - T(t)\,S_{\mathrm{classical}}(|\Psi_\lambda|^2)$. As in VCA, the pseudo entropy term $S_{\mathrm{classical}}(|\Psi_\lambda|^2)$ at $f(1) = 0$ provides a heuristic approach to count the number of solutions to $H_{\mathrm{target}}$ for VQA and RVQA. The results in Fig. 4(a) do show an amelioration of the VQA performance, including changing a saturating dynamics at large $N_{\mathrm{annealing}}$ to a power-law like behavior. However, it appears to be insufficient to compete with the VCA scaling (see exponents in Fig. 4(a)). This observation suggests the superiority of a thermally driven variational emulation of annealing over a purely quantum one for this example.

To further scrutinize the relevance of the annealing effects in VCA, we also consider VCA with zero thermal fluctuations, i.e., setting $T_0 = 0$. Because of its intimate relation to the classical-quantum optimization (CQO) methods of Refs. [32-34], we refer to this setting as CQO. Fig. 4(a) shows that CQO takes about $10^3$ training steps to reach accuracies nearing 1%. The accuracy does not further improve upon additional training up to $10^5$ gradient steps, which indicates that CQO is prone to getting stuck in local minima. In comparison, VCA and VQA offer solutions orders of magnitude more accurate on average for a large number of annealing steps, highlighting the importance of annealing in tackling optimization problems.

Since VCA displays the best performance in the previous benchmarks, we use it to demonstrate its capabilities on a 40 × 40 spin system. For comparison, we use SA as well as SQA. The SQA simulation uses the path-integral Monte Carlo method [11] with P = 20 trotter slices, and we report averages over energies across all trotter slices, for each realization of randomness (see Methods Sec. V D). In addition, we average the energy obtained after 25 annealing runs on every instance of randomness for SA and SQA. To average over Hamiltonian instances, we use the typical mean over 25 different realizations for the three annealing methods. The results are shown in Fig. 4(b), where we present the residual energies per site against the number of annealing steps $N_{\mathrm{annealing}}$, which is set so that the speed of annealing is the same for SA, SQA and VCA. We first note that our results confirm the qualitative behavior of SA and SQA in Refs. [11, 35]. While SA and SQA produce lower residual energy solutions than VCA for small $N_{\mathrm{annealing}}$, we observe that VCA achieves residual energies about three orders of magnitude smaller than SQA and SA for a large number of annealing steps. Notably, the rate at which the residual energy improves with increasing $N_{\mathrm{annealing}}$ is significantly higher for VCA compared to SQA and SA, even at a relatively small number of annealing steps.

C. Fully-connected spin glasses

We now focus our attention on fully-connected spin glasses [2, 36]. We first focus on the Sherrington-Kirkpatrick (SK) model [37], which provides a conceptual framework for the understanding of the role of disorder and frustration in widely diverse systems ranging from materials to combinatorial optimization and machine learning. The SK Hamiltonian is given by

$$H_{\mathrm{target}} = -\frac{1}{2}\sum_{i\neq j} \frac{J_{ij}}{\sqrt{N}}\,\sigma_i\sigma_j, \qquad (9)$$

where $\{J_{ij}\}$ is a symmetric matrix such that each matrix element $J_{ij}$ is sampled from a gaussian distribution with mean 0 and variance 1.

Since VCA performed best in our previous examples, we use it to find ground states of the SK model for N = 100 spins. Here, exact ground state energies of the SK model are calculated using the spin-glass server [30] on a total of 25 instances of disorder. To account for long-distance dependencies between spins in the SK model, we use a dilated RNN ansatz that has $\lceil\log_2(N)\rceil = 7$ layers (see Methods Sec. V A) and set the initial temperature $T_0 = 2$. We compare our results with SA and SQA. For SQA, we start with an initial magnetic field $\Gamma_0 = 2$, while for SA we use $T_0 = 2$.

For an effective comparison, we first plot the residual energy per site as a function of $N_{\mathrm{annealing}}$ for VCA, SA and SQA (with P = 100 trotter slices). Here, the SA and SQA residual energies are obtained by averaging the outcome of 50 independent annealing runs, while for VCA we average the outcome of $10^6$ exact samples from the annealed RNN. For all methods, we take the typical average over 25 disorder instances. The results are shown in Fig. 5(a). As observed in the EA model, we note that SA and SQA produce lower residual energy solutions than VCA for small $N_{\mathrm{annealing}}$, but we emphasize that VCA delivers a lower residual energy compared to SQA and SA as the total number of annealing steps increases past $N_{\mathrm{annealing}} \sim 10^3$. Likewise, we observe that the rate at which the residual energy improves with increasing $N_{\mathrm{annealing}}$ is significantly higher for VCA in comparison to SQA and SA.

A more detailed look at the statistical behaviour of the methods at large $N_{\mathrm{annealing}}$ can be obtained from the residual energy histograms separately produced by each method, as shown in Fig. 5(d). The histograms contain 1000 residual energies for each of the same 25 disorder realizations. For each instance, we plot results for 1000 SA runs, 1000 samples obtained from the RNN at the end of annealing for VCA, and 10 SQA runs including the contribution from each of the P = 100 Trotter slices. We observe that VCA is superior to SA and SQA, as it produces a higher density of low energy configurations. This indicates that, even though VCA typically takes more annealing steps, it ultimately results in a higher chance of getting more accurate solutions to optimization problems than SA and SQA. Note that for the SK model, the SQA histogram remains quantitatively the same for 200 runs, and we report data of 10 runs only for fairness purposes compared to both SA and VCA.

We now focus on the Wishart planted ensemble (WPE), which is a class of zero-field Ising models with a first-order phase transition and tunable algorithmic hardness [38]. These problems belong to a special class of hard problem ensembles whose solutions are known a priori, which, together with the tunability of the hardness, makes the WPE model an ideal tool to benchmark heuristic algorithms for optimization problems. The Hamiltonian of the WPE model is defined as

$$H_{\mathrm{target}} = -\frac{1}{2}\sum_{i\neq j} J^{\alpha}_{ij}\,\sigma_i\sigma_j. \qquad (10)$$

Here $J^{\alpha}_{ij}$ is a symmetric matrix satisfying

$$J^{\alpha} = \tilde{J}^{\alpha} - \mathrm{diag}(\tilde{J})$$

and

$$\tilde{J}^{\alpha} = -\frac{1}{N} W_{\alpha} W_{\alpha}^{\mathsf{T}}.$$

The term $W_\alpha$ is an $N \times \lfloor\alpha N\rfloor$ random matrix satisfying $W_\alpha t_{\mathrm{ferro}} = 0$, where $t_{\mathrm{ferro}} = (+1, +1, ..., +1)$ is the ferromagnetic state (see Ref. [38] for details about the generation of $W_\alpha$). The ground state of the WPE model is known (i.e., it is planted) and corresponds to the ferromagnetic states $\pm t_{\mathrm{ferro}}$. Interestingly, $\alpha$ is a tunable parameter of hardness, where for $\alpha < 1$ this model displays a first-order transition, such that near zero temperature the paramagnetic states are meta-stable solutions [38]. This feature makes this model hard to solve with any annealing method, as the paramagnetic states are numerous compared to the two ferromagnetic states and hence act as a trap for a typical annealing method. We benchmark the three methods (SA, SQA and VCA) for N = 32 and $\alpha \in \{0.25, 0.5\}$.

We consider 25 instances of the couplings $\{J^{\alpha}_{ij}\}$ and attempt to solve the model with VCA implemented using a dilated RNN ansatz with $\lceil\log_2(N)\rceil = 5$ layers and an initial temperature $T_0 = 1$. For SQA (P = 100 trotter
Figure 5. Benchmarking SA, SQA (P = 100 trotter slices) and VCA on the Sherrington-Kirkpatrick (SK) model and the Wishart planted ensemble (WPE). Panels (a), (b), and (c) display the residual energy per site as a function of $N_{\mathrm{annealing}}$. (a) The SK model with N = 100 spins. (b) WPE with N = 32 spins and α = 0.5. (c) WPE with N = 32 spins and α = 0.25. Panels (d), (e) and (f) display the residual energy histogram for each of the different techniques and models in panels (a), (b), and (c), respectively. The histograms use 25000 data points for each method. Note that we choose a minimum threshold of $10^{-10}$ for $\epsilon_{\mathrm{res}}/N$, which is within our numerical accuracy.
slices), we use an initial magnetic field $\Gamma_0 = 1$, and for SA we start with $T_0 = 1$.

We first plot the scaling of residual energies per site $\epsilon_{\mathrm{res}}/N$ as shown in Figs. 5(b) and (c). Here we note that VCA is superior to SA and SQA for α = 0.5, as demonstrated in Fig. 5(b). More specifically, VCA is about three orders of magnitude more accurate than SQA and SA for a large number of annealing steps. In the case of α = 0.25 in Fig. 5(c), VCA is competitive: it achieves a performance similar to SA and SQA on average for a large number of annealing steps. We also represent the residual energies in histogram form. We observe that for α = 0.5 in Fig. 5(e), VCA achieves a higher density toward low residual energies $\epsilon_{\mathrm{res}}/N \sim 10^{-9}$-$10^{-10}$ compared to SA and SQA. For α = 0.25 in Fig. 5(f), VCA leads to a non-negligible density at very low residual energies as opposed to SA and SQA, whose solutions display residual energies orders of magnitude higher. Finally, our WPE simulations support the observation that VCA tends to improve the quality of solutions faster than SQA and SA for a large number of annealing steps.

IV. CONCLUSIONS AND OUTLOOK

In conclusion, we have introduced a strategy to combat the slow sampling dynamics encountered by simulated annealing when an optimization landscape is rough or glassy. Based on annealing the variational parameters of a generalized target distribution, our scheme, which we dub variational neural annealing, takes advantage of the power of modern autoregressive models, which can be exactly sampled without slow dynamics even when a rough landscape is encountered. We implement variational neural annealing parameterized by a recurrent neural network, and compare its performance to conventional simulated annealing on prototypical spin glass Hamiltonians known to have landscapes of varying roughness. We find that variational neural annealing produces accurate solutions to all of the optimization problems considered, including spin glass Hamiltonians where our techniques typically reach solutions orders of magnitude more accurate on average than conventional simulated annealing in the limit of a large number of annealing steps.

We emphasize that several hyperparameters, model, hardware, and variational objective function choices can be explored and may improve our methodologies. We have utilized a simple annealing schedule in our protocols and highlight that reinforcement learning can be used to improve it [39]. A critical insight gleaned from our experiments is that certain neural network architectures were more efficient on specific Hamiltonians. Thus, a natural direction is to study the intimate relation between the model architecture and the problem Hamiltonian, where we envision that symmetries and domain knowledge would guide the design of models and algorithms.

As we witness the unfolding of a new age for optimization powered by deep learning [40], we anticipate a rapid adoption of machine learning techniques in the space of combinatorial optimization, as well as domain-specific applications of our ideas in diverse [...]

A. Recurrent Neural Network Ansätze

Recurrent neural networks model complex probability distributions $p$ by taking advantage of the chain rule

$$p(\boldsymbol{\sigma}) = p(\sigma_1)\, p(\sigma_2 | \sigma_1) \cdots p(\sigma_N | \sigma_{N-1}, \ldots, \sigma_2, \sigma_1), \tag{11}$$

where specifying every conditional probability $p(\sigma_i | \sigma_{<i})$ [...] probabilities. In their original form, “vanilla” RNN cells [41] compute a new “hidden state” $\boldsymbol{h}_n$ with dimension $d_h$, for each site $n$, following the relation

$$\boldsymbol{h}_n = F(W[\boldsymbol{h}_{n-1}; \boldsymbol{\sigma}_{n-1}] + \boldsymbol{b}), \tag{12}$$

where $[\boldsymbol{h}_{n-1}; \boldsymbol{\sigma}_{n-1}]$ is the vector concatenation of $\boldsymbol{h}_{n-1}$ and a one-hot encoding $\boldsymbol{\sigma}_{n-1}$ of the binary variable $\sigma_{n-1}$ [20]. The function $F$ is a non-linear activation function. From this recursion relation, it is clear that the hidden state $\boldsymbol{h}_n$ encodes information about the previous spins $\sigma_{n' < n}$ [...] The joint probability distribution $p_\lambda(\boldsymbol{\sigma})$ is given by

$$p_\lambda(\boldsymbol{\sigma}) = p_\lambda(\sigma_1)\, p_\lambda(\sigma_2 | \sigma_1) \cdots p_\lambda(\sigma_N | \sigma_{<N})$$

[...]

For disordered systems, it is natural to forgo the common practice of weight sharing [41] of $W$, $U$, $\boldsymbol{b}$ and $\boldsymbol{c}$ in Eqs. (12), (13) and use an extended set of site-dependent variational parameters $\lambda$ comprised of $\{W_n\}_{n=1}^{N}$ and $\{U_n\}_{n=1}^{N}$ and biases $\{\boldsymbol{b}_n\}_{n=1}^{N}$, $\{\boldsymbol{c}_n\}_{n=1}^{N}$. The recursion relation and the Softmax layer are modified to

$$\boldsymbol{h}_n = F(W_n[\boldsymbol{h}_{n-1}; \boldsymbol{\sigma}_{n-1}] + \boldsymbol{b}_n), \tag{15}$$

and $p_\lambda(\sigma_n | \sigma_{<n})$ [...]

$$\boldsymbol{h}_n = F(\boldsymbol{\sigma}_{n-1}^{\intercal} T_n \boldsymbol{h}_{n-1} + \boldsymbol{b}_n), \tag{17}$$

where $\boldsymbol{\sigma}^{\intercal}$ is the transpose of $\boldsymbol{\sigma}$, and the variational parameters $\lambda$ are $\{T_n\}_{n=1}^{N}$, $\{U_n\}_{n=1}^{N}$, $\{\boldsymbol{b}_n\}_{n=1}^{N}$ and $\{\boldsymbol{c}_n\}_{n=1}^{N}$. This form of tensorized RNN increases the expressiveness of our ansatz as illustrated in Appendix D.

For two-dimensional systems, we make use of a two-dimensional extension of the recursion relation in vanilla [...] To enhance the expressive power of the model, we promote the recursion relation to a tensorized form

$$\boldsymbol{h}_{i,j} = F([\boldsymbol{\sigma}_{i-1,j}; \boldsymbol{\sigma}_{i,j-1}]\, T_{i,j}\, [\boldsymbol{h}_{i-1,j}; \boldsymbol{h}_{i,j-1}] + \boldsymbol{b}_{i,j}). \tag{19}$$

Here, $T_{i,j}$ are site-dependent weight tensors that have dimension $4 \times 2d_h \times d_h$. We also note that the coordinates $(i-1, j)$ and $(i, j-1)$ are path-dependent, and are given by the zigzag path, illustrated by the black arrows in Fig. 6(b). Moreover, to sample configurations from the 2D tensorized RNNs, we use the same zigzag path as [...]

[...]

$$\boldsymbol{h}_n^{(l)} = F(W_n^{(l)}[\boldsymbol{h}^{(l)}_{\max(0,\, n-2^{l-1})}; \boldsymbol{h}_n^{(l-1)}] + \boldsymbol{b}_n^{(l)}).$$

[...] of spins and $\lceil \ldots \rceil$ is the ceiling function. This means that two spins are connected with a path whose length is bounded by $\mathcal{O}(\log_2(N))$, which follows the spirit of the multi-scale renormalization ansatz [46]. For more details on the advantage of dilated RNNs over tensorized RNNs see Appendix D.
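As a minimal sketch of how the chain rule of Eq. (11) and the vanilla recursion of Eq. (12) permit exact sampling, the following pure-Python example draws one spin at a time and accumulates $\log p_\lambda(\boldsymbol{\sigma})$ along the way. Several ingredients are assumptions made for illustration only: the Softmax layer is taken in the standard form softmax($U \boldsymbol{h}_n + \boldsymbol{c}$), the initial hidden state and initial input are set to zero vectors, the weights are drawn uniformly at random, ELU is used for $F$, and all function names (`init_params`, `step`, `sample`, `log_prob`) are hypothetical rather than taken from the authors' code.

```python
import math
import random


def elu(x):
    """Exponential linear unit, used here as the activation F."""
    return x if x >= 0 else math.exp(x) - 1.0


def matvec(M, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    e = [math.exp(x - m) for x in z]
    total = sum(e)
    return [x / total for x in e]


def init_params(d_h, rng):
    """Random shared weights (assumed initialization, for illustration)."""
    def rand_matrix(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]
    # W acts on the concatenation [h; sigma], hence d_h + 2 columns.
    return {"W": rand_matrix(d_h, d_h + 2), "b": [0.0] * d_h,
            "U": rand_matrix(2, d_h), "c": [0.0, 0.0]}


def step(params, h, s_prev):
    """One recursion step: h_n = F(W [h_{n-1}; sigma_{n-1}] + b), then softmax."""
    pre = matvec(params["W"], h + s_prev)  # list concatenation = [h; sigma]
    h_new = [elu(a + b) for a, b in zip(pre, params["b"])]
    logits = [a + c for a, c in zip(matvec(params["U"], h_new), params["c"])]
    return h_new, softmax(logits)


ONE_HOT = ([1.0, 0.0], [0.0, 1.0])


def sample(params, n_spins, rng):
    """Draw one configuration spin by spin; return it with its log-probability."""
    h, s_prev = [0.0] * len(params["b"]), [0.0, 0.0]  # assumed zero start
    sigma, log_p = [], 0.0
    for _ in range(n_spins):
        h, probs = step(params, h, s_prev)
        s = 0 if rng.random() < probs[0] else 1
        sigma.append(s)
        log_p += math.log(probs[s])
        s_prev = ONE_HOT[s]
    return sigma, log_p


def log_prob(params, sigma):
    """Evaluate log p(sigma) as a sum of log conditionals (chain rule)."""
    h, s_prev = [0.0] * len(params["b"]), [0.0, 0.0]
    log_p = 0.0
    for s in sigma:
        h, probs = step(params, h, s_prev)
        log_p += math.log(probs[s])
        s_prev = ONE_HOT[s]
    return log_p
```

Because each conditional is normalized by the Softmax, summing `exp(log_prob(...))` over all $2^N$ configurations of a short chain gives 1 up to rounding. This is the property that lets samples be drawn directly, without a Markov chain.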
[Figure: panels (b) and (c). Horizontal axis: Training step. Legend: RNN with weight sharing; RNN with no weight sharing.]

We finally note that for all the RNN architectures in our work, we found accurate results using the exponential linear unit (ELU) activation function, defined as:

$$\mathrm{ELU}(x) = \begin{cases} x, & \text{if } x \geq 0, \\ \exp(x) - 1, & \text{if } x < 0. \end{cases}$$

B. Minimizing the variational free energy

To implement the variational classical annealing algorithm, we use the variational free energy

$$F_\lambda(T) = \langle H_{\mathrm{target}} \rangle_\lambda - T S_{\mathrm{classical}}(p_\lambda), \tag{20}$$

where the target Hamiltonian $H_{\mathrm{target}}$ encodes the optimization problem and $T$ is the temperature. Moreover, $S_{\mathrm{classical}}$ is the entropy of the distribution $p_\lambda$. To estimate $F_\lambda(T)$ we take $N_s$ exact samples $\boldsymbol{\sigma}^{(i)} \sim p_\lambda$ ($i = 1, \ldots, N_s$) drawn from the RNN and evaluate

$$F_\lambda(T) \approx \frac{1}{N_s} \sum_{i=1}^{N_s} F_{\mathrm{loc}}(\boldsymbol{\sigma}^{(i)}),$$