
Emergence of a finite-size-scaling function in the supervised learning of the Ising phase transition

Dongkyu Kim and Dong-Hee Kim

Department of Physics and Photon Science, Gwangju Institute of Science and Technology, Gwangju 61005, Republic of Korea

E-mail: [email protected]

Abstract. We investigate the connection between the supervised learning of the binary phase classification in the ferromagnetic Ising model and the standard finite-size-scaling theory of the second-order phase transition. Proposing a minimal one-free-parameter neural network model, we analytically formulate the supervised learning problem for the canonical ensemble being used as a training data set. We show that just one free parameter is sufficient to describe the data-driven emergence of the universal finite-size-scaling function in the network output that is observed in a large neural network, theoretically validating its critical point prediction for unseen test data from different underlying lattices yet in the same universality class of the Ising criticality. We also numerically demonstrate the interpretation with the proposed one-parameter model by providing an example of finding a critical point with the learning of the Landau mean-field free energy being applied to the real data set from the uncorrelated random scale-free graph with a large degree exponent.

1. Introduction

Understanding how an artificial neural network learns the state of matter is an intriguing subject for the applications of machine learning to various domains including the study of phase transitions in physical systems [1–5]. In a typical form of the multilayer perceptron, a neural network consists of layers of neurons that are connected through a feedforward network structure. The network can in principle produce a mathematical function approximating any desired outputs for given inputs [6–9], and one can optimize the associated neural network parameters for a particular purpose in a data-driven way. In the supervised learning for the classification of data, which we particularly focus on in this study, the network is optimized to reproduce the labels of the already classified training input data. Remarkably, the neural network trained in a data-driven way often produces predictions with reasonable accuracy even for unacquainted data of a similar type, which is not necessarily from the same data set or system given in the training. With various machine learning schemes being examined, the phase classification and the detection of a phase transition point have been extensively studied in classical and quantum model systems in recent years [10–50].

Because the approach is data-driven rather than based on first principles, witnessing the empirical successes naturally leads to fundamental questions such as what specific information the neural network learns from the training data, to what extent and why it works even for unacquainted data or systems, how trustworthy such data-driven prediction can be, and most importantly, what is the mathematical foundation of the learnability. A general difficulty in addressing these questions is due to the nature of the “black box” model, where one can hardly see inside because of the high complexity generated by the interplay between a large number of neural network components. While the opaque nature may not harm its empirical usefulness, especially when it works as a recommender, transparency can be crucial in applications requiring extreme reliability, where one wants logical justification of how the network reaches its predictions. Explainable machine learning to deal with issues along this direction has attracted much attention in domains of scientific applications [51]. In the machine-learning detection of phase transitions and critical phenomena, there are increasing efforts in interpreting how the machine prediction works or in designing transparent machines, as demonstrated in several previous studies [19–21, 32–38, 47].

Our goal in this paper is to interpret the predicting power of a neural network classifying the phases of the Ising model into the conventional physics language of the critical phenomena by proposing an analytically solvable model network having just one free parameter. The Ising model has been employed as a popular test bed of machine learning and is particularly useful for our purpose of discussing the learnability since it is a well-established model of the second-order phase transition in statistical physics. Our work is closely related to the seminal work by Carrasquilla and Melko [10] where a network with a single hidden layer of 100 neurons was trained with the pre-assigned phase labels of the Ising spin data that were given according to whether the data were sampled below or above the known critical point of the training system.
It turned out that the one trained for the square lattices was reusable for the unseen data from the triangular lattices without any cost of a new training, providing a good estimate of a critical point with a finite-size-scaling behavior. In our previous work [21], we investigated this reusability by downsizing the neural network. We found that the hidden layer could be as small as the one with just two neurons without loss of the prediction accuracy. In the downsized network model, we argued that its reusability for the systems in the different lattices is encoded in the system-size scaling behavior of the network parameters, which is universal for any other lattices in the same universality class.

In this paper, we further simplify the neural network model into a minimal setting with just one free parameter, providing a more transparent mathematical view on how the learning and prediction of the critical point occur with the data of the Ising model. Despite the minimal design, we find that a single parameter is all that is necessary to capture the behavior observed in a large neural network that plays an essential role in the prediction accuracy and the reusability acquired from the training. The present one-parameter model improves the idea of the previous neural network models [10, 21] in terms of transparency and analytical interpretability. In our previous two-node model [21] that needs two free parameters, the fluctuations of the order parameter were ignored for the convenience of analytic treatment, which we find to be important and which is now fully incorporated into the present derivation with the one-parameter model. On the other hand, the three-node model proposed previously by Carrasquilla and Melko [10] also had one free parameter but unfortunately was not analytically explored further. While the third hidden neuron is unnecessary in our model, our derivations and discussions can be directly applied to the previous three-node model because of the similarity between their functional forms of the output.

Analytically minimizing the cross entropy for the supervised learning with the canonical ensemble at an arbitrarily large system size, we show that the trained network output becomes a universal scaling function of the order parameter with the standard critical exponents. This emergence of the scaling function is consistent with the empirical observation in a large neural network, and we find that it works as a universal kernel for the prediction with unseen test data from different lattices but belonging to the same Ising universality class. We demonstrate the operation of the one-parameter model by presenting the learning with the Landau mean-field free energy and its prediction accuracy of the critical point with the data from the uncorrelated random scale-free graph that belongs to the mean-field class.

This paper is organized as follows. In Sec. 2, the procedures of the supervised learning are described. In Sec. 3, the implications of the scaling form emerging in the network output are discussed. In Sec. 4, the one-parameter neural network is presented with the derivation of the analytic scaling solution. The demonstration with the Landau mean-field free energy and the application to the data of the Ising model on the random scale-free graph is given in Sec. 5. The summary and conclusions are given in Sec. 6.

2. Supervised learning of the phase transition in the Ising model

We consider the classical ferromagnetic spin-1/2 Ising model with the nearest-neighbor exchange interactions without a magnetic field, which is described by the Hamiltonian H = −J Σ_{⟨i,j⟩} si sj, where the spin si at a site i takes the value of either 1 or −1, and the summation runs over all the nearest-neighbor sites ⟨i, j⟩ in the given lattices. The interaction strength J and the Boltzmann constant kB are set to be unity throughout this paper. The spin configuration s ≡ {s1, s2, . . . , sN} is given as an input to train the neural network, which is labeled as the ordered or disordered phase depending on whether the temperature associated with the data is lower or higher than the critical temperature Tc given for the supervision. The learning with the labeled data for the binary classification can be done by minimizing the cross entropy [56, 57],

L(x) = − Σ_s [Q(s) ln F(s; x) + (1 − Q(s)) ln(1 − F(s; x))] ,   (1)

with respect to the neural network parameters x. The function Q(s) returns the binary value 0 or 1 representing the label of the data s. The function F(s; x) is the output of the neural network, giving a value between 0 and 1 for an input s. The parameter x is to be optimized to maximize the likelihood between the distribution of the output F and the given distribution of the actual label Q.

We prepare the data set of spin configurations at a given temperature by assuming an unbiased sampling with the Boltzmann probability in the canonical ensemble. The unbiased sampling is important to the mechanism of predicting a correct Tc with the trained network. While our main results are obtained from the analytic calculation of the cross entropy minimization based on our one-parameter model of the neural network, we also need the numerically generated data for the verification with real lattice geometries. Depending on the necessities in the numerical demonstration, we employ the Wang-Landau sampling method [52–54] computing the joint density of states or the Wolff cluster update [55] generating the spin configuration data.

The prediction of a critical point is done based on how the network output behaves with the temperature associated with the test data given as an input. However, it comes with practical ambiguity arising from the fact that the value of the output fluctuates severely across the inputs at the temperatures near the critical point. While one might consider a smooth curve of an average ⟨F⟩ evaluated over many test inputs at a given temperature, one would still need a criterion or threshold to discriminate the ordered and disordered phases. There are previously suggested ways to obtain the location of a transition point, such as the scheme of the learning by confusion [11]. In the simplest case, where the training data thoroughly cover very fine grids of temperature across the transition as we consider here, one can just pick a certain cut such as ⟨F⟩ = 1/2 used in Ref. [10] to get the estimate of a transition temperature. Interestingly, the temperature corresponding to the cut showed a finite-size-scaling behavior with various sizes of the systems being examined [10], and it turned out that a specific value of the cut does not matter in the finite-size-scaling analysis [21].
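For concreteness, the minimization of Eq. (1) can be set up numerically along the following lines. This is a minimal sketch assuming NumPy and an already labeled set of Monte Carlo spin configurations; the function names and the particular choice of a logistic output acting on |m| are only illustrative placeholders for the generic F(s; x) above.

import numpy as np

def cross_entropy(x, samples, labels, output):
    """Cross entropy of Eq. (1) for a binary phase classifier.

    x       : neural-network parameter(s) to be optimized
    samples : spin configurations, shape (n_samples, N), entries +/-1
    labels  : Q(s) = 0 (ordered) or 1 (disordered) for each configuration
    output  : callable F(samples, x) returning values between 0 and 1
    """
    F = np.clip(output(samples, x), 1e-12, 1.0 - 1e-12)   # keep the logarithms finite
    return -np.sum(labels * np.log(F) + (1.0 - labels) * np.log(1.0 - F))

def logistic_of_m(samples, x, a=50.0):
    """Illustrative output: a logistic function of |m| with a threshold parameter x."""
    m = np.abs(samples.mean(axis=1))             # order parameter of each configuration
    return 1.0 / (1.0 + np.exp(a * (m - x)))     # close to 1 for |m| < x, close to 0 above

Minimizing cross_entropy over x with any standard optimizer then realizes the training described above.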


Figure 1. Neural network models for the phase classification in the Ising model. The output F of the trained network is plotted as a function of the order parameter m computed for every individual input data of the spin configuration. (a) The network model with a single hidden layer of many neurons [10], which is trained here with 50 hidden neurons (see Ref. [21] for the details of the data preparation). The marker and error bar indicate the average and range of the output at each m, respectively. (b) The previous two-node model trained with the Wang-Landau data [21]. (c) A schematic diagram of the structure producing the universal scaling function. Minimal models of the gray box in (c) are sketched in (d) and (e) with the sigmoid and Heaviside step activation functions, respectively.

3. Emergence of the universal scaling function in the network output

Revisiting the behavior of the previous large-size neural network trained with many hidden neurons [10], we observe a particular scaling form emerging in the network output when it is plotted as a function of the order parameter m = Σ_i si/N for every individual data s. Figure 1(a) shows the training results with the data in the square lattices of N = L × L sites, and it turns out that these input-output curves for various system sizes fall onto a common curve in the scaling of |m|L^{β/ν} with the critical exponents β and ν of the Ising universality class in two dimensions. Our previous two-node model [21] shows the same feature in the network output as shown in Fig. 1(b).

The emergence of the scaling function F∗(|m|L^{β/ν}) in the network output reveals a simple explanation of how it finds a genuine critical point and why it works even for unseen test data generated in different lattices yet in the same Ising universality class of the critical exponents. For instance, two different trainings on the square and triangular lattices would give neural networks with exactly the same scaling form of the output if it is plotted as a function of m.

With the test inputs of s being sampled from the probability distribution pL(s, T) at a temperature T in the system of size L, the averaged network output is written as ⟨FL⟩ = ∫ pL(s, T) FL(s) ds. If the test data set is prepared unbiasedly from the canonical ensemble, then the test data distribution can be expressed by the finite-size-scaling form [58–60] of pL(s, T) ≡ L^{β/ν} p∗(|m|L^{β/ν}, tL^{1/ν}) near the critical point Tc, where t ≡ T/Tc − 1 denotes the reduced temperature. Consequently, going across the critical point in the temperature axis, the averaged network output is finally rewritten as

⟨FL⟩ = ∫ dm L^{β/ν} p∗(|m|L^{β/ν}, tL^{1/ν}) F∗(|m|L^{β/ν}) ≡ G∗(tL^{1/ν}) ,   (2)

which is exactly what was observed numerically in the previous work [10]. The function G∗(tL^{1/ν}) immediately indicates that the crossing point at t = 0 between the curves of different L's gives an exact critical point Tc that is associated with p∗ of the test data set. In Ref. [10], the temperatures corresponding to ⟨FL⟩ = 1/2 were extrapolated toward an infinite L to predict Tc. Equation (2) explicitly shows that this is equivalent to following a constant G∗(tL^{1/ν}), which leads to the line of TL = Tc + aL^{−1/ν}, where the specific value of 1/2 does not play any role. Therefore, the key feature that guarantees the physically meaningful Tc prediction is whether or not the neural network properly approximates the scaling form F∗ of the output function. While the numerical training can be affected in practice by the details of the data preparation such as the temperature grid spacing, the accuracy of the F∗ form determines the quality of the training in this particular learning problem with the data of the Ising model.
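In practice, the extrapolation described above amounts to reading off the temperature of a fixed cut for each L and fitting against L^{−1/ν}. The following is a minimal sketch, assuming NumPy, that ⟨FL⟩ has been measured on a fine temperature grid for each size, and that it rises monotonically through the transition; the function names and the fitting choice are ours.

import numpy as np

def t_at_cut(temps, avg_F, cut=0.5):
    """Temperature at which <F_L> crosses a chosen cut (assumes <F_L> increasing in T)."""
    return np.interp(cut, avg_F, temps)

def extrapolate_tc(sizes, temps_by_L, avg_F_by_L, nu=1.0, cut=0.5):
    """Fit T_L = Tc + a * L**(-1/nu), i.e. follow a constant G*(t L^{1/nu}) to L -> infinity."""
    TL = np.array([t_at_cut(temps_by_L[L], avg_F_by_L[L], cut) for L in sizes])
    x = np.asarray(sizes, dtype=float) ** (-1.0 / nu)
    slope, Tc = np.polyfit(x, TL, 1)       # the intercept at x = 0 is the Tc estimate
    return Tc

In line with Eq. (2), changing the value of the cut shifts the slope of the fit but not its intercept.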

We emphasize that F∗ and p∗, the two constituents of Eq. (2), correspond to the two different data sets of the training and testing systems, respectively. Because the validity of Eq. (2) requires the same critical exponents for both of the training and testing data sets, this condition precisely defines the limit of the applicability, indicating that the prediction works for unseen data only in a particular group of systems of the same criticality. For instance, as previously demonstrated in Ref. [21], the neural network trained for the Ising model in the square lattices fails to predict a correct transition point for the data of the three-dimensional lattices. If the test and training systems are not in the same universality class of the Ising critical exponents, a crossing point between the ⟨FL⟩ curves cannot be properly identified, and the extrapolation of TL at a fixed ⟨FL⟩ gives a different Tc depending on a choice of the value of ⟨FL⟩.

An important question is then how the neural network comes to approximate the universal kernel F∗(|m|L^{β/ν}) in the supervised learning of the binary phase labels. The functional form suggests that the network is trained to read the order parameter from the input, which is consistent with the previous observations [10, 18, 21]. Thus, we may consider a picture that is schematically shown in Fig. 1(c), where m is assumed to be transmitted with the trivial link weights 1/N from the input to the hidden layer that belongs to the gray box. In the following section, we present a minimal neural network model of the gray box to show how the critical behavior of the training data leads to such a scaling form of the output function.

4. One-free-parameter neural network model

4.1. Previous two-node model and further simplification

In our previous work [21], we introduced a model network with the downsized hidden layer of just two hidden neurons receiving the explicit order parameter and demonstrated that it did not lose any predicting power and accuracy. In Fig. 1(b), we verify that it

indeed produces the expected form F∗ of its output, explaining the high accuracy of the critical point predicted by this downsized network in the previous work. However, despite the simple structure that allows analytic treatment to some extent, our previous two-node model is not mathematically transparent enough to see how the exact form of F∗(|m|L^{β/ν}) emerges from the learning. The technical difficulties in our previous analytic approach stem from the use of the sigmoid activation function assigned to both of the hidden and output neurons. While this is a common setting for a usual large-size neural network to be trained for a binary classifier, it leads to an output function that can be written as

FL(m) = f[4ΛL f(m − µL) + 4ΛL f(−m − µL) − 2ΛL] ,   (3)

where f(x) = (1 + tanh(x/2))/2 is the sigmoid function, which is plotted in Fig. 1(b) with the parameters trained in the square lattices. We previously derived the system-size scaling behavior of the two neural parameters as µL ∼ L^{−2β/ν} and ΛL ∼ L^{2β/ν} by ignoring the order parameter fluctuation in the input data set, which was a crude assumption since the fluctuations of m are severe near the critical point. With careful approximations with expansions for m around FL = 1/2, the form of F∗(|m|L^{β/ν}) might be justifiable, but a simpler and more intuitive model would certainly be preferred to provide a more transparent picture of the learning process. Thus, we present a minimally simple one-parameter model of the gray box in Fig. 1(c) generating a very simple step-wise output function,

F(m; ε) = Θ(m + ε) − Θ(m − ε) ,   (4)

where Θ(x) is the Heaviside step function. The corresponding neural network structure is sketched in Fig. 1(e). The one with the sigmoid neuron given in Fig. 1(d) is its differentiable version, which is equivalent to the Heaviside one in the limit of large a and b with ε ≡ b/a being finite. Note that the output neuron simply reduces the signal from the hidden layer without any activation function and bias being involved. It is trivial to see that the desired form of F∗(|m|L^{β/ν}) would appear if the free parameter is given as εL ∝ L^{−β/ν}, which we show below indeed occurs in the supervised learning with the cross entropy minimization.
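The equivalence between the two versions of the output function is easy to check numerically. The sketch below (our illustration, with an arbitrarily chosen a and ε = b/a = 0.1) writes out the Heaviside form of Eq. (4) and the differentiable form of Fig. 1(d), given explicitly in Eq. (6) below, and confirms that they coincide once a is large.

import numpy as np

def F_step(m, eps):
    """Heaviside version, Eq. (4): equals 1 for |m| < eps and 0 for |m| > eps."""
    return np.heaviside(m + eps, 0.5) - np.heaviside(m - eps, 0.5)

def F_sigmoid(m, a, b):
    """Differentiable version of Fig. 1(d); depends only on eps = b/a when a is large."""
    eps = b / a
    return 0.5 * (np.tanh(a * (m + eps) / 2.0) - np.tanh(a * (m - eps) / 2.0))

m = np.linspace(-1.0, 1.0, 2001)
diff = np.abs(F_sigmoid(m, a=1e5, b=1e4) - F_step(m, eps=0.1))
away = np.abs(np.abs(m) - 0.1) > 2e-3     # exclude the jump points |m| = eps themselves
print(diff[away].max())                    # essentially zero in the large-a limit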

4.2. Scaling solution of the free parameter

The binary phase label of the data is expressed as Q(T ) ≡ Θ(T − Tc) with the critical

point Tc being given for the supervision. The training data set is represented by the probability distribution function pL(m, T) of the order parameter m at a temperature

T in the system of size L. The temperature range of the data set can be given as

T ∈ [Tl, Th] where Tl ≪ Tc ≪ Th. The cross entropy is then rewritten as

L(ε) = − ∫_{Tl}^{Th} dT ∫_{−∞}^{∞} dm pL(m, T) [Q(T) ln F(m, ε) + [1 − Q(T)] ln[1 − F(m, ε)]] .   (5)

For the mathematical convenience, we first employ a differentiable version of the output function F shown in Fig. 1(d) that is written as

F(m, ε ≡ b/a) = (1/2)[tanh(a(m + ε)/2) − tanh(a(m − ε)/2)] .   (6)

While the two parameters a and b appear in this expression, it is effectively a one-parameter model because the output function essentially depends on the ratio ε ≡ b/a in the limit of large a and b that we assume. Taking the derivative of L with respect to ε in the limit of large a, we obtain an integral equation,

∫_{Tc}^{Th} dT ∫_{ε}^{∞} dm pL(m, T) = ∫_{Tl}^{Tc} dT ∫_{0}^{ε} dm pL(m, T) ,   (7)

where we assume that pL is an even function of m as we consider the unbiased preparation of the training data set preserving the Ising symmetry. Equation (7) is solvable for the system-size scaling behavior of ε under ideal training conditions where the training data is in the canonical ensemble and uniformly available at all temperatures. While such a training data set is typically considered in the Monte Carlo simulations, it also allows a fully analytic treatment based on the standard finite-size-scaling ansatz of pL near the critical point Tc.

While Th and Tl can be given to be arbitrarily far from Tc, their specific values are unimportant because ∫_{ε}^{∞} pL(m, T) dm is only meaningful in the critical area. On the right-hand side of Eq. (7), pL is sharply peaked at |m| = 1 deep in the ordered phase (T ≪ Tc), leading to a negligibly small value of ∫_{0}^{ε} pL dm if ε is much less than one. On the other hand, on the left-hand side, at T ≫ Tc in the disordered phase, pL is governed by the central limit theorem, and then the integral ∫_{ε}^{∞} pL dm decays asymptotically as exp(−Nε^2)/(√N ε) if √N ε increases with the number of spins N. Therefore, provided that ε decreases with the system size L while √N ε increases

with L, we can replace Tl and Th with effective bounds of the critical area. The width of the critical area decreases with increasing L, suggesting that one must consider the finer grids of temperature for the data of the larger system in a numerical approach. In the analytic calculations, all temperatures are available in the training data set. Considering the temperature integration over the critical area of t ∈ [−δt, δt], where

t ≡ (T − Tc)/Tc denotes the reduced temperature, the probability distribution function pL(m, T) near Tc can be expressed as pL ≡ L^{β/ν} p∗(mL^{β/ν}, tL^{1/ν}) by the standard finite-size-scaling ansatz. Then, Eq. (7) can be rewritten as

∫_{0}^{δtL^{1/ν}} dτ ∫_{εL^{β/ν}}^{∞} dx p∗(x, τ) = ∫_{−δtL^{1/ν}}^{0} dτ ∫_{0}^{εL^{β/ν}} dx p∗(x, τ) ,   (8)

where the change of variables is performed for x ≡ mL^{β/ν} and τ ≡ tL^{1/ν}. This equation holds for an arbitrarily large system to be trained. With the scale invariance of the equation for an arbitrary L being imposed, the width of the critical area is clarified to be δt ∼ L^{−1/ν}, and more importantly, the neural parameter has to behave as ε ∼ L^{−β/ν}. This system-size scaling behavior of ε directly leads to the expected form of the network output F∗(|m|L^{β/ν}) that we have discussed above.

One can verify that the resulting scaling solution of ε ∼ L^{−β/ν} indeed validates the assumption that √N ε would increase as L increases, which we have used in the derivation. In d dimensions, it is rewritten as √N ε ∼ L^{d/2 − β/ν}. It is easy to see that the condition (d/2 − β/ν) > 0 follows from the hyperscaling relations, which indicates its equivalence to γ > 0 of the susceptibility divergence.

Instead of using the differentiable version of F and taking the limit of infinite a, one can also introduce a small shift 0 < δ ≪ 1 directly to Eq. (4) to avoid the undefined evaluations of ln F and ln(1 − F) as

F(m, ε) = (1 − δ)[Θ(m + ε) − Θ(m − ε)] + δ ,   (9)

which can be trivially implemented in the model sketched in Fig. 1(e) by adjusting the link weights and the bias of the output neuron. After ln δ is factored out, the cross entropy can be rewritten as

L(ε)/(2|ln δ|) = ∫_{Tc}^{Th} dT ∫_{ε}^{∞} dm pL(m, T) + ∫_{Tl}^{Tc} dT ∫_{0}^{ε} dm pL(m, T) ,   (10)

where δ does not affect the optimization. Following the procedures that we have shown above, the minimization of the cross entropy can then be rewritten, with the temperature integration range effectively being limited to the area around a given Tc, as

∫_{0}^{τo} p∗(εL^{β/ν}, τ) dτ = ∫_{−τo}^{0} p∗(εL^{β/ν}, τ) dτ ,   (11)

where τo ≡ δtL^{1/ν} denotes the effective width of the critical area normalized with the system size. The scale-invariant solution of this equation that holds for an arbitrarily large L provides the same finite-size-scaling behavior of ε ∼ L^{−β/ν} and thereby produces the universal kernel of the F∗ function of the network output.

In addition, we numerically verify the derived scaling solution of ε ∼ L^{−β/ν} in the system on the square lattices. The input data for the training is provided from the estimate of the probability distribution pL(m, T) that is directly given by the Wang-Landau sampling of the joint density of states [21, 52, 53]. Since the Wang-Landau estimate of pL provides unlimited access to all temperatures, one can numerically evaluate Eq. (5) and perform the minimization. Figure 2 shows the system-size scaling of the trained parameter b/a for the model of Fig. 1(d) and ε for the model of Fig. 1(e), respectively. Because m is discrete in a finite system, the use of the Heaviside step function causes a range of ε corresponding to the same value of the cross entropy. The error bars in Fig. 2(b) present such ranges of ε, showing that the degeneracy diminishes as the system size gets larger. Both of the two numerical training results show excellent agreement with our derivation of ε ∼ L^{−β/ν} with the critical exponent β/ν = 1/8 being in the universality class of the two-dimensional Ising model.
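The same minimization can be carried out numerically when pL(m, T) is available only on a grid, as in the Wang-Landau construction used here. The following is a minimal sketch with hypothetical array names, evaluating the δ-independent part of the cross entropy of Eq. (10) over candidate values of ε:

import numpy as np

def train_eps(m_grid, T_grid, p_LmT, Tc):
    """Cross-entropy training of the single parameter eps, following Eq. (10).

    p_LmT : array of shape (len(T_grid), len(m_grid)) holding p_L(m, T),
            normalized over m at each temperature of the grid.
    """
    hot = T_grid > Tc                      # Q = 1 (disordered) labels
    cold = T_grid < Tc                     # Q = 0 (ordered) labels
    best_eps, best_loss = None, np.inf
    for eps in np.unique(np.abs(m_grid)):
        outside = np.abs(m_grid) > eps     # F ~ 0 there, penalized for the Q = 1 data
        inside = ~outside                  # F ~ 1 there, penalized for the Q = 0 data
        loss = p_LmT[hot][:, outside].sum() + p_LmT[cold][:, inside].sum()
        if loss < best_loss:
            best_eps, best_loss = eps, loss
    return best_eps

Repeating the training for several sizes and fitting log(εL) against log(L) should then give a slope close to −β/ν, i.e. −1/8 for the square lattices, as in Fig. 2.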


Figure 2. Parameters of the minimal neural network models determined by training with the Wang-Landau data in the square lattices. The scaling behavior is plotted as a function of the linear dimension L of the training system for (a) the ratio b/a of the model in Fig. 1(d) and (b) the parameter ε of the model in Fig. 1(e).

4.3. Fluctuations in the input layer extracting the order parameter

The minimal network model presented above has the designed input layer with the fixed link weights wi = 1/N, transmitting the explicit order parameter m = Σ_i wi si from the input of the spin configuration s ≡ {si}. This design provides the simplest model that corroborates the observation from the large-scale network trained under the null hypothesis [10, 21]. From a practical point of view, discarding the uncertainty of the link weights helps to remove the training noises and the numerical overfitting, as implied in the comparison of the output functions between Fig. 1(a) and Fig. 1(b), eventually reducing the error in the Tc estimate. Still, it is an interesting question to ask how stable this ideal model of the input weight wi = 1/N would be if variations of wi are allowed in the training and also how the fluctuations of wi would depend on the details of the training data preparation.

Introducing an additional set of undetermined parameters w = {wi} for m = Σ_i wi si in the minimal model, we obtain wi by solving the equation (∂L/∂wi)|_{ε=εL} = 0 with the parameter ε being fixed at the ideal training solution εL that we have already obtained from (∂L/∂ε)|_{wi=1/N} = 0. We employ the stochastic gradient method in the scheme of the online learning [61] in combination with the Wolff cluster update algorithm to sample the data of spin configurations in the square lattices. The temperature grids of the

training data are set in the range of [Tc/2, 3Tc/2] with the spacing of ∆T. Figure 3 displays how the probability distribution of the weights depends on the system size and the temperature grid spacing of the training data prepared. It turns out that the resulting distribution P(w) is bell-shaped with a well-defined average at the ideal value of w̄ = 1/N. The ratio of the standard deviation and the average of

P(w) represents the magnitude of fluctuations in wi, which increases as the system size gets larger but decreases as the temperature grid spacing ∆T gets smaller. This observation hints at a proper data preparation for a more accurate prediction of

Tc in the numerical training. The observed behavior of P(w) implies that the training of the larger system would need the finer grids in temperature for the training data set to suppress the noises in the final form of the network output F∗(|m|L^{β/ν}), as this affects its accuracy as a Tc locator. This test indicates the importance of the thorough coverage of the critical area in the training data set, which in some sense makes the machine learning less magical but is reasonable in terms of statistical physics because finding a genuine critical point cannot be separated from the critical behavior of the system.
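A minimal sketch of the stochastic update used to probe these weight fluctuations is given below; it assumes the differentiable output of Eq. (6) evaluated at the fixed ε = εL, one labeled configuration per step, and a learning rate chosen here only for illustration.

import numpy as np

def sgd_step(w, s, Q, eps, a=100.0, lr=1e-3):
    """One online update of the input weights w at fixed eps.

    s : one spin configuration (entries +/-1), Q : its label (0 ordered, 1 disordered).
    """
    m = np.dot(w, s)                                          # order-parameter estimate
    up, dn = a * (m + eps) / 2.0, a * (m - eps) / 2.0
    F = np.clip(0.5 * (np.tanh(up) - np.tanh(dn)), 1e-9, 1.0 - 1e-9)
    dF_dm = (a / 4.0) * ((1.0 - np.tanh(up) ** 2) - (1.0 - np.tanh(dn) ** 2))
    dL_dF = -(Q / F - (1.0 - Q) / (1.0 - F))                  # from the cross entropy of Eq. (1)
    return w - lr * dL_dF * dF_dm * s                         # dL/dw_i = dL/dF * dF/dm * s_i

Iterating updates of this kind over configurations sampled across the temperature grid yields weight statistics of the type examined in Fig. 3.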


Figure 3. Weight fluctuations of the input links extracting the order parameter. The statistics of the link weights w connecting the input and hidden layers is examined by using the stochastic training in the model of the Heaviside neurons in Fig. 1(e) with the data of spin configurations in the square lattices. (a) The system-size dependence of the link weight distribution P(w) for a given temperature spacing of ∆T = 0.0001 of the Monte Carlo training data. (b) The average w̄ (symbols) and standard deviation σw (error bars) plotted as a function of the system size L for the training data prepared with ∆T = 0.0001. The panels (c) and (d) indicate that the fluctuations of w decrease as the temperature spacing ∆T gets smaller.

5. Learning the Landau mean-field theory of the Ising model

For the demonstration of how the prediction works on unseen data from a different underlying geometry, we choose the Landau mean-field free energy as a generator of the training data and then apply it to the test data produced on a scale-free graph as an underlying geometry of the Ising model. By using the analytically trained one-parameter network model, we attempt to locate the critical point of the Ising model in

the random scale-free graph with a large degree exponent that is known to be in the mean-field class [62–64]. For the order parameter m with the Ising symmetry, the Landau mean-field free

energy per spin can be written at T near the critical point Tc as

f(m, t) = f0 + a2 t m^2 + a4 m^4 ,   (12)

where t ≡ (T − Tc)/Tc denotes the reduced temperature, and a2 and a4 are positive constants. In the system of N spins, the corresponding probability distribution of the order parameter m near Tc is written directly from the Landau free energy as

pN(m, t) ∝ N^{1/4} exp[−N(a2 t m^2 + a4 m^4)/Tc] ,   (13)

which leads to the finite-size-scaling form,

pN(m, t) = N^{1/4} p∗(mN^{1/4}, tN^{1/2}) .   (14)

One can verify the mean-field exponents β = 1/2 and ν̄ = 2 in the comparison with the standard finite-size-scaling ansatz pN = N^{β/ν̄} p∗(mN^{β/ν̄}, tN^{1/ν̄}). We do not consider any possibility of the logarithmic corrections, and we strictly limit our demonstration to the class of systems that is described by the Landau free energy in Eq. (12).
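The scaling form of Eq. (14) can be checked directly: substituting x ≡ mN^{1/4} and τ ≡ tN^{1/2} into Eq. (13) gives

pN(m, t) ∝ N^{1/4} exp[−(a2 τ x^2 + a4 x^4)/Tc] ≡ N^{1/4} p∗(mN^{1/4}, tN^{1/2}) ,

from which β/ν̄ = 1/4 and 1/ν̄ = 1/2 follow at once.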

Provided the scaling form of the probability distribution pN(m, t) of a training data set, it is now straightforward to obtain the scaling behavior of the parameter ε in the trained network. Following the analytic minimization of the cross entropy given in Sec. 4, we can write down the scaling solution as εN = ε0 N^{−1/4} by just putting N^{β/ν̄} in place of L^{β/ν}. The constant ε0 can be determined by the detail of pN(m, t) and the range of

temperature of the training data set. However, a particular value of ε0, which we just set to be unity in the calculation below, is unimportant to the performance of the neural network because it does not affect the form of the output function FN(m) = F∗(|m|N^{β/ν̄}) that works as a universal kernel.
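The operation of Eq. (2) in this mean-field setting can also be illustrated numerically. The snippet below is a rough sketch with arbitrary constants a2 = a4 = Tc = 1 and ε0 = 1 (not the parameters of the actual test below); it builds pN(m, t) from Eq. (13), applies the step kernel with εN = ε0 N^{−1/4}, and shows that the curves of ⟨FN⟩ versus t for different N meet at t = 0.

import numpy as np

def avg_output(N, t, a2=1.0, a4=1.0, Tc=1.0, eps0=1.0, n_m=4001):
    """<F_N> at reduced temperature t for the Landau distribution of Eq. (13)."""
    m = np.linspace(-1.0, 1.0, n_m)
    dm = m[1] - m[0]
    w = np.exp(-N * (a2 * t * m**2 + a4 * m**4) / Tc)
    p = w / (w.sum() * dm)                                 # normalized p_N(m, t)
    F = (np.abs(m) < eps0 * N ** (-0.25)).astype(float)    # step kernel with eps_N = eps0 N^{-1/4}
    return (p * F).sum() * dm

for N in (1000, 4000, 16000):
    print(N, [round(avg_output(N, t), 3) for t in (-0.02, 0.0, 0.02)])
# The t = 0 column is (nearly) independent of N: the curves cross at Tc.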

We examine the accuracy of the Tc prediction with the data from the uncorrelated random scale-free graph model [65]. In the graph of underlying vertices of the Ising spins, a vertex is randomly connected to some number of other vertices by the edges representing the exchange interactions between the residing spins. The number of the edges from a vertex, referred to as the degree, follows a power-law distribution p(k) ∼ k^{−γ} with a degree exponent γ. It is known that when γ > 5, the Ising model on this scale-free graph exhibits the mean-field critical exponents [62–64]. Here we examine the scale-free graph with the degree exponent γ = 6.5 and the minimum degree 4. The degree of each site is given as the greatest integer less than or equal to the value drawn randomly from the power-law distribution. The test input data set consists of 10,000 spin configurations per graph sample obtained from the Wolff cluster updates at each temperature, and 50,000 random graph samples are included in the test data set.

Figure 4 shows the output of the one-parameter model averaged over the test inputs of the Ising spin data prepared on the scale-free graph geometry. The crossing point between the curves of different system sizes provides the estimate of the critical point Tc = 3.6595(5), which is in good agreement with the standard detection using the fourth-order cumulant that gives Tc = 3.6601(2).


Figure 4. Critical point detection by applying the mean-field-trained neural network to the data of the Ising model on the uncorrelated scale-free graph with the degree exponent γ = 6.5. (a) The network output ⟨F⟩ averaged over the inputs of Monte Carlo data sampled at each temperature. (b) The fourth-order cumulant as a function of temperature given for comparison. Each data point is an average over the ensemble of 50,000 random graph samples, and the error bars (not shown) are smaller than the marker size. The vertical dotted lines indicate the exact location of the critical point.

The exact critical point for the uncorrelated random scale-free graph with γ > 5 was derived previously in Ref. [62] as Tc = 2/ln[⟨k^2⟩/(⟨k^2⟩ − 2⟨k⟩)], which becomes Tc ≈ 3.6599 for our degree sequences of the scale-free graph samples examined. These results demonstrate the operation of Eq. (2) in the mean-field regime with the training of the Landau free energy and the Monte Carlo test data generated on the random scale-free graph.

In the previous work [21], we did similar tests with the two-node model that has two free parameters, presenting the cases where the training and test data sets are in the same and different universality classes. The validity of

the Tc prediction is only guaranteed when the training and test data sets are in the same universality class while the underlying geometries are not necessarily the same.

We argue that Eq. (2) is the simple physical basis that explains the valid Tc prediction with the supervised learning on the Ising model, which can be implemented by using just a single free parameter of the neural network.
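For reference, the number quoted above can be reproduced approximately from a sampled degree sequence alone. The sketch below is a rough illustration only: it draws degrees from the continuous power law with γ = 6.5 and minimum degree 4, floors them to integers as described in Sec. 5, and evaluates the formula of Ref. [62]; the structural cutoff and the edge-matching details of the construction in Ref. [65] are ignored here.

import numpy as np

def predicted_tc(n_sites=256000, gamma=6.5, k_min=4, seed=0):
    """Tc = 2 / ln[<k^2> / (<k^2> - 2<k>)] of Ref. [62] for one sampled degree sequence."""
    rng = np.random.default_rng(seed)
    u = rng.random(n_sites)
    k = np.floor(k_min * (1.0 - u) ** (-1.0 / (gamma - 1.0)))   # inverse-transform sampling, floored
    k1, k2 = k.mean(), (k ** 2).mean()
    return 2.0 / np.log(k2 / (k2 - 2.0 * k1))

print(predicted_tc())   # roughly 3.65-3.66 for this degree exponent and minimum degree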

6. Conclusions

We have investigated the connection between the data-driven prediction of a critical point based on the supervised learning in the ferromagnetic Ising model and the standard finite-size-scaling theory of the second-order phase transition. It turns out that the scaling form F∗(|m|L^{β/ν}) emerging in the network output is the source of the predicting power, which works as a universal kernel guaranteeing a physically legitimate estimate of a critical point for unseen test data from different lattices but in the same universality class as the training data. We have shown that a minimal network with just one free parameter suffices to model such emergence of the scaling form in the minimization of the

cross entropy. For the training data unbiasedly sampled from the canonical ensemble, we have derived the analytic scaling solution of the one-parameter model that leads to the expected scaling form of the output function. For the numerical demonstration, we have considered the Landau mean-field free energy as a generator of the training data and verified that it accurately locates the critical point on the random uncorrelated scale-free graph that belongs to the mean-field class. While we have shown that the conventional finite-size-scaling ansatz can be implemented in a very simple data-driven way with just one parameter being learned, the model benefits from the simple order parameter structure of the Ising model that is extracted easily, as empirically observed in the large-size neural network. Possible directions for future studies include the generalization to more complex symmetries of an order parameter and the interpretation of the learning in a broader range of phase transitions and critical phenomena in complex systems.

Acknowledgments

This work was supported by the Basic Science Research Program through the National Research Foundation of Korea funded by the Ministry of Science and ICT (NRF-2019R1F1A1063211) and also by a GIST Research Institute (GRI) grant funded by GIST.

References

[1] Carleo G, Cirac I, Cranmer K, Daudet L, Schuld M, Tishby N, Vogt-Maranto L and Zdeborová L 2019 Rev. Mod. Phys. 91 045002
[2] Ohtsuki T and Mano T 2020 J. Phys. Soc. Jpn. 89 022001
[3] Zdeborová L 2020 Nat. Phys. 16 602–604
[4] Carrasquilla J 2020 Adv. Phys. X 5 1797528
[5] Bedolla-Montiel E A, Padierna L C and Castañeda-Priego R 2020 J. Phys.: Condens. Matter 33 053001
[6] Cybenko G 1989 Math. Control Signals Syst. 2 303–314
[7] Hornik K, Stinchcombe M and White H 1989 Neural Netw. 2 359–366
[8] Hornik K 1991 Neural Netw. 4 251–257
[9] Leshno M, Lin V Ya, Pinkus A and Schocken S 1993 Neural Netw. 6 861–867
[10] Carrasquilla J and Melko R G 2017 Nat. Phys. 13 431–434
[11] van Nieuwenburg E P L, Liu Y-H and Huber S D 2017 Nat. Phys. 13 435–439
[12] Wang L 2016 Phys. Rev. B 94 195105
[13] Ohzeki M 2016 J. Phys. Soc. Jpn. 85 123706
[14] Ohtsuki T and Ohtsuki T 2017 J. Phys. Soc. Jpn. 86 044708
[15] Tanaka A and Tomiya A 2017 J. Phys. Soc. Jpn. 86 063001
[16] Hu W, Singh R R P and Scalettar R T 2017 Phys. Rev. E 95 062112
[17] Wetzel S J 2017 Phys. Rev. E 96 022140
[18] Wetzel S J and Scherzer M 2017 Phys. Rev. B 96 184410
[19] Ponte P and Melko R G 2017 Phys. Rev. B 96 205146
[20] Suchsland P and Wessel S 2018 Phys. Rev. B 97 174435
[21] Kim D and Kim D-H 2018 Phys. Rev. E 98 022138
[22] Iso S, Shiba S and Yokoo S 2018 Phys. Rev. E 97 053304

[23] Huembeli P, Dauphin A and Wittek P 2018 Phys. Rev. B 97 134109
[24] Liu Y-H and van Nieuwenburg E P L 2018 Phys. Rev. Lett. 120 176401
[25] Beach M J S, Golubeva A and Melko R G 2018 Phys. Rev. B 97 045207
[26] Vargas-Hernández R A, Sous J, Berciu M and Krems R V 2018 Phys. Rev. Lett. 121 255702
[27] Mills K and Tamblyn I 2018 Phys. Rev. E 97 032119
[28] Morningstar A and Melko R G 2018 J. Mach. Learn. Res. 18 1–17
[29] Li C-D, Tan D-R and Jiang F-J 2018 Ann. Phys. NY 391 312–331
[30] Kashiwa K, Kikuchi Y and Tomiya A 2019 Prog. Theor. Exp. Phys. 2019 083A04
[31] Zhang W, Liu J and Wei T-C 2019 Phys. Rev. E 99 032142
[32] Casert C, Vieijra T, Nys J and Ryckebusch J 2019 Phys. Rev. E 99 023304
[33] Zhang W, Wang L and Wang Z 2019 Phys. Rev. B 99 054208
[34] Greitemann J, Liu K and Pollet L 2019 Phys. Rev. B 99 060404(R)
[35] Liu K, Greitemann J and Pollet L 2019 Phys. Rev. B 99 104410
[36] Greitemann J, Liu K, Jaubert L D C, Yan H, Shannon N and Pollet L 2019 Phys. Rev. B 100 174408
[37] Liu K, Sadoune N, Rao N, Greitemann J and Pollet L 2020 arXiv:2004.14415
[38] Rao N, Liu K and Pollet L 2020 arXiv:2007.07000
[39] Kiwata H 2019 Phys. Rev. E 99 063304
[40] Efthymiou S, Beach M J S and Melko R G 2019 Phys. Rev. B 99 075113
[41] Li Z, Luo M and Wan X 2019 Phys. Rev. B 99 075418
[42] Dong X-Y, Pollmann F and Zhang X-F 2019 Phys. Rev. B 99 121104(R)
[43] Canabarro A, Fanchini F F, Malvezzi A L, Pereira R and Chaves R 2019 Phys. Rev. B 100 045129
[44] Giannetti C, Lucini B and Vadacchino D 2019 Nucl. Phys. B 944 114639
[45] Lee S S and Kim B J 2019 Phys. Rev. E 99 043308
[46] Shiina K, Mori H, Okabe Y and Lee H K 2020 Sci. Rep. 10 2177
[47] Blücher S, Kades L, Pawlowski J M, Strodthoff N and Urban J M 2020 Phys. Rev. D 101 094507
[48] D’Angelo F and Böttcher L 2020 Phys. Rev. Res. 2 023266
[49] Munoz-Bauza H, Hamze F and Katzgraber H G 2020 J. Stat. Mech. 2020 073302
[50] Veiga R and Vicente R 2020 arXiv:2006.10176
[51] Roscher R, Bohn B, Duarte M F and Garcke J 2020 IEEE Access 8 42200–42216
[52] Wang F and Landau D P 2001 Phys. Rev. Lett. 86 2050–2053
[53] Wang F and Landau D P 2001 Phys. Rev. E 64 056101
[54] Landau D P, Tsai S-H and Exler M 2004 Am. J. Phys. 72 1294–1302
[55] Wolff U 1989 Phys. Rev. Lett. 62 361–364
[56] Nielsen M A 2015 Neural Networks and Deep Learning (Determination Press)
[57] Goodfellow I, Bengio Y and Courville A 2016 Deep Learning (Cambridge, MA: MIT Press)
[58] Binder K 1981 Z. Physik B - Condensed Matter 43 119–140
[59] Bruce A D 1981 J. Phys. C: Solid State Phys. 14 3667–3688
[60] Nicolaides D and Bruce A D 1988 J. Phys. A: Math. Gen. 21 233–244
[61] Bottou L 1998 Online Algorithms and Stochastic Approximations Online Learning and Neural Networks (Cambridge: Cambridge University Press)
[62] Dorogovtsev S N, Goltsev A V and Mendes J F F 2002 Phys. Rev. E 66 016104
[63] Goltsev A V, Dorogovtsev S N and Mendes J F F 2003 Phys. Rev. E 67 026123
[64] Hong H, Ha M and Park H 2007 Phys. Rev. Lett. 98 258701
[65] Catanzaro M, Boguñá M and Pastor-Satorras R 2005 Phys. Rev. E 71 027103