OU-HET-1003

AdS/CFT as a deep Boltzmann machine

Koji Hashimoto
Department of Physics, Osaka University, Toyonaka, Osaka 560-0043, Japan
(Dated: March 13, 2019)

We provide a deep Boltzmann machine (DBM) for the AdS/CFT correspondence. Under the philosophy that the bulk spacetime is a neural network, we give a dictionary between the two, and obtain a restricted DBM as a discretized bulk scalar field theory in curved geometries. The probability distribution as training data is the generating functional of the boundary quantum field theory, and it trains neural network weights which are the metric of the bulk geometry. The deepest layer implements black hole horizons, and an employed regularization for the weights is an Einstein action. A large Nc limit in holography reduces the DBM to a folded feed-forward architecture. We also neurally implement holographic renormalization into an autoencoder. The DBM for the AdS/CFT may serve as a platform for studying mechanisms of spacetime emergence in holography.

I. INTRODUCTION

Deep Boltzmann machines [1] are a particular type of neural networks [2–4] for modeling probabilistic distributions of data sets. They are equipped with deep layers of units in their neural network architecture, and are a generalization of Boltzmann machines [5], which are one of the fundamental models of neural networks. Deepening the architecture enlarges the representation power of the models, and recent advances in training deep models were initiated by analogues of the deep Boltzmann machines. The neural network of a deep Boltzmann machine consists of visible units and hidden units. On those units binary variables live, and they interact with each other under a Hamiltonian called an energy function. Thus basically the deep Boltzmann machine is an Ising model in which spins only at a boundary layer are visible (observable), and the Hamiltonian allows inhomogeneity and nonlocality. For a given probability distribution of the observed spin configurations at the boundary layer, Ising bond strengths (called "weights") in the model Hamiltonian are trained to approximate the given distribution; that is the deep learning of the deep Boltzmann machine. The training determines the weights automatically, and a structure of the Hamiltonian emerges. Efficient algorithms for the training [6] accelerated the progress in deep learning.

In this paper we study a relation between the deep Boltzmann machines and the AdS/CFT correspondence [7–9] in quantum gravity. The AdS/CFT correspondence is a holographic duality between a (d+1)-dimensional quantum gravity and a d-dimensional quantum field theory (QFT) without gravity. The latter lives at the boundary of the gravitational spacetime of the former. From the viewpoint of the QFT, the direction perpendicular to the boundary surface is an "emergent" space direction. Therefore, the aforementioned structure of the deep Boltzmann machines suits the scheme of the AdS/CFT, once we identify their visible layers with the QFT, and the hidden layers as the bulk spacetime. See Fig. 1. The trained weights are interpreted as the metric function of the bulk geometry. We detail the relation between the two schemes, both of which are renowned independently in different sciences.

A motivation to bring them together also comes from recent progress in discretization of the AdS/CFT. Popular toy models of the AdS/CFT a la quantum information use MERA [10] and other tensor networks [11]. In the first place, quantum gravity has a long history of Regge calculus [12] and dynamical triangulation [13] where spacetimes are approximated by networks. For formulating quantum gravity, we need a dynamical network whose structure is determined in a self-organized manner. In that view, neural network architecture may provide a novel platform for quantum gravity and the emergent spacetime.

FIG. 1: Top: the AdS/CFT correspondence. The horizontal coordinate z is the emergent spatial direction. Bottom: a deep Boltzmann machine. Circles are units and lines are weights. Double circles are visible units. It has a layered structure, and "deep" means that they have many layers toward the right direction in the figure.

We show that the AdS/CFT correspondence naturally fits the scheme of the deep Boltzmann machines, where the bulk spacetime geometry is reinterpreted as a sparse neural network. We construct explicitly a deep Boltzmann machine architecture which represents an example of the AdS/CFT correspondence. Previously, in [17, 18] a possibility of relating hidden variables of Boltzmann machines to bulk fields was mentioned [40]. In [14], the entanglement feature of a free fermion chain was trained at a random tensor network as a deep Boltzmann machine. The holographic interpretation of feed-forward deep neural networks was proposed and studied in [15, 16] for training QFT and QCD linear response functions, and the obtained emergent bulk spacetime for large Nc QCD exhibits interesting physical properties and computes other observables as predictions. While our work here naturally relates to these works, in this paper we concentrate on deep Boltzmann machines as an AdS/CFT correspondence.

Although at first sight the two schemes look similar, in details they possess different characteristics. For example, since deep Boltzmann machines have constraints for their architecture and trainability, we need a careful discretization of bulk field theories. In addition, the AdS/CFT correspondence is well-understood at the large Nc limit, while that limit has not been studied in the Boltzmann machines. Furthermore, generalization in deep learning owes to degenerate sets of trained weights, while in the AdS/CFT that degeneration is not expected. In this paper we address these basic questions raised in relating the two schemes. We provide a concrete expression of a deep Boltzmann machine which satisfies the standard constraints, and find that the large Nc limit brings the Boltzmann machine to a folded feed-forward architecture. We propose an Einstein action as a regularization of training to distinguish sets of weights to be interpreted as a smooth spacetime.

The organization of this paper is as follows. In Sec. II, we briefly review deep Boltzmann machines. In Sec. III, we provide a dictionary of the deep Boltzmann machines and the AdS/CFT, and construct a deep Boltzmann machine for a bulk scalar field theory in generic curved geometry. Discretization of the fields and the spacetime, and also the properties at the deepest layer, are studied. In Sec. IV, we apply the standard large Nc limit (saddle point approximation) to the deep Boltzmann machine and see the consistency with the holographic linear response. In Sec. V, we propose how we identify weights interpreted as a spacetime, through a regularization using the Einstein action. Sec. VI is devoted to our summary and discussions. In Appendix A, we provide an autoencoder-like neural network architecture for holographic renormalization.

II. BRIEF REVIEW OF BOLTZMANN MACHINE

Boltzmann machines in machine learning are a network model for giving a probability distribution P(v_i) of the input variables v_i ∈ {0, 1} for i = 1, · · · , n. The probability P(v_i) of the Boltzmann machine is defined by

P(v_i) = \exp[-E(v_i)]   (1)

with an energy function E(v_i) given by

E(v_i) \equiv \sum_i a_i v_i + \sum_{i \neq j} w_{ij} v_i v_j .   (2)

Here a_i and w_{ij} are real parameters, and are called bias and weight, respectively. The structure of the Boltzmann machine is specified by a network graph (see Fig. 2 Left) in which the n units, each of which takes the value v_i, are denoted by circles, while the weights w are denoted by lines connecting the circles. Obviously, in physics P is a Boltzmann distribution of a spin configuration of a classical Ising model, after which the Boltzmann machine was named.

As (2) is quadratic in v_i, even if one varies the biases and the weights, the class of the probability distributions obtained by (2) is quite limited. Making the network deep enlarges the representation power of the architecture. Adding hidden variables h_i, we have

P(v_i) = \sum_{h_i \in \{0,1\}} \exp[-E(v_i, h_i)]   (3)

with the energy function

E(v_i, h_i) \equiv \sum_i (a_i v_i + b_i h_i) + \sum_{ij} w_{ij} v_i h_j .   (4)

The layer consisting of v_i (h_i) is called the visible (hidden) layer, and here, weights connecting units in the same layer are set to zero, so the network graph is restricted to have only limited connection lines (see Fig. 2 Center). This (4) is called a restricted Boltzmann machine.

Suppose one has measured events and obtained many sets of {v_i}. From those one can calculate a statistical probability distribution \tilde{P}(v_i). In machine learning, one trains the machine to mimic \tilde{P}(v_i). In the function P(v_i) of the Boltzmann machine, the weights and biases are the trained parameters. The difference measure between the two probability distributions, called the error function, is the relative entropy (alternatively called the Kullback-Leibler (KL) divergence in machine learning)

D_{KL}\left( \tilde{P}(v_i) \,\|\, P(v_i) \right) \equiv \sum_{\{v_i\}} \tilde{P}(v_i) \log \frac{\tilde{P}(v_i)}{P(v_i)}   (5)

and one tries to minimize it by changing the parameters. When this divergence is minimized, the machine is well-trained.
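As a concrete illustration of this review, the following minimal Python sketch evaluates the energy function (4), the model distribution (3) (with an explicit normalization, which is left implicit in the formulas above), and the KL divergence (5) by brute-force enumeration over binary configurations. The toy sizes, the random parameters, and the stand-in empirical distribution are assumptions made only for illustration.

```python
import itertools
import numpy as np

def rbm_energy(v, h, a, b, w):
    """Energy function E(v, h) of eq. (4): sum_i (a_i v_i + b_i h_i) + sum_ij w_ij v_i h_j."""
    return a @ v + b @ h + v @ w @ h

def rbm_probability(a, b, w):
    """Model distribution P(v) of eq. (3), summed over binary h and explicitly normalized."""
    n_v, n_h = w.shape
    p = np.zeros(2 ** n_v)
    for iv, v in enumerate(itertools.product([0, 1], repeat=n_v)):
        v = np.array(v)
        p[iv] = sum(np.exp(-rbm_energy(v, np.array(h), a, b, w))
                    for h in itertools.product([0, 1], repeat=n_h))
    return p / p.sum()

def kl_divergence(p_data, p_model):
    """Relative entropy D_KL(P_data || P_model) of eq. (5)."""
    mask = p_data > 0
    return np.sum(p_data[mask] * np.log(p_data[mask] / p_model[mask]))

# toy usage: 3 visible units, 2 hidden units, random parameters and a stand-in data distribution
rng = np.random.default_rng(0)
a, b, w = rng.normal(size=3), rng.normal(size=2), rng.normal(size=(3, 2))
p_data = rng.dirichlet(np.ones(8))
print(kl_divergence(p_data, rbm_probability(a, b, w)))
```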

FIG. 2: A schematic picture of Boltzmann machines. Double circles are visible units, and single circles are hidden units. Left: Boltzmann machine. Center: Restricted Boltzmann machine. Right: Deep Boltzmann machine.

The reason why the restriction in the graph of the Boltzmann machines is important is the conditional independence. In (4), when {h_i} is given, E is linear in v_i, so the probability P factorizes into a product over each unit v_i; then the training requires a lot less computational resource. Note that without losing this conditional independence we can add a term \sum_{i,j} w_i \delta_i^j h_i h_j; the Kronecker delta \delta_i^j means that this is a self-interaction within the same unit. What is not allowed for the conditional independence is a term like h_i h_j or v_i v_j with i \neq j in the same layer.

Due to the hidden variables, the representation power of the restricted Boltzmann machines is greater. It is proven that with a sufficiently large number of hidden units any probability distribution is well approximated [20], which is called a universal approximation theorem for the restricted Boltzmann machines. Adding more hidden layers can help the representation power [41], and the following is called a deep Boltzmann machine [1],

E \equiv \sum_{i,j} w^{(0)}_{ij} v_i h^{(1)}_j + \sum_{k=1}^{N-1} \left[ \sum_{i,j} w^{(k)}_{ij} h^{(k)}_i h^{(k+1)}_j \right] ,   (6)

where the index k labels the hidden layers, k = 1, 2, 3, · · · , N. The hidden variables h^{(k)}_i in the N hidden layers, taking binary values, are summed as in (3):

P(v_i) = \sum_{h^{(k)}_i} \exp\left[ -E(v_i, h^{(k)}_i) \right] .   (7)

The visible layer, consisting of units whose values are the input v_i, may be thought of as k = 0, which means v_i = h^{(0)}_i. The weights are again restricted as in the restricted Boltzmann machines, see Fig. 2 Right.

Although the number of units in each hidden layer may not be equal to each other, in this paper we consider the case of having the same number of units in each layer, for structural simplicity.

III. ADS/CFT AS A BOLTZMANN MACHINE

A. Dictionary

Let us first describe the similarity between the AdS/CFT correspondence and the deep Boltzmann machine, and construct a dictionary between the two schemes. In the AdS/CFT correspondence [7], the fundamental formula relating the boundary and the bulk is the GKP-W relation [8, 9]

Z_{QFT}[J] = \exp\left( -S_{gravity}[\phi] \right) .   (8)

This expression is for the large Nc limit of the QFT with its generating functional Z[J], while for finite Nc this expression should be replaced by [42]

Z_{QFT}[J] = \int_{\phi(z=0)=J} D\phi \, \exp\left( -S_{gravity}[\phi] \right) .   (9)

Here z is the emergent bulk coordinate, and z = 0 is the boundary of the asymptotically AdS bulk, where the boundary condition \phi(z=0) = J is put for the bulk field \phi(x, z).

The deep Boltzmann machine approximates a given probability distribution by the formula (7) with the energy function (6). The similarity between the quantum gravity version of the GKP-W relation (9) and the definition equation of the deep Boltzmann machine is obvious. The identification rules are as follows: the source function J(x) is the input value v_i of the visible layer, the bulk field \phi(x, z) is the hidden variables h^{(k)}_i, the emergent bulk coordinate z is the label k for the hidden layers, the generating functional Z[J] of the QFT is the probability distribution P(v_i), and the bulk action S[\phi] is the energy function E(v_i, h^{(k)}_i). See Table I for a summary of the correspondence. The path integral of the bulk field \phi is replaced by the summation over the hidden variables h^{(k)}_i, so in general the quantum bulk of the AdS/CFT correspondence is a deep Boltzmann machine.

The resemblance is basically the fact that the deep Boltzmann machine tries to reproduce the probability distribution whose input is the values at the visible layer, and in the AdS/CFT, in the same manner, the bulk path-integration tries to reproduce the generating functional of the boundary QFT where the input is the boundary value of the bulk field.
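To make the definition (6)-(7) concrete, here is a minimal Python sketch that evaluates the deep Boltzmann machine distribution by brute-force summation over the binary hidden layers; the layer sizes and the random weights are illustrative assumptions only, and the normalization over visible configurations is done explicitly.

```python
import itertools
import numpy as np

def dbm_energy(layers, weights):
    """Energy (6): sum_k sum_{ij} w^{(k)}_{ij} h^{(k)}_i h^{(k+1)}_j, with h^{(0)} = v."""
    return sum(layers[k] @ weights[k] @ layers[k + 1] for k in range(len(weights)))

def dbm_probability(v, weights, n_hidden):
    """Unnormalized P(v) of eq. (7): sum over all binary hidden-layer configurations."""
    hidden_configs = itertools.product(
        *[itertools.product([0, 1], repeat=n) for n in n_hidden])
    return sum(np.exp(-dbm_energy([np.array(v)] + [np.array(h) for h in hs], weights))
               for hs in hidden_configs)

# toy usage: 2 visible units and two hidden layers of 2 units each
rng = np.random.default_rng(1)
weights = [rng.normal(size=(2, 2)), rng.normal(size=(2, 2))]
z = sum(dbm_probability(v, weights, (2, 2)) for v in itertools.product([0, 1], repeat=2))
print([dbm_probability(v, weights, (2, 2)) / z for v in itertools.product([0, 1], repeat=2)])
```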

TABLE I: The dictionary between the AdS/CFT correspondence and the deep Boltzmann machine.

  AdS/CFT                          Deep Boltzmann machine
  Bulk coordinate z                Hidden layer label k
  QFT source J(x)                  Input value v_i
  Bulk field \phi(x, z)            Hidden variables h^{(k)}_i
  QFT generating functional Z[J]   Probability distribution P(v_i)
  Bulk action S[\phi]              Energy function E(v_i, h^{(k)}_i)

In order to make the probability interpretation of the QFT generating functional, we normalize it as

P_{QFT}[J] \equiv Z_{QFT}[J]/Z_0 , \qquad Z_0 \equiv \int DJ \, Z_{QFT}[J] .   (10)

Then, using the deep Boltzmann machine representation of the bulk, training of the bulk theory is possible, to reduce the error function which is given by the Kullback-Leibler divergence of the QFT partition function and the model probability of the Boltzmann machine,

\text{Error fn.} = D_{KL}\left( P_{QFT}[J] \,\|\, P(v_i) \right) .   (11)

As the deep Boltzmann machine allows arbitrary architecture for its neural network, it is naturally expected that the AdS/CFT correspondence may be included as an example of the Boltzmann machine. Below we shall demonstrate that a typical AdS/CFT model allows a deep Boltzmann machine architecture.

B. Bulk as a neural network

The simplest bulk action is for a free massive scalar field in an asymptotically AdS_{d+1} bulk geometry,

S = \int d^d x \, dz \, \frac{1}{2} \left[ a(z) (\partial_z \phi)^2 + b(z) \sum_{I=1}^{d-1} (\partial_I \phi)^2 + d(z) (\partial_\tau \phi)^2 + c(z) m^2 \phi^2 \right] .   (12)

We chose the Euclideanized signature so that the AdS/CFT correspondence can fit the scheme of Boltzmann machines. The d-th coordinate \tau \equiv x^d is the Euclideanized time coordinate. Local interaction terms such as \phi^n can be treated similarly below, but in this paper we consider only the free case.

We assumed for simplicity that the metric depends only on the bulk emergent direction z and is diagonal, and assumed also a homogeneous spacetime about x^I (I = 1, 2, · · · , d-1), so that g_{11}(z) = · · · = g_{d-1,d-1}(z). We find

a(z) = [g_{11}(z)^{d-1} g_{dd}(z)/g_{zz}(z)]^{1/2} ,   (13)
b(z) = [g_{11}(z)^{d-3} g_{dd}(z) g_{zz}(z)]^{1/2} ,   (14)
c(z) = [g_{11}(z)^{d-1} g_{dd}(z) g_{zz}(z)]^{1/2} ,   (15)
d(z) = [g_{11}(z)^{d-1} g_{zz}(z)/g_{dd}(z)]^{1/2} .   (16)

There exists a relation among them,

\left( a(z) d(z) \right)^2 = \left( c(z)/b(z) \right)^{2(d-1)} .   (17)

In the standard Poincare coordinate system, the asymptotically AdS_{d+1} geometry is

ds^2 = L^2 \, \frac{dz^2 + \sum_{\mu=1}^{d} (dx^\mu)^2}{z^2} \qquad (z \sim 0)   (18)

with the AdS radius L, so we have the condition

a \sim b \sim d \sim (L/z)^{d-1} , \qquad c \sim (L/z)^{d+1}   (19)

near the AdS boundary z \sim 0.

Let us discretize the action (12) to make it written like the energy function E of the deep Boltzmann machine.[43] First, the bulk geometry is discretized to a regular lattice whose sites are labeled by (k, i, l); the label k refers to the discretized bulk emergent direction z,

z_k \equiv k \Delta z \qquad (k = 0, 1, 2, · · ·)   (20)

where \Delta z is the lattice spacing. In the same manner, we discretize x^I and \tau \equiv x^d by the lattice spacings \Delta x and \Delta\tau, giving the labels i and l respectively, as x_{i,l}. This simplest regularization scheme replaces the integration over d^d x \, dz by a sum \sum_{k,i,l}.

The bulk field \phi(x, z) at the sites is written as

h^{(k)}_{i,l} \equiv \phi(x_{i,l}, z_k) .   (21)

Thus the bulk scalar field provides the variables in the hidden units. Naturally, we identify the label k as the label for the layers of a deep Boltzmann machine. We define our visible layer as the AdS boundary value of the scalar field, i.e. the first k = 0 component of h,

v_{i,l} \equiv h^{(0)}_{i,l} .   (22)

The z-derivative term in the bulk Lagrangian is replaced by

(\partial_z \phi)^2 = \lim_{\Delta z \to 0} \frac{\left( \phi(z_{k+1}) - \phi(z_k) \right)^2}{(\Delta z)^2} .   (23)

As for the derivative terms concerning \partial_\tau (and similarly for \partial_I), we choose

(\partial_\tau \phi)^2 = \lim_{\Delta\tau, \Delta z \to 0} \frac{\phi(x_{i,l+1}, z_k) - \phi(x_{i,l}, z_k)}{\Delta\tau} \cdot \frac{\phi(x_{i,l+1}, z_{k+1}) - \phi(x_{i,l}, z_{k+1})}{\Delta\tau} .   (24)

Note the dependence on the label k; the reason we chose this discretization will become clear below.

The background metric functions are discretized in the same manner,

a_k \equiv a(z = z_k) , \quad b_k \equiv b(z = z_k) , \quad c_k \equiv c(z = z_k) , \quad d_k \equiv d(z = z_k) .   (25)
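As a small numerical illustration of the setup above, the sketch below evaluates the coefficient functions (13)-(16) on the z-lattice (20), (25) for the pure AdS metric (18), and checks the near-boundary behavior (19). The dimension, AdS radius, lattice spacing, and the exclusion of the k = 0 site (to avoid dividing by z = 0) are assumptions made only for this check.

```python
import numpy as np

def metric_coefficients(g11, gdd, gzz, d):
    """Coefficient functions (13)-(16) of the scalar action (12) for a diagonal metric."""
    a = np.sqrt(g11 ** (d - 1) * gdd / gzz)
    b = np.sqrt(g11 ** (d - 3) * gdd * gzz)
    c = np.sqrt(g11 ** (d - 1) * gdd * gzz)
    dcoef = np.sqrt(g11 ** (d - 1) * gzz / gdd)
    return a, b, c, dcoef

# lattice z_k = k*dz of eq. (20); pure AdS_{d+1} metric (18): g_11 = g_dd = g_zz = (L/z)^2
d, L, dz, N = 4, 1.0, 0.1, 20
zk = dz * np.arange(1, N + 1)                      # k = 0 (the boundary) excluded here
g = (L / zk) ** 2
ak, bk, ck, dk = metric_coefficients(g, g, g, d)   # discretized coefficients, eq. (25)
print(np.allclose(ak, (L / zk) ** (d - 1)), np.allclose(ck, (L / zk) ** (d + 1)))  # checks (19)
```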

FIG. 3: The architecture of the deep Boltzmann machine for the AdS/CFT. The thick lines mean weights. The difference from the standard Boltzmann machines in Fig. 2 is that we allow a weight connecting the same unit, denoted as a curved line just above each unit in the figure. We omit drawing the l-direction.

Then the bulk action is written as

S = \sum_{k,i,l} \left[ \frac{a_k}{2(\Delta z)^2} \left( h^{(k+1)}_{i,l} - h^{(k)}_{i,l} \right)^2 + \frac{c_k m^2}{2} \left( h^{(k)}_{i,l} \right)^2 + \frac{b_k}{2(\Delta x)^2} \left( h^{(k)}_{i+1,l} - h^{(k)}_{i,l} \right)\left( h^{(k+1)}_{i+1,l} - h^{(k+1)}_{i,l} \right) + \frac{d_k}{2(\Delta\tau)^2} \left( h^{(k)}_{i,l+1} - h^{(k)}_{i,l} \right)\left( h^{(k+1)}_{i,l+1} - h^{(k+1)}_{i,l} \right) \right] .   (26)

This is recast to the following Boltzmann machine form:

S = E \equiv \sum_k \sum_{i,j} \sum_{l,m} \left\{ w^{(k)}_{ij,lm} h^{(k)}_{i,l} h^{(k+1)}_{j,m} + \tilde{w}^{(k)}_{ij,lm} h^{(k)}_{i,l} h^{(k)}_{j,m} \right\} ,   (27)

where the weights are given as

w^{(k)}_{ij,lm} \equiv -\frac{a_k}{(\Delta z)^2} \delta^j_i \delta^m_l + \frac{b_k}{2(\Delta x)^2} \left( 2\delta^j_i \delta^m_l - \delta^j_{i+1}\delta^m_l - \delta^{j+1}_i \delta^m_l \right) + \frac{d_k}{2(\Delta\tau)^2} \left( 2\delta^j_i \delta^m_l - \delta^j_i \delta^m_{l+1} - \delta^j_i \delta^{m+1}_l \right) ,   (28)

\tilde{w}^{(k)}_{ij,lm} \equiv \left[ \frac{a_k + a_{k-1}}{2(\Delta z)^2} + \frac{c_k}{2} m^2 \right] \delta^j_i \delta^m_l .   (29)

These weights are symmetric.

The path integral over the bulk field \phi(x^I, \tau, z) is equivalent to the integration over all the hidden variables h^{(k)}_{i,l} (k = 1, 2, · · ·), therefore the GKP-W relation (9) is written as

Z_{QFT}[J] = \sum_h \exp(-E)   (30)

where E is defined by (27). And through (22) we have

J(x_{i,l}) = v_{i,l} = h^{(0)}_{i,l} .   (31)

This is a deep Boltzmann machine representation of the AdS/CFT correspondence. See Fig. 3 for our architecture.

The background metric appears as the weights of the Boltzmann machine. As is understood from (28) and (29), the weights are not all independent. They form quite a sparse neural network. The trained variables are a(z_k), b(z_k), c(z_k) and d(z_k) (k = 0, 1, 2, · · ·), under the constraint (17). The bulk scalar field appears as the hidden variables to be summed, and the boundary value of the bulk scalar field is identified with the visible units.

Note that, because we chose the discretization scheme (24), the weights \tilde{w} connecting the units in the same layer are completely diagonal (of the form \delta^j_i \delta^m_l), as seen in (29). As explained earlier, this does not violate the conditional independence of the units in the same layer, which is important for the training of the Boltzmann machine.

C. Discretized values of the bulk field

Standard Boltzmann machines allow binary values for the variables h, while the AdS/CFT correspondence requires continuous values for the bulk field \phi(x, z). To bridge these two, we need to discretize also the field value space. Suppose that typical values necessary for training the Boltzmann machines are in the range |\phi| < A. Then a natural discretization of the values is given as

\phi = A \, \frac{u}{u_0}   (32)

where u = -u_0, -u_0+1, · · · , 0, · · · , u_0 - 1. In this discretization, we have 2u_0 different values for \phi to take. To bring them to a set of binary-valued variables, we introduce the binary variables s_{(\tilde{u})} \in \{0, 1\} as

\phi = A \left( -1 + \frac{2\tilde{u}}{u_0} + \frac{1}{u_0} s_{(\tilde{u})} \right) .   (33)

Here we divided the single entry \phi into u_0 entries s_{(\tilde{u})} with \tilde{u} = 0, 1, · · · , u_0 - 1. In effect, each unit referring to h^{(k)}_{i,l} in the Boltzmann machine is split into u_0 different units s. All of those split units need to share the same weight for every original connection with a different unit h. In this manner, binary-valued Boltzmann machines can be constructed from the continuous-valued Boltzmann machines.

D. The deepest layer is the end of space

In the AdS/CFT correspondence, the IR end of the geometry is important, as it directly reflects the properties of allowed spectra of the QFT. Popular holographic geometries are confining geometries and black holes, and they have specific boundary conditions at the IR end of the geometry. Except for the cases of conformal field theories as the boundary QFT, the bulk geometry naturally terminates at some IR scale z = z_{IR}. In the terminology of the deep Boltzmann machines, this means that the layers terminate at k = N with N \equiv z_{IR}/\Delta z. Let us rephrase those geometric boundary conditions as conditions on the treatment around the deepest layer k = N of the deep Boltzmann machine.

First of all, the layers actually terminate at k = N, and there is no additional layer at k = N + 1. In terms of the weights, this condition means w^{(N)}_{ij,lm} = 0, which is

a(z_N) = b(z_N) = d(z_N) = 0 .   (34)

The confining geometry refers to the Dirichlet boundary condition for the bulk field \phi, as it simply means that the bulk field \phi needs to vanish in the spacetime region specified by z > z_{IR}. This location is called a "hard wall" in holography. In general, the condition of the hard wall means that the metric function which the scalar field feels has a special behavior there. In fact, to impose \phi(z_{IR}) = 0 we just need that the mass term c(z) m^2 diverges at z = z_{IR}. So, in this case we can rephrase the Dirichlet boundary condition in terms of the metric function:

c(z_N) = \infty .   (35)

Next, consider the black hole horizon condition instead. At the black hole horizon the zz component of the metric diverges, while the temporal dd component vanishes. Thus a(z) = 0 and d(z) diverges, while b(z), c(z), and a(z)d(z) remain finite and nonzero. Therefore, the black hole boundary condition is

a(z_{N-1}) \propto \Delta z , \qquad d(z_{N-1}) \propto \frac{1}{\Delta z} ,   (36)

with infinitesimally small \Delta z.

The confining condition (35) and the horizon condition (36) are examples of more general constraints. We can impose other boundary conditions if they are consistent with the large Nc limit of the AdS/CFT, as we shall study in the next section.

For the pure AdS geometry, there is no IR end of the space, and the z direction is extended to z = \infty. So, to host all possible asymptotically AdS spacetimes in our Boltzmann machine architecture, we need to prepare infinitely deep Boltzmann machines.[44]
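For concreteness, the following Python sketch assembles the sparse inter-layer and intra-layer weight matrices of (28)-(29). It is a simplification, not the full construction: only one spatial index i is kept (the \tau/l direction is suppressed, so the d_k terms are dropped), a periodic boundary in i is assumed, and the numerical values of the metric coefficients are placeholders.

```python
import numpy as np

def layer_weights(a_k, a_km1, b_k, c_k, m2, dz, dx, n):
    """Inter-layer weights w^{(k)} and intra-layer weights w~^{(k)} of eqs. (28)-(29),
    restricted to a single spatial index i; the tau/l direction is suppressed."""
    eye = np.eye(n)
    shift = np.roll(eye, 1, axis=1)          # nonzero when j = i+1 (periodic wrap, an assumption)
    w = (-a_k / dz ** 2) * eye \
        + b_k / (2 * dx ** 2) * (2 * eye - shift - shift.T)
    w_tilde = ((a_k + a_km1) / (2 * dz ** 2) + 0.5 * c_k * m2) * eye
    return w, w_tilde

# toy usage on a 5-site slice with placeholder metric coefficients
w, wt = layer_weights(a_k=1.0, a_km1=1.0, b_k=1.0, c_k=1.0, m2=0.25, dz=0.1, dx=0.2, n=5)
print(np.count_nonzero(w), np.count_nonzero(wt))   # the weight matrices are sparse
```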

FIG. 4: The simplified deep Boltzmann machine in which the i- and l-dependence is ignored.

Therefore, restricted Boltzmann machines with continuous hidden variables do not allow the saddle point approximation, contrary to the physical intuition.

Adding more hidden layers can resolve the issue. Suppose we add another hidden layer to the restricted Boltzmann machine,

E \equiv \sum_i \left( a_i v_i + b_i h^{(1)}_i + c_i h^{(2)}_i \right) + \sum_{ij} \left( w^{(0)}_{ij} v_i h^{(1)}_j + w^{(1)}_{ij} h^{(1)}_i h^{(2)}_j \right) .   (38)

Then the saddle point equations are

0 = \frac{\delta E}{\delta h^{(1)}_i} = b_i + \sum_j \left( w^{(0)}_{ji} v_j + w^{(1)}_{ij} h^{(2)}_j \right) ,   (39)

0 = \frac{\delta E}{\delta h^{(2)}_i} = c_i + \sum_j w^{(1)}_{ji} h^{(1)}_j .   (40)

The first equation determines h^{(2)}_j for any given training value of v_i, so it gives a consistent saddle point equation. The second equation simply shows that the middle layer variable h^{(1)} takes a fixed value -[[w^{(1)}]^T]^{-1} c. So, substituting these into the original energy function (38), we obtain the saddle point approximation of the restricted Boltzmann machine,

E_{on\text{-}shell} = E_0 + \sum_i a^{eff}_i v_i ,   (41)

E_0 \equiv - \sum_{ij} (w^{(1)})^{-1}_{ij} c_i b_j ,   (42)

a^{eff}_i \equiv a_i - \sum_j c_j \left[ (w^{(1)})^{-1} (w^{(0)})^T \right]_{ji} .   (43)

Here it should be noted that the obtained energy function is linear in v, so it does not have the form of the standard Boltzmann machines whose energy functions are bilinear in v. The reason for the linearity is that the saddle point equations for the k-odd and the k-even layers decouple from each other.

Instead of adding more layers, we can introduce a self-coupling \delta^j_i h_i h_j as described earlier in Sec. II. For the case with just a single hidden layer with a uniform self-coupling weight, we have

E \equiv \sum_i (a_i v_i + b_i h_i) + \sum_{ij} \left( w_{ij} v_i h_j + c \, \delta_{ij} h_i h_j \right) .   (44)

The saddle point equation is

0 = \frac{\delta E}{\delta h_i} = b_i + 2 c h_i + \sum_j w_{ji} v_j ,   (45)

which determines the value of the hidden unit h_i in terms of the input v_i, so it gives a consistent solution. The on-shell value of the energy function is

E_{on\text{-}shell} = E_0 + \sum_i a^{eff}_i v_i + \sum_{ij} w^{eff}_{ij} v_i v_j ,   (46)

where

E_0 \equiv - \frac{1}{4c} \sum_i b_i b_i ,   (47)

a^{eff}_i \equiv a_i - \frac{1}{2c} \sum_j w_{ij} b_j ,   (48)

w^{eff}_{ij} \equiv - \frac{1}{4c} \left( w w^T \right)_{ij} .   (49)

Thus, as is expected, the effective energy function is bilinear in v_i.

Keeping these results in mind, we consider the deep Boltzmann machine which we defined in the previous section. The saddle point condition is

0 = \frac{\delta E}{\delta h^{(k)}_{i,l}} = \sum_{j,m} \left[ w^{(k-1)}_{ji,ml} h^{(k-1)}_{j,m} + w^{(k)}_{ij,lm} h^{(k+1)}_{j,m} + \tilde{w}^{(k)}_{ij,lm} h^{(k)}_{j,m} \right] .   (50)

So, the variables at the layer k are related to those of the layer k+1 and of the layer k-1. The equation has both the properties of the cases (38) and (44).

Let us study the consistency with the IR boundary condition, the deepest layer. For simplicity, to look at the consistency, we consider the case with a homogeneous \phi in x^I and x^d, which is equivalent to ignoring the terms with b(z) and those with d(z). The architecture of the deep Boltzmann machine is shown in Fig. 4. At the deepest layer k = N, the saddle point equation gives

h^{(N-1)} \sim h^{(N)}   (51)

where the symbol \sim denotes a linear relation whose coefficients are given by weights. Similarly, the saddle point equation at k = N-1 gives

h^{(N-1)} + h^{(N-2)} \sim h^{(N)}   (52)

where we omit the coefficients. Then altogether, they give h^{(N-2)} \sim h^{(N-1)}. Repeating this backwards in layers, we finally obtain

h^{(0)} \sim h^{(1)}   (53)

which can also be written as

v \sim \frac{h^{(1)} - h^{(0)}}{\Delta z} .   (54)

In the continuum limit, this relation is

\phi(z=0) \sim \partial_z \phi(z=0) .   (55)

In the boundary QFT of the AdS/CFT correspondence, this relation is equivalent to the linear response relation [22],

J \sim \langle O \rangle .   (56)

Thus, the deep Boltzmann machine is found to be consistent with the standard analysis of the classical bulk side of the AdS/CFT correspondence.

It is intriguing that the saddle point approximation provides explicitly the relation between the variables at the adjacent layers. This relation is expected for neural networks of the feed-forward type. So, we find that the saddle point approximation of the deep Boltzmann machine provides a feed-forward architecture. A subtle difference from the standard feed-forward is that the linear relation starts at the deepest layer, not at the visible layer. In fact, looking at only the first hidden layer we find that the relation is just like (52), so it is not a linear relation between just the adjacent two layers. In fact, the scalar field equation is a second order differential equation, so there is a backward wave in addition to the forward wave. These two waves satisfy the consistency condition at the deepest layer. Therefore, the saddle point approximation provides a "folded feed-forward" structure. Unfolding the folded structure is possible, and in Appendix A we provide an architecture of the unfolded type, which looks like an autoencoder.
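The backward recursion (50)-(54) can be made explicit with a minimal Python sketch of the homogeneous case of Fig. 4: one unit per layer, the b and d terms dropped, and the saddle point equation solved from the deepest layer toward the boundary. The metric-coefficient profiles used below are stand-ins chosen only to make the example run; they are not trained values.

```python
import numpy as np

# One unit per layer (homogeneous phi), layers k = 0..N, as in Fig. 4.
N, dz, m2 = 30, 0.1, 0.25
z = dz * np.arange(N + 1)
a = 1.0 / np.maximum(z, dz) ** 3                         # stand-in a(z_k) ~ (L/z)^{d-1}, capped at z=0
c = 1.0 / np.maximum(z, dz) ** 5                         # stand-in c(z_k) ~ (L/z)^{d+1}
w = -a / dz ** 2                                         # inter-layer weight, eq. (28) with b = d = 0
wt = (a + np.roll(a, 1)) / (2 * dz ** 2) + 0.5 * c * m2  # intra-layer weight, eq. (29)

h = np.zeros(N + 1)
h[N] = 1.0
h[N - 1] = -wt[N] * h[N] / w[N - 1]          # deepest-layer relation, eq. (51)
for k in range(N - 1, 0, -1):                # saddle point eq. (50), solved backwards in k
    h[k - 1] = -(w[k] * h[k + 1] + wt[k] * h[k]) / w[k - 1]

print(h[0], (h[1] - h[0]) / dz)              # boundary value and derivative, cf. (53)-(54)
```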

V. REGULARIZATION AND EINSTEIN ACTION

In this section we study the condition for the trained weights to be interpreted as a bulk spacetime. The training should be performed in the following manner. First, prepare a quantum field theory for which one wants to know whether a gravity dual exists or not. Then calculate Z_{QFT}[J] and its probability interpretation P_{QFT}[J] by (10) [45]. Prepare the deep Boltzmann architecture given in Sec. III B, whose error function is the KL divergence (11), and train it by updating the weights to reduce the KL divergence. Once the KL divergence decreases to enough accuracy, we say that the bulk is learned.

The metric function is encoded in the sparse weights w, \tilde{w} in the deep Boltzmann machine, given in (28) and (29). Although it can be easily reconstructed, there is one issue: generically, the training ends up with various different sets of weights, because the error function may have many almost degenerate local minima. Each of the local minima can approximate P_{QFT} very well — which is related to the notion of "generalization" in machine learning.

We are looking for a gravity dual. For the trained Boltzmann machine weights to be interpreted as a bulk spacetime, we need a criterion to pick up a certain set of the weights among the degenerate local minima. The criterion is simple: use an Einstein action as a regularization of the deep Boltzmann machine.[46]

Basically, the generic trained weights take quite scattered values, and they are not a smooth function of z in the continuum limit \Delta z \to 0. For those configurations of weights, the Einstein action takes a large value. On the other hand, smooth metric functions of z, and so a set of weights whose values do not drastically vary as one sweeps the depth of the layers, have lower values of the Einstein action. Therefore, the Einstein action can be used for selecting a proper set of weights which has a bulk spacetime interpretation.

A proposed regularization term is a discretization of the Einstein action with a negative cosmological term,

E_{reg} = \int d^d x \, dz \, \sqrt{\det g} \left[ R + \frac{d(d-1)}{L^2} \right] .   (57)

To obtain the explicit discretization, for simplicity we consider a conformal spacetime

g_{\mu\mu} = g_{zz} = (1 + \alpha(z)) \, \frac{L^2}{z^2} .   (58)

When \alpha(z) = 0, the metric reduces to the pure AdS metric. So the asymptotically AdS spacetimes allow only \lim_{z \to 0} \alpha(z) = 0. In terms of the previous a(z), b(z), c(z) and d(z), this ansatz leads to

a = b = d = (L/z)^{d-1} (1 + \alpha(z))^{(d-1)/2} , \qquad c = (L/z)^{d+1} (1 + \alpha(z))^{(d+1)/2} .   (59)

So, the discretization of the z direction as z = z_k provides a lattice on which \alpha_k \equiv \alpha(z_k) is defined for k = 0, 1, 2, · · · . For d = 3 as an example, the Einstein action becomes

\frac{E_{reg}}{3L^2} = \int d^d x \, dz \left[ \frac{-2}{z^4} + \frac{2}{z^2} \alpha(z)^2 + \frac{1}{2z^2} \frac{(\alpha'(z))^2}{1 + \alpha(z)} \right] .   (60)

Discretizing this action, we obtain the regularization term

E_{reg} = c \sum_{k=1}^{\infty} \left[ \frac{2}{k^2} \alpha_k^2 + \frac{1}{2k^2} \frac{(\alpha_{k+1} - \alpha_k)^2}{1 + \alpha_k} \right] .   (61)

Here c is a positive constant, and we ignored an additive constant term which is irrelevant for the training. The additive constant comes from the first term in (60), that is, the cosmological constant for the pure AdS spacetime.

It is easy to see that this regularization in fact favors a smoother distribution of the weights, due to the second term in (61). Using this regularization, during the training, the Boltzmann machine tries to minimize also the Einstein action at the same time. When the error decreases to a satisfactorily small value, the weights can be interpreted as an Einstein spacetime.[47]

When E_{reg} = 0, we have a pure AdS geometry. In the generic AdS/CFT correspondence, the bulk action can take various forms; it may have more supergravity fields, and it may suffer from higher derivative terms coming from quantum gravity corrections or stringy corrections. Therefore, in general, the regularization needs to allow more generic actions, such as

E_{reg} = \int d^d x \, dz \, \sqrt{\det g} \left( c_0 + c_1 R + c_2 R^2 + c_3 R^3 + · · · \right) ,   (62)

or more with tensorial structures. This can be discretized by the same method, and we obtain a more general Einstein regularization. Note that here in the expression the coefficients c_i are trained variables. In general we do not know the bulk gravity action, so we need to allow a general action. When we say that the bulk is a spacetime, it means that it reduces the value of this general action. When the powers of the Riemann tensors stop at some fixed value, a low energy effective spacetime interpretation is possible.
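The Tikhonov-style role of the regularizer (61) can be illustrated with a short Python sketch that evaluates it on a smooth and on a scattered profile \alpha_k; the constant c, the truncation of the infinite sum, the handling of the last difference term, and the two test profiles are all assumptions made only for illustration.

```python
import numpy as np

def einstein_regularizer(alpha, c=1.0):
    """Discretized Einstein-action regularizer of eq. (61) for a profile alpha_k, k = 1..K.
    The alpha_{K+1} entry needed by the difference term is set equal to alpha_K (an assumption)."""
    k = np.arange(1, len(alpha) + 1)
    alpha_next = np.append(alpha[1:], alpha[-1])
    return c * np.sum(2.0 / k ** 2 * alpha ** 2
                      + (alpha_next - alpha) ** 2 / (2.0 * k ** 2 * (1.0 + alpha)))

rng = np.random.default_rng(2)
smooth = 0.3 * np.exp(-0.1 * np.arange(1, 41))     # a smooth deviation from pure AdS
rough = smooth + 0.2 * rng.normal(size=40)         # a scattered set of trained weights
print(einstein_regularizer(smooth), einstein_regularizer(rough))  # the smooth profile typically costs less
```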
VI. SUMMARY AND DISCUSSIONS

In this paper, we have shown that the standard AdS/CFT correspondence can be regarded as a deep Boltzmann machine. The neural network architecture, once properly defined, is interpreted as a bulk spacetime geometry. The network depth is the emergent direction in the bulk, and the network weights are metric components. Hidden variables correspond to discretized fields in the bulk, and the probability distribution given by the Boltzmann machine is the generating functional of the QFT dual to the bulk gravity.

For the mapping we used a bulk scalar field theory in curved geometries. The IR boundary conditions of the bulk, such as the black hole horizon or the hard wall, can be implemented in the weight behavior around the deepest layer of the Boltzmann machine. The large Nc limit of the AdS/CFT is argued in the scheme of the Boltzmann machine, consistently giving an organized set of linear equations among weights.

Among many degenerate vacua of the deep Boltzmann machine, a set of weights which allows a spacetime interpretation is selected by a regularization in the error function in addition to the KL divergence. We have introduced a natural regularization based on the Einstein action and its generalization.

Our study provides a relation between the AdS/CFT correspondence and the deep Boltzmann machine. In view of the history of quantum gravity, introducing discretization of the spacetime is natural, and we hope that more concepts from Boltzmann machines and deep learning can be imported to quantum gravity, so that it may shed light on the mystery of the bulk emergence in the holographic principle.

Several clarifications and comments are in order. First, the discretization of the spacetime used in this paper favors a certain coordinate system, and thus the general coordinate transformation of the gravity theory is not seen in our framework. Furthermore, even the isometry transformation, which is the scale transformation in the QFT, is difficult to implement in our formulation. A hyperbolic network (as used in [14]) is better suited to be consistent with the isometry, but it is difficult to find a continuum limit. It would be interesting to seek a more desirable discretization scheme. In fact, a well-known approach for quantum gravity uses dynamical triangulation [12] in which the connection bond topology (which is axon topology in neural networks) is dynamical. On the other hand, standard neural networks have a fixed architecture while the weights are variable. We may need a refined discretization architecture, to have a more unified view of quantum gravity and deep learning. Generic quantum gravity may include even non-geometric landscapes for which machine learning has been applied [24–30], and a possible relation to our approach of having a neural network as a spacetime would be interesting.

Second, we have introduced the saddle point approximation in the evaluation of the deep Boltzmann machine, based on the standard large Nc argument of the AdS/CFT correspondence. At the limit, linear relations among weights at layers close to each other are derived, and the information at the visible layer is processed through the bulk, as if it propagates. This means that the saddle point of the deep Boltzmann machine brings it to a folded feed-forward type deep neural network. The AdS/CFT interpretation of a feed-forward neural network was studied in [15, 16], and the trained weights exhibit an interesting physical picture.

On the other hand, at finite Nc (beyond the classical limit of the bulk), the relation between the AdS/CFT and the deep Boltzmann machine is a little ambiguous — in the bulk, only the scalar field (∼ hidden variables h) is path-integrated while the metric (∼ weights w) is not. This situation can be interpreted as the scalar field being that of a probe brane in the bulk. What is the metric path-integral in the deep Boltzmann machine? It is a statistical summation of the network weights, which has been studied as statistical neural networks [31]. It would be interesting to see more connections between the holographic principle and the statistical neural networks. In fact, in [32] a conformal transformation of data space was found at a layer-to-layer propagation, and it may allow a holographic interpretation.

Finally, we make a comment on a relation to quantum information. It is known that the AdS/CFT correspondence has a close relation to quantum information; in particular, AdS/CFT toy models based on tensor networks have been studied. The structure of MERA [10] has a bulk hyperbolic space interpretation, and tensor networks using perfect tensors [11] provide a quantum correspondence between the bulk and the boundary in the AdS/CFT. Since it is known [23] that any quantum code allows a deep Boltzmann machine interpretation, the AdS tensor networks can be mapped to deep Boltzmann machines. In general the obtained machine architecture tends to be complicated (since the number of quantum gates necessary to reproduce an A-leg tensor is ∼ O(2^{2A})), so a continuum limit to have a continuum field theory in the bulk, which we studied in this paper, is difficult to take. Further studies for bridging the holographic principle and deep learning are desired.

Note added: while this manuscript was being prepared, we noticed that Ref. [33], which interprets the bulk as a deep neural network, was submitted to arXiv recently.

Acknowledgments

The author would like to thank Shun-ichi Amari, Masatoshi Imada, Akinori Tanaka, Akio Tomiya and Yi-Zhuang You for valuable discussions. The work of K.H. was supported in part by MEXT/JSPS KAKENHI Grants No. JP15H03658, No. JP15K13483 and No. JP17H06462.

Appendix A: Holographic autoencoder

In this appendix we describe an implementation of the AdS/CFT correspondence into a feed-forward deep neural network of an autoencoder-like architecture. We discuss only the classical limit of the bulk (the large Nc limit) to make sure that the feed-forward structure is clear in the AdS/CFT.

In [15, 16], a deep neural network employs J and \langle O \rangle as the input at the initial layer while the black hole horizon boundary condition is used as an output at the final layer. Another natural implementation is to use J (the non-normalizable mode of the AdS scalar field) as an input and \langle O \rangle (the normalizable mode) as the output data. In this case the data propagates from the AdS boundary toward the black hole horizon first, then it bounces with the boundary condition, then propagates back to the AdS boundary again. Therefore the neural network is of an hourglass type, and is naturally interpreted as an autoencoder in machine learning, see Fig. 5. The neural network looks similar to the two-sided black hole geometry which is often used in finite temperature holography [34–37].

The important point for the construction is to use the holographic renormalization group [38] to divide the second-order differential equation in z into a set of two first-order equations (for the non-normalizable and normalizable modes). Generic autoencoders have two important features: their weights are left-right symmetric, and they reduce the dimensions of the data space at the neck of the network. In our holographic autoencoder, the left and right are governed by the same metric, and the convolution near the neck is red-shifted so that the weights effectively reduce at the neck.

Here we present details of the construction of the holographic autoencoder. First, we explain how the second-order differential equation of the scalar field in the bulk can be equivalently replaced by a set of two first-order differential equations. This is important for the implementation of the scalar system in the form of a deep neural network, because typically neural networks have the structure of inter-layer propagation which is naturally interpreted as a first order differential equation when the continuum limit of the layers is taken.[48]

The decoupling between the non-normalizable and the normalizable modes in the AdS/CFT correspondence with the bulk scalar field \phi works only for the free case, when the interaction vanishes, V[\phi] = 0. This is possible at the linear response level of the AdS/CFT correspondence. In this appendix we use the coordinate system where the AdS radial coordinate \eta is a proper coordinate,

ds^2 = -f(\eta) dt^2 + d\eta^2 + g(\eta) (dx_1^2 + · · · + dx_{d-1}^2) ,   (A1)

and define h(\eta) \equiv \partial_\eta \log \sqrt{f(\eta) g(\eta)^{d-1}}. Then the free stationary scalar field equation is

\left[ \partial_\eta^2 + h(\eta) \partial_\eta + (\Box - m^2) \right] \phi = 0 ,   (A2)

where \Box \equiv g(\eta)^{-2} \sum_{i=1}^{d-1} \partial_i^2 is a covariant Laplacian. As opposed to the case in [15, 16], here we include the spatial (x) dependence of the external field and the response. According to the holographic renormalization group [38], the equation (A2) can be rewritten as

[\partial_\eta - b][\partial_\eta - a] \phi = 0 ,   (A3)

with

b(\eta, -\Box) + a(\eta, -\Box) = -h(\eta) ,   (A4)

b(\eta, -\Box) \, a(\eta, -\Box) - \partial_\eta a(\eta, -\Box) = \Box - m^2 .   (A5)

These constraint equations are simply

a^2 + h(\eta) a + \partial_\eta a = -\Box + m^2 .   (A6)

This generically allows two solutions a(-\Box) \equiv f_\pm(-\Box), and with each of them, the scalar field equation reduces to a first order differential equation in \eta,

[\partial_\eta - f_\pm(\eta, -\Box)] \phi = 0 .   (A7)
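To illustrate the constraint (A6), the following Python sketch integrates its zero-momentum limit (\Box \to 0) with a simple Euler step for one candidate branch; the profile h(\eta) ∼ 1/\eta, the mass, the initial condition, and the integration range are assumptions made only for illustration, not a derivation of f_\pm for a specific holographic background.

```python
import numpy as np

def solve_branch(eta, h, m2, a0):
    """Euler integration of the zero-momentum limit of eq. (A6): da/deta = m^2 - a^2 - h(eta)*a."""
    a = np.empty_like(eta)
    a[0] = a0
    for n in range(len(eta) - 1):
        a[n + 1] = a[n] + (eta[n + 1] - eta[n]) * (m2 - a[n] ** 2 - h(eta[n]) * a[n])
    return a

# stand-in near-horizon profile h(eta) ~ 1/eta (used in the text); m^2 and a0 chosen for illustration
eta = np.linspace(0.05, 8.0, 2000)
a = solve_branch(eta, lambda x: 1.0 / x, m2=0.25, a0=0.0)
print(a[5] / eta[5], a[-1])   # near the horizon a grows roughly like eta*m^2/2, cf. (A14)
```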

FIG. 5: Holographic autoencoder. The depth of the lines shows the average weights, which decrease toward the black hole horizon (the neck part of the neural network).

These two equations with f_\pm govern the non-normalizable and the normalizable modes, respectively. Therefore, once a spacetime bulk metric is given, we find the two functions f_\pm(-\Box), and use them to define the neural network by discretizing the \eta direction as \eta = n\Delta\eta. Noting that the discretization of \eta gives

\partial_\eta \phi(\eta, x) = \frac{\phi(\eta + \Delta\eta) - \phi(\eta)}{\Delta\eta} ,   (A8)

and that the discretized spatial dependence is interpreted as a convolution in the neural network,

\partial_i^2 \phi(x_i) = \frac{\phi(x_i + \Delta x) - 2\phi(x_i) + \phi(x_i - \Delta x)}{(\Delta x)^2} ,   (A9)

we find that the neural network is defined by

\phi^{(n+1)}(x_i) = W^{(n)}_\pm \phi^{(n)}(x_i)   (A10)

where the network weights are

W^{(n)}_\pm = 1 + \Delta\eta \, f_\pm(\eta^{(n)}, -\Box) .   (A11)

Note that we also discretize the (d-1)-dimensional space of the boundary QFT, as in (A9), where the covariant Laplacian \Box can be identified as a convolution in the neural network. Generically, spatial derivatives in field equations are identified as a combination of weights connecting nearby units. The locality of the bulk field theory is a constraint on the weights of the neural networks. In this way, we can always include the spatial dependence of the external field and the response, as a convolutional neural network. So, (A11) defines a convolutional neural network equivalent to (A2), with a trivial activation function.

The left hand side of the neural network is governed by the propagation weight (A11) for the non-normalizable mode. The input data J(x) is placed at the initial layer \eta = \eta_{ini} \sim \infty. It propagates with W^{(n)}_+ toward the black hole horizon \eta = 0, and is transformed to a data \phi_+(x, \eta = 0). At the black hole horizon \eta = 0, we need to impose the boundary condition \partial_\eta \phi = 0. Generically \phi_+(x, \eta = 0) does not satisfy it, so we need to complement it as

\partial_\eta \left[ \phi_+(x, \eta) + \phi_-(x, \eta) \right]_{\eta=0} = 0 .   (A12)

This defines the initial condition for the right hand side of the neural network, \phi_-(x, \eta), which propagates toward \eta = \infty with W^{(n)}_-. Then we identify the output \phi_-(x, \eta = \infty) as \langle O(x) \rangle. We call this whole network, shown in Fig. 5, a holographic autoencoder.

In reality, for the training we may focus on a slowly varying external field and use the low momentum expansion f_\pm = c_\pm(\eta) + d_\pm(\eta)\Box + · · ·, then the weights are given as

W^{(n)}_\pm = 1 + \Delta\eta \left( c_\pm(\eta^{(n)}) + d_\pm(\eta^{(n)}) \Box \right) .   (A13)

We train the coefficient functions c_\pm and d_\pm with the constraint that both consistently solve (A6).

Using the horizon behavior h(\eta) \sim 1/\eta and g(\eta) \sim const., we find that (A6) has a universal solution

c_\pm \sim \eta \, m^2/2 , \qquad d_\pm \sim -\eta/2 .   (A14)

This means that effectively the weight W near the black hole horizon vanishes (except for the trivial "1+" part), due to the red shift factor via h(\eta). Therefore, the effective dimensions of the data space around the central part of the holographic autoencoder decrease, which is suitable for the name "autoencoder" usually used in machine learning.
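As a final illustration, the following Python sketch runs the holographic autoencoder of Fig. 5 at zero momentum: the source propagates from the boundary layer to the horizon with the weights (A11)/(A13), the horizon condition (A12) fixes the turning point, and the response is read off at the boundary. The weight profiles are stand-ins following only the universal horizon behavior (A14); with the two branches taken identical, as here, the output trivially reduces to minus the input, while realistic f_\pm would differ away from the horizon.

```python
import numpy as np

# A sketch of the holographic autoencoder of Fig. 5 at zero momentum (box -> 0).
n_layers, d_eta, m2 = 80, 0.1, 0.25
eta = d_eta * np.arange(n_layers + 1)        # eta = 0 is the horizon, the last layer is near the boundary
f_plus = eta * m2 / 2.0                      # stand-in weight profile, cf. the universal behavior (A14)
f_minus = eta * m2 / 2.0                     # identical stand-in for the out-going branch (an assumption)
W_plus = 1.0 + d_eta * f_plus                # propagation weights, eq. (A11)/(A13)
W_minus = 1.0 + d_eta * f_minus

# encoder half: the source J propagates from the boundary layer down to the horizon
phi_plus = np.empty(n_layers + 1)
phi_plus[n_layers] = 1.3                     # boundary input J
for n in range(n_layers, 0, -1):
    phi_plus[n - 1] = phi_plus[n] / W_plus[n - 1]     # inverse of phi^{(n)} = W^{(n-1)} phi^{(n-1)}

# neck: horizon condition (A12), f_+ phi_+ + f_- phi_- = 0; with equal f_± near eta = 0 this is phi_- = -phi_+
phi_minus = np.empty(n_layers + 1)
phi_minus[0] = -phi_plus[0]

# decoder half: propagate back out to the boundary and read off the response
for n in range(n_layers):
    phi_minus[n + 1] = W_minus[n] * phi_minus[n]
print(phi_minus[n_layers])                   # interpreted as <O>, up to normalization
```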

[1] R. Salakhutdinov and G. Hinton, "Deep Boltzmann machines," Proceedings of the International Conference on Artificial Intelligence and Statistics, Vol. 12, 448 (2009).
[2] G. E. Hinton and R. R. Salakhutdinov, "Reducing the dimensionality of data with neural networks," Science 313, 504 (2006).
[3] Y. Bengio and Y. LeCun, "Scaling learning algorithms towards AI," Large-Scale Kernel Machines 34 (2007).
[4] Y. LeCun, Y. Bengio and G. Hinton, "Deep learning," Nature 521, 436 (2015).
[5] D. H. Ackley, G. E. Hinton and T. J. Sejnowski, "A Learning Algorithm for Boltzmann Machines," Cognitive Science 9, 147-169 (1985).
[6] R. Salakhutdinov and H. Larochelle, "Efficient learning of deep Boltzmann machines," Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, 693 (2010).
[7] J. M. Maldacena, "The Large N limit of superconformal field theories and supergravity," Int. J. Theor. Phys. 38, 1113 (1999) [Adv. Theor. Math. Phys. 2, 231 (1998)] [hep-th/9711200].
[8] S. S. Gubser, I. R. Klebanov and A. M. Polyakov, "Gauge theory correlators from noncritical string theory," Phys. Lett. B 428, 105 (1998) [hep-th/9802109].
[9] E. Witten, "Anti-de Sitter space and holography," Adv. Theor. Math. Phys. 2, 253 (1998) [hep-th/9802150].
[10] B. Swingle, "Entanglement Renormalization and Holography," Phys. Rev. D 86, 065007 (2012) [arXiv:0905.1317 [cond-mat.str-el]].
[11] F. Pastawski, B. Yoshida, D. Harlow and J. Preskill, "Holographic quantum error-correcting codes: Toy models for the bulk/boundary correspondence," JHEP 1506, 149 (2015) [arXiv:1503.06237 [hep-th]].
[12] T. Regge, "General Relativity Without Coordinates," Nuovo Cim. 19, 558 (1961).
[13] J. Ambjorn and R. Loll, Nucl. Phys. B 536, 407 (1998) [hep-th/9805108].
[14] Y. Z. You, Z. Yang and X. L. Qi, "Machine Learning Spatial Geometry from Entanglement Features," Phys. Rev. B 97, no. 4, 045153 (2018) [arXiv:1709.01223 [cond-mat.dis-nn]].
[15] K. Hashimoto, S. Sugishita, A. Tanaka and A. Tomiya, "Deep learning and the AdS/CFT correspondence," Phys. Rev. D 98, no. 4, 046019 (2018) [arXiv:1802.08313 [hep-th]].
[16] K. Hashimoto, S. Sugishita, A. Tanaka and A. Tomiya, "Deep Learning and Holographic QCD," Phys. Rev. D 98, no. 10, 106014 (2018) [arXiv:1809.10536 [hep-th]].
[17] W. C. Gan and F. W. Shu, "Holography as deep learning," Int. J. Mod. Phys. D 26, no. 12, 1743020 (2017) [arXiv:1705.05750 [gr-qc]].
[18] E. Howard, "Holographic Renormalization with Machine learning," arXiv:1803.11056 [physics.gen-ph].
[19] J. W. Lee, "Quantum fields as deep learning," arXiv:1708.07408 [physics.gen-ph].
[20] N. Le Roux and Y. Bengio, "Representational power of restricted Boltzmann machines and deep belief networks," Neural Computation 20.6, 1631 (2008).
[21] H. Abarbanel, P. Rozdeba and S. Shirman, "Machine Learning, Deepest Learning: Statistical Data Assimilation Problems," arXiv:1707.01415 [cs.AI].
[22] I. R. Klebanov and E. Witten, "AdS/CFT correspondence and symmetry breaking," Nucl. Phys. B 556, 89 (1999) [hep-th/9905104].
[23] X. Gao and L.-M. Duan, "Efficient representation of quantum many-body states with deep neural networks," Nature Communications 8.1, 662 (2017) [arXiv:1701.05039 [cond-mat.dis-nn]].
[24] Y. H. He, "Deep-Learning the Landscape," arXiv:1706.02714 [hep-th].
[25] Y. H. He, "Machine-learning the string landscape," Phys. Lett. B 774, 564 (2017).
[26] J. Liu, "Artificial Neural Network in Cosmic Landscape," arXiv:1707.02800 [hep-th].
[27] J. Carifio, J. Halverson, D. Krioukov and B. D. Nelson, "Machine Learning in the String Landscape," JHEP 1709, 157 (2017) [arXiv:1707.00655 [hep-th]].
[28] F. Ruehle, "Evolving neural networks with genetic algorithms to study the String Landscape," JHEP 1708, 038 (2017) [arXiv:1706.07024 [hep-th]].
[29] A. E. Faraggi, J. Rizos and H. Sonmez, "Classification of Standard-like Heterotic-String Vacua," arXiv:1709.08229 [hep-th].
[30] J. Carifio, W. J. Cunningham, J. Halverson, D. Krioukov, C. Long and B. D. Nelson, "Vacuum Selection from Cosmology on Networks of String Geometries," arXiv:1711.06685 [hep-th].
[31] S.-i. Amari, "A method of statistical neurodynamics," Kybernetik 14.4, 201 (1974).
[32] S.-i. Amari, R. Karakida and M. Oizumi, "Statistical Neurodynamics of Deep Networks: Geometry of Signal Spaces," arXiv:1808.07169 [cond-mat.dis-nn].
[33] H. Y. Hu, S. H. Li, L. Wang and Y. Z. You, "Machine Learning Holographic Mapping by Neural Network Renormalization Group," arXiv:1903.00804 [cond-mat.dis-nn].
[34] J. M. Maldacena, "Eternal black holes in anti-de Sitter," JHEP 0304, 021 (2003) [hep-th/0106112].
[35] T. Hartman and J. Maldacena, "Time Evolution of Entanglement Entropy from Black Hole Interiors," JHEP 1305, 014 (2013) [arXiv:1303.1080 [hep-th]].
[36] H. Matsueda, M. Ishihara and Y. Hashizume, "Tensor network and a black hole," Phys. Rev. D 87, no. 6, 066002 (2013) [arXiv:1208.0206 [hep-th]].
[37] A. Mollabashi, M. Nozaki, S. Ryu and T. Takayanagi, "Holographic Geometry of cMERA for Quantum Quenches and Finite Temperature," JHEP 1403, 098 (2014) [arXiv:1311.6095 [hep-th]].
[38] S. de Haro, S. N. Solodukhin and K. Skenderis, "Holographic reconstruction of space-time and renormalization in the AdS/CFT correspondence," Commun. Math. Phys. 217, 595 (2001) [hep-th/0002230].
[39] R. K. Srivastava, K. Greff and J. Schmidhuber, "Highway networks," arXiv:1505.00387 [cs.LG].
[40] See also an essay [19].
[41] Indeed, folding the deep Boltzmann machines leads to an RBM.
[42] We omit various details in this formula, such as the conformal dimension dependence.
[43] A continuum limit of the deep layers was studied in a different context [21].
[44] However, note that z = ∞ corresponds to the strictly vanishing energy in the QFT, which is not the main constituent of the partition function Z[J] except for IR regularization ambiguities.
[45] For this, one may need to use some approximation and some assumption on superselection sectors to perform the path integrations.
[46] The word "regularization" here refers to Tikhonov regularization, not to the lattice regularization.
[47] Note also that the black hole horizon behavior of the weights, (36), is consistent with this regularization.
[48] Highway networks [39] may be another way.