Arxiv:2006.14698V1 [Math.NA] 10 Jun 2020
Total Page:16
File Type:pdf, Size:1020Kb
Entanglement-Embedded Recurrent Network Architecture Entanglement-Embedded Recurrent Network Architecture: Tensorized Latent State Propagation and Chaos Forecasting Xiangyi Meng1, a) and Tong Yang2, b) 1)Center for Polymer Studies and Department of Physics, Boston University, USA 2)Department of Physics, Boston College, USA (Dated: 29 June 2020) Chaotic time series forecasting has been far less understood despite its tremendous potential in theory and real-world applications. Traditional statistical/ML methods are inefficient to capture chaos in nonlinear dynamical systems, espe- cially when the time difference Dt between consecutive steps is so large that a trivial, ergodic local minimum would most likely be reached instead. Here, we introduce a new long-short-term-memory (LSTM)-based recurrent architec- ture by tensorizing the cell-state-to-state propagation therein, keeping the long-term memory feature of LSTM while simultaneously enhancing the learning of short-term nonlinear complexity. We stress that the global minima of chaos can be most efficiently reached by tensorization where all nonlinear terms, up to some polynomial order, are treated ex- plicitly and weighted equally. The efficiency and generality of our architecture are systematically tested and confirmed by theoretical analysis and experimental results. In our design, we have explicitly used two different many-body entan- glement structures—matrix product states (MPS) and the multiscale entanglement renormalization ansatz (MERA)—as physics-inspired tensor decomposition techniques, from which we find that MERA generally performs better than MPS, hence conjecturing that the learnability of chaos is determined not only by the number of free parameters but also the tensor complexity—recognized as how entanglement entropy scales with varying matricization of the tensor. I. INTRODUCTION ing methods. The notorious indication of chaos, jdx j ≈ elt jdx j; (1) Time series forecasting1, despite its undoubtedly tremen- t 0 dous potential in both theoretical issues (e.g., mechanical (where l denotes the spectrum of Lyapunov exponents) sug- analysis, ergodicity) and real-world applications (e.g., traffic, gests that the difficulty of chaos forecasting is two-fold: first, weather), has long been known as an intricate field. From any small error will propagate exponentially, and thus multi- classical work of statistics such as auto-regressive moving step-ahead predictions will be exponentially worse than one- 2 average (ARMA) families and basic hidden Markov mod- step-ahead ones; second, and more subtly, when the actual 3 els (HMM) to contemporary machine-learning (ML) meth- time difference Dt between consecutive steps increases, the 4 ods such as gradient boosted trees (GBT) and neural net- minimum redundancy needed for smoothly descending to the works (NN), the essential complexity inhabiting in time series global minima during NN training also increases exponen- has been more and more recognized, the forecasting models tially. Most studies on chaos forecasting only address the first having extended themselves from linear, Markovian assump- difficulty by improving prediction accuracy at the global min- tions to nonlinear, non-Markovian, and even more general sit- ima, yet the latter is indeed more crucial, especially when Dt 5 uations . Among all known methods, recurrent NN archi- is so large that a trivial, ergodic local minimum would most 6 7 tectures , including plain recurrent neural networks (RNN) likely be reached instead. Recently, tensorization has been 8 and long short-term memory (LSTM) , are the most capa- introduced in recurrent NN architectures16,17. A tensorized ble of capturing the complexity, as they admit the funda- version of HO-RNN/LSTM, namely HOT-RNN/LSTM18, has mental recurrent behavior of time series data. LSTM has claimed its advantage in learning long-term nonlinearity on 9 proved useful on speech recognition and video analysis tasks Lorenz systems of small Dt. On the one hand, we believe where maintaining long-term memory is the essential com- that the global minima of chaos (where dominance of linear plexity. To this objective, novel architectures such as higher- dependence is absent) can be most efficiently reached by ten- 10 arXiv:2006.14698v1 [math.NA] 10 Jun 2020 order RNN/LSTM (HO-RNN/LSTM) have been introduced sorization approaches, where all nonlinear terms, up to some to capture long-term non-Markovianity explicitly, further im- polynomial order, are treated explicitly and weighted equally. proving performance and leading to more accurate theoretical On the other hand, for simple chaotic dynamical systems, non- analysis. linear complexity is only encoded in the short term, not long Still, another dominance of complexity—chaos—has been term, which HO/HOT models will not be very useful in cap- far less understood. Even though enormous theory/data- turing when Dt is large. Hence, a new tensorization-based driven studies on forecasting chaotic time series by recurrent recurrent NN architecture is desired so as to push our under- NN have been conducted11–15, there is still no clear guidance standing of chaos in time series and meet practical needs, e.g., of which features play the most general role in chaos forecast- modeling laminar flame fronts and chemical oscillations19–22. In this paper, we introduce a new LSTM-based architec- ture by tensorizing the cell-state-to-state propagation therein, keeping the long-term memory feature of LSTM while si- a)Electronic mail: [email protected]. multaneously enhancing learning of short-term nonlinear b)Electronic mail: [email protected]. complexity. Compared with traditional LSTM architec- Entanglement-Embedded Recurrent Network Architecture 2 23 tures including stacked LSTM and other aforementioned Next, a realization of an LSTM is given by letting state st statsitics/ML-based forecasting methods, our model is shown and cell state ct be h-dimensional covectors and input xt to be a general and the best approach for capturing chaos in al- be a d-dimensional covector (Fig. 1). Therefore Wi (also most every typical chaotic continuous-time dynamical system Wm, W f , and Wo) has a direct-sum contravariant realization and discrete-time map with controlled comparable NN train- Wi 2 M(h;1) ⊕ M(h;d) ⊕ M(h;h) and contains h(1 + d + h) ing conditions, justified by both our theoretical analysis and free real parameters at maximum. During NN training, only experimental results. Our model has also been tested on real- the free parameters of each linear operator W are learnable, world time series datasets where the improvements range up while all s (i.e., activation functions) are fixed to be tanh, to 6:3%. sigmoid or other nonlinear functions. ct is known to suffer During the tensorization, we have explicitly embedded less from the vanishing gradient problem and thus captures many-body quantum state structures—a way of reducing the long-term memory better, while st tends to capture short-term exponentially large degree of freedom of a tensor (a.k.a. ten- dependence. sor decomposition)—from condensed matter physics, which Our tensorized LSTM architecture (Fig. 1) is exactly based is not uncommon for NN design24. Since a many-body on Eq. (2), from which the only change is entangled state living in a tensor-product Hilbert space is st = g(1;xt− ;st− ;Wo) ◦ g(T (s(ct ));W ): (3) hardly separable, this similarity motivates us to adopt a spe- 1 1 T cial measure of tensor complexity, namely, the entanglement g(T (s(ct ));WT ) is called a tensorized state propagation 25 entropy (EE) , which is commonly used in quantum physics function as WT : T ! G acts on a covariant tensor, and quantum information24. For one-dimensional many-body O O T (s(ct )) = 1 ⊕ q = (1 ⊕ W (s(ct ))): (4) states, two thoroughly studied, popular but different struc- t;l l l l tures exist: multiscale entanglement renormalization ansatz (MERA)26 or matrix product states (MPS)27, of which the EE Each Wl in Eq. (4) maps from the cell state ct to a new cov- scales with the subsystem size or not at all, respectively25. For ector qt;l 2 Q. Here, Q is named a local q-space, as by way most pertinent studies, MPS has been proved efficient enough of analogy it is considered encoding the local degree of free- to be applicable to a variety of tasks28–31. However, our exper- dom in quantum mechanics. Q can be extended to the com- iments show that, regarding our entanglement-embedded de- plex number field if necessary. Mathematically, Eq. (4) offers sign of the new tensorized LSTM architecture, LSTM-MERA the possibility of directly constructing orthogonal polynomi- als up to order L from s(ct ) to build up nonlinear complex- performs even better than LSTM-MPS in general without in- ⊗L ity. In fact, when L goes to infinity, T = lim (1 ⊕ Q) = creasing the number of parameters. Our finding leads to an- L!¥ other interesting result: we conjecture that not only should 1 ⊕ Q ⊕ Q ⊗ Q ⊕ ··· becomes a tensor algebra (up to a mul- tensorization be introduced, but the tensor’s EE has to scale tiplicative coefficient), and T (s(ct )) admits any nonlinear with the system size as well—and hence MERA is more effi- smooth function of ct . cient than MPS at learning chaos. Following the same procedure above, Eq. (4) is realized by choosing L independent realizations, Wl 2 M(P − 1;h), l = 1;2;··· ;L, which in total contain L(P − 1)h learnable pa- II. TENSORIZED STATE PROPAGATION rameters at maximum, 00 1 1 0 1 1 ··· 0 1 11 0[tanhc ] 1 A. Formalism t 1 linear + [q ] [q ] ··· [q ] [tanhc ] BB t 21C B t 22C B t 2LCC B t 2C padding BB[q ] C B[q ] C ··· B[q ] CC B . C −−−−−−! BB t 31C B t 32C B t 3LCC; The formalism starts from an operator-theoretical perspec- @ .