Arxiv:2010.07103V2 [Physics.Data-An] 4 Dec 2020 Often Shows Superior Performance
Total Page:16
File Type:pdf, Size:1020Kb
Breaking Symmetries of the Reservoir Equations in Echo State Networks Joschka Herteux1, a) and Christoph R¨ath1, b) Institut f¨urMaterialphysik im Weltraum, Deutsches Zentrum f¨urLuft- und Raumfahrt, M¨unchnerStr. 20, 82234 Wessling, Germany (Dated: 7 December 2020) Reservoir computing has repeatedly been shown to be extremely successful in the prediction of nonlinear time-series. However, there is no complete understanding of the proper design of a reservoir yet. We find that the simplest popular setup has a harmful symmetry, which leads to the prediction of what we call mirror- attractor. We prove this analytically. Similar problems can arise in a general context, and we use them to explain the success or failure of some designs. The symmetry is a direct consequence of the hyperbolic tangent activation function. Further, four ways to break the symmetry are compared numerically: A bias in the output, a shift in the input, a quadratic term in the readout, and a mixture of even and odd activation functions. Firstly, we test their susceptibility to the mirror-attractor. Secondly, we evaluate their performance on the task of predicting Lorenz data with the mean shifted to zero. The short-time prediction is measured with the forecast horizon while the largest Lyapunov exponent and the correlation dimension are used to represent the climate. Finally, the same analysis is repeated on a combined dataset of the Lorenz attractor and the Halvorsen attractor, which we designed to reveal potential problems with symmetry. We find that all methods except the output bias are able to fully break the symmetry with input shift and quadratic readout performing the best overall. Reservoir computing describes a kind of recurrent devices of daily living. But the application of ML also neural network, which has been very successful pervades more and more areas of science including re- in the prediction of chaotic systems. However, search on complex systems. For a very recent collection the details of its inner workings have yet to be see1 and references therein. fully understood. One important aspect of any In nonlinear dynamics ML-based methods have recently neural network is the activation function. Even attracted a lot of attention, because it was demonstrated though its effects have been extensively studied that the exact short term prediction of nonlinear sys- in other Machine Learning techniques, there are tem can be significantly improved. Furthermore, it was still open questions in the context of reservoir shown that ML techniques also allow for a very accu- computing. Our research aims to fill this gap. rate reproduction of the long term properties ("the cli- We prove analytically that an antisymmetric ac- mate") of complex systems.2,3 Several ML methods like tivation function like the hyperbolic tangent leads deep feed-forward artificial neural network (ANN), recur- to a disastrous symmetry in a popular setup we rent neural network (RNN) with long short-term memory call simple ESN. This leads the reservoir to learn (LSTM) and reservoir computing (RC) fulfill the predic- an inverted version of the training data we call tion tasks.4,5 RC has attracted most attention. It is a mirror-attractor, which we demonstrate numeri- machine learning method that has been independently cally. This heavily perturbs any prediction, es- discovered as Liquid State Machines (LSM) by Maass6 pecially if the mirror-attractor overlaps with the and as echo state networks (ESN) by Jaeger.7 We focus real attractor. Further, we compare four different here on the ESN approach, which falls under the cat- ways to break the symmetry. We test numerically egory of Recurrent Neural Networks (RNN). The main if they tend to learn the mirror-attractor and test difference to other RNNs such as LSTMs is that in RC their performance on two tasks where the simple only the last layer is explicitly trained via linear regres- ESN fails. We find that three of them are able to sion. Instead of hidden layers it uses a so-called reservoir, fully break the symmetry. which in the case of the ESN is typically a network with recurrent connections. The popularity of RC has several reasons. First, RC arXiv:2010.07103v2 [physics.data-an] 4 Dec 2020 often shows superior performance. Second, ESNs of- I. INTRODUCTION fer conceptual advantages. As only the output layer is explicitly trained, the number of weights to be ad- Machine Learning (ML) has shown to be tremendously justed is very small. Thus, the training of ESNs is successful in categorization and recognition tasks and the comparably transparent, extremely CPU-efficient (orders use of ML algorithms has become common in technical of magnitude faster than for ANNs) and the vanishing- gradient-problem is circumvented. Furthermore, small, smart and energy-efficient hardware implementations us- ing photonic systems,8 spintronic systems9 and many a)Electronic mail: [email protected] more are conceivable and being developed10 (and refer- b)Electronic mail: [email protected] 2 ences therein). the space populated by the trajectory.16 The correlation Ongoing research is focused on identifying the necessary dimension is based on the discrete form of the correlation conditions for a good reservoir. Recent studies focused integral mainly on the influence of the size and topology of the N reservoir on the prediction capabilities.11{15 Less atten- 1 X C(r) = lim θ(r − jxi − xjj) ; (2) tion has so far been paid on the role of activation and N!1 N 2 i;j=1 onto the overall performance of RC. In this paper we study in detail the sensitivity of RC to which returns the fraction of pairs of points that are symmetries in the activation function. We reveal that closer than the threshold distance r. θ represents the previously reported shortcomings for simple ESNs can Heaviside function. The correlation dimension is then unambiguously be attributed to symmetry properties of defined by the relation the activation function (and not of the input signal). We C(r) / rν (3) propose and assess four different methods to break the symmetries that were developed for obtaining more reli- as the scaling exponent ν. For a self-similar strange at- able prediction results. tractor this relation holds in some range of r, which needs The paper is organized as follows: Sec. II first discusses to be properly calibrated. Here we adjusted it for every the different measures used and the two test systems: the given problem beforehand on simulated data, which is Lorenz and the Halvorsen equations. Afterwards the dif- not used for training or testing. We note that precision ferent ESN designs used in this study are introduced and is not essential here, since we are only interested in com- the symmetry of the simple ESN is proven. In Sec. III parisons and not in absolute values. the three numerical experiments we conducted and their To get the correlation dimension for a given dataset we results are presented. Finally we discuss our findings in use the Grassberger Procaccia algorithm.17 Sec. IV. 3. Largest Lyapunov Exponent II. METHODS The second measure we use to evaluate the climate is A. Measures and System Characteristics the largest Lyapunov exponent. In contrast to the corre- lation dimension it is indicative of the development of the 1. Forecast Horizon system in time. A d-dimensional chaotic system is char- acterized by d Lyapunov exponents of which at least one As in13,14 we use the forecast horizon to measure the is positive. They describe the average rate of exponen- quality of short-time predictions. It is defined as the time tial growth of a small perturbation in each direction in between the start of a prediction and the point where it phase space. The largest Lyapunov exponent λ is the one deviates from the test data more than a fixed threshold. associated with the direction of the fastest divergence. The exact condition reads d(t) = Ceλt : (4) jv(t) − vR(t)j > δ ; (1) Since it dominates the dynamics it has a special signifi- where the norm is taken elementwise. Due to the chaotic cance. It can be calculated from data with relative ease 18 nature of our training data, any small perturbation will by using the Rosenstein algorithm. It is also possible usually grow exponentially with time. Thus, this indi- to determine the complete Lyapunov spectrum from the cates the end of a reliable prediction of the actual tra- equations, which we have access to for our testdata as 3,19 jectory. The measure is not very sensitive to the exact well as for our ESNs. However, we found that the value of the threshold for the same reason. comparison is clearer in our case with the data-driven The threshold generally depends on the direction and we approach, because it is completely independent of details use δ = (5:8; 8:0; 6:9)T for the Lorenz system without of the system, e.g. the question if it is discrete or contin- preprocessing. In general the values of δ are chosen to be uous. This method is also computationally less costly. 1 approximately 15% of the spatial extent of the respective We can further define the Lyapunov time τλ = λ as char- 1 attractor in the given direction. This is useful if the dy- acteristic timescale of a system. We use τλ = 0:87 ≈ 1:15 1 namics of a system takes place on different lengthscales. for the Lorenz system and τλ = 0:74 ≈ 1:35 for the Halvorsen system, based on our measurements in table II. 2. Correlation Dimension To evaluate the climate of a prediction we use two mea- B. Lorenz and Halvorsen system sures.