<<

Differentiable programming and its applications to dynamical systems

Adrian´ Hernandez´ and Jose´ M. Amigo´∗ Centro de Investigaci´onOperativa, Universidad Miguel Hern´andez,Avenida de la Universidad s/n, 03202 Elche, Spain

Abstract Differentiable programming is the combination of classical neural networks modules with algorithmic ones in an end- to-end differentiable model. These new models, that use automatic differentiation to calculate , have new learning capabilities (reasoning, attention and memory). In this tutorial, aimed at researchers in nonlinear systems with prior knowledge of , we present this new programming paradigm, describe some of its new features such as attention mechanisms, and highlight the benefits they bring. Then, we analyse the uses and limitations of traditional deep learning models in the modeling and prediction of dynamical systems. Here, a dynamical system is meant to be a set of state variables that evolve in time under general internal and external interactions. Finally, we review the advantages and applications of differentiable programming to dynamical systems. Keywords: Deep learning, differentiable programming, dynamical systems, attention, recurrent neural networks

1. Introduction To keep our exposition concise, this tutorial is aimed at researchers in nonlinear systems with prior knowl- The increase in computing capabilities together with edge of deep learning; see [5] for an excellent intro- new deep learning models has led to great advances in duction to the concepts and methods of deep learning. several tasks [1, 2, 3]. Therefore, this tutorial focuses right away on the lim- Deep learning architectures such as Recurrent Neural itations of traditional deep learning and the advantages Networks (RNNs) and Convolutional Neural Networks of differential programming, with special attention to its (CNNs), as well as the use of distributed representa- application to dynamical systems. By a dynamical sys- tions in natural language processing, have allowed to tem we mean here and hereafter a set of state variables take into account the symmetries and the structure of that evolve in time under the influence of internal and the problem to be solved. possibly external inputs. However, a major criticism of deep learning remains, Examples of differentiable programming techniques namely, that it only performs perception, mapping in- that have been successfully developed in recent years puts to outputs [4]. include A new direction to more general and flexible mod- (i) attention mechanisms [6], which allow the model els is differentiable programming, that is, the combina- to automatically search and learn which parts of a tion of geometric modules (traditional neural networks) source sequence are relevant to predict the target ele- with more algorithmic modules in an end-to-end differ- ment, arXiv:1912.08168v2 [math.DS] 2 May 2020 entiable model. As a result, differentiable programming (ii) self-attention, is a dynamic computational graph composed of differ- (iii) end-to-end Memory Networks [7], and entiable functions that provides not only perception but also reasoning, attention and memory. To efficiently cal- (iv) Differentiable Neural Computers (DNCs) [8], culate , this approach uses automatic differ- which are neural networks (controllers) with an exter- entiation, an algorithmic technique similar to backprop- nal read-write memory. agation and implemented in modern software packages As expected, in recent years there has been a grow- such as PyTorch, Julia, etc. ing interest in applying deep learning techniques to dy- namical systems. In this regard, RNNs and Long Short- Term Memories (LSTMs), specially designed for se- ∗Corresponding author quence modelling and temporal dependence, have been

Preprint submitted to Physica D May 5, 2020 successful in various applications to dynamical systems prediction, game playing and more) [1, 2, 3]. Inter- such as model identification and time series prediction estingly, deep learning models and architectures have [9, 10, 11]. evolved to take into account the structure of the prob- The performance of theses models (e.g. encoder- lem to be resolved. decoder networks), however, degrades rapidly as the Deep learning is a part of machine learning that length of the input sequence increases and they are not is based on neural networks and uses multiple layers, able to capture the dynamic (i.e., time-changing) inter- where each extracts higher level features from the dependence between time steps. The combination of input. RNNs are a special class of neural networks neural networks with new differentiable modules could where outputs from previous steps are fed as inputs to overcome some of those problems and offer new oppor- the current step [13, 14]. This recurrence makes them tunities and applications. appropriate for modelling dynamic processes and sys- Among the potential applications of differentiable tems. programming to dynamical systems let us mention CNNs are neural networks that alternate convolu- (i) attention mechanisms to select the relevant time tional and pooling layers to implement translational in- steps and inputs, variance [15]. They learn spatial hierarchies of features (ii) memory networks to store historical data from dy- through by using these building layers. namical systems and selectively use it for modelling and CNNs are being applied successfully to prediction, and and image processing [16]. (iii) the use of differentiable components in scientific Especially important is the use of distributed rep- computing. resentations as inputs to natural language processing Despite some achievements, more work is still needed pipelines. With this technique, the words of the vocab- to verify the benefits of these models over traditional ulary are mapped to an element of a with a networks. much lower dimensionality [17, 18]. This word embed- Thanks to software libraries that facilitate auto- ding is able to keep, in the learned vector space, some matic differentiation, differentiable programming ex- of the syntactic and semantic relationships presented in tends deep learning models with new capabilities (rea- the original data. soning, memory, attention, etc.) and the models can be Let us recall that, in a feedforward neural network efficiently coded and implemented. (FNN) composed of multiple layers, the output (without In the following sections of this tutorial we introduce the bias term) at layer l, see Figure 1, is defined as differentiable programming and explain in detail why it is an extension of deep learning (Section 2). We de- xl+1 = σ(Wl xl), (1) scribe some models based on this new approach such as attention mechanisms (Section 3.1), memory networks Wl being the weight at layer l. σ is the activation and differentiable neural computers (Section 3.2), and and xl+1, the output vector at layer l and the continuous learning (Section 3.3). Then we review the input vector at layer l + 1. The weight matrices for the use of deep learning in dynamical systems and their lim- different layers are the parameters of the model. itations (Section 4.1). And, finally, we present the new Learning is the mechanism by which the parame- opportunities that differentiable programming can bring ters of a neural network are adapted to the environment to the modelling, simulation and prediction of dynami- in the training process. This is an optimization prob- cal systems (Section 4.2). The conclusions and outlook lem which has been addressed by using -based are summarized in Section 5. methods, in which given a cost function f : Rn → R, ∗ the algorithm finds local minima w = arg minw f (w) updating each layer parameter wi j with the rule wi j := 2. From deep learning to differentiable program- wi j − η∇wi j f (w), η > 0 being the learning rate. ming In addition to regarding neural networks as universal approximators, there is no sound theoretical explanation In recent years, we have seen major advances in the for a good performance of deep learning. Several theo- field of machine learning. The combination of deep retical frameworks have been proposed: neural networks with the computational capabilities of Graphics Processing Units (GPUs) [12] has improved (i) As pointed out in [19], the class of functions of the performance of several tasks (image recognition, practical interest can be approximated with expo- machine translation, language modelling, time series nentially fewer parameters than the generic ones. 2 (i) Programs are directed acyclic graphs. (ii) Graph nodes are mathematical functions or vari- ables and the edges correspond to the flow of in- termediate values between the nodes. (iii) n is the number of nodes and l the number of input variables of the graph, with 1 ≤ l < n. vi for i ∈ {1, ..., n} is the variable associated with node i. (iv) E is the set of edges in the graph. For each (i, j) ∈ E we have i < j, therefore the graph is topologically ordered. (v) fi for i ∈ {(l+1), ..., n} is the differentiable function computed by node i in the graph. αi for i ∈ {(l + 1), ..., n} contains all input values for node i. (vi) The forward algorithm or pass, given input vari- Figure 1: Multilayer neural network. ables v1, ..., vl calculates vi = fi(αi) for i = {(l + 1), ..., n}. Symmetry, locality and compositionality proper- (vii) The graph is dynamically constructed and com- ties make it possible to have simpler neural net- posed of parametrizable functions that are differ- works. entiable and whose parameters are learned from (ii) From the point of view of information theory [20], data. an explanation has been put forward based on how Then, neural networks are just a class of these differ- much information each layer of the neural network entiable programs composed of classical blocks (feed- retains and how this information varies with the forward, recurrent neural networks, etc.) and new ones training and testing process. such as differentiable branching, attention, memories, Although deep learning can implicitly implement etc. logical reasoning [21], it has limitations that make it dif- Differentiable programming can be seen as a con- ficult to achieve more general intelligence [4]. Among tinuation of the deep learning end-to-end architectures these limitations, we can highlight the following: that have replaced, for example, the traditional linguistic components in natural language processing [23, 24]. To (i) It only performs perception, representing a map- efficiently calculate the derivatives in a , ping between inputs and outputs. this approach uses automatic differentiation, an algo- (ii) It follows a hybrid model where synaptic weights rithmic technique similar but more general than back- perform both processing and memory tasks but propagation. doesn’t have an explicit external memory. Automatic differentiation, in its reverse mode and in (iii) It does not carry out conscious and sequential rea- contrast to manual, symbolic and numerical differen- soning, a process that is based on perception and tiation, computes the derivatives in a two-step process memory through attention. [25, 26]. As described in [25] and rearranging the in- A path to a more general intelligence, as we will see dexes of the previous definition, a function f : Rn → Rm below, is the combination of geometric modules with is constructed with intermediate variables vi such that: more algorithmic modules in an end-to-end differen- (i) variables vi−n = xi, i = 1, ..., n are the inputs vari- tiable model. This approach, called differentiable pro- ables. gramming, adds new parametrizable and differentiable (ii) variables v , i = 1, ..., l are the intermediate vari- components to traditional neural networks. i ables. Differentiable programming, a broad term, is defined (iii) variables y = v , i = m − 1, ..., 0 are the output in [22] as a programming model (model of how a com- m−i l−i variables. puter program is executed), trainable with gradient de- scent, where neural networks are truly functional blocks In a first step, similar to the forward pass described with data-dependent branches and recursion. before, the computational graph is built populating in- Here, and for the purposes of this tutorial, we define termediate variables vi and recording the dependencies. differentiable programming as a programming model In a second step, called the backward pass, derivatives with the following characteristics: are calculated by propagating for the output y j being 3 ∂y j considered, the adjoints vi = from the output to the into optimizable modules. For instance, if the con- ∂vi inputs. dition a of an ”if” primitive (e.g., if a is satisfied do The reverse mode is more efficient to evaluate for y(x), otherwise do z(x)) is to be learned, it can be the functions with a large number of inputs (parameters) output of a neural network (linear transformation and and a small number of outputs. When f : Rn → R, a ) and the conditional primitive will as is the case in machine learning with n very large and transform into a weighted combination of both branches f the cost function, only one pass of the reverse mode ay(x) + (1 − a)z(x). Similarly, in an attention module, is necessary to compute the gradient ∇ f = ( ∂y , ..., ∂y ). different weights that are learned with the model are as- ∂x1 ∂xn In the last years, deep learning frameworks such signed to give a different influence to each part of the as PyTorch have been developed that provide reverse- input. Figure 2 shows the computational graph of a con- mode automatic differentiation [27]. The define-by-run ditional branching. philosophy of PyTorch, whose execution dynamically constructs the computational graph, facilitates the de- velopment of general differentiable programs. Differentiable programming is an evolution of classi- cal (traditional) software programming where, as shown in Table 1:

(i) Instead of specifying explicit instructions to the computer, an objective is set and an optimizable architecture is defined which allows to search in a subset of possible programs. (ii) The program is defined by the input-output data and not predefined by the user. (iii) The algorithmic elements of the program have to be differentiable, say, by converting them into dif- ferentiable blocks. Figure 2: Computational graph of differentiable branching. Classical Differentiable Sequence of instructions Sequence of diff. primitives The process of extending deep learning with differen- Fixed architecture Optimizable architecture tiable primitives would consist of the following steps: User defined Data defined Imperative programming Declarative programming Intuitive Abstract (i) Select a new function that improves the classical input-output transformation of deep learning, e.g. Table 1: Differentiable vs classical programming. attention, continuous learning, memories, etc. (ii) Convert this function into a directed acyclic graph, RNNs, for example, are an evolution of feedforward a sequence of parametrizable and differentiable networks because they are classical neural networks in- functions. For example, Figure 2 shows this se- side a for-loop (a control flow statement for iteration) quence of operations used in attention for differ- which allows the neural network to be executed repeat- entiable branching. edly with recurrence. However, this for-loop is a prede- (iii) Integrate this new function into the base model. fined feature of the model. Differentiable programming allows to dynamically constructs the graph and vary the length of the loop. Then, the ideal situation would be to In this way, using differentiable programming we can augment the neural network with programming primi- combine traditional perception modules (CNN, RNN, tives (for-loops, if branches, while statements, external FNN) with additional algorithmic modules that provide memories, logical modules, etc.) that are not predefined reasoning, abstraction and memory [28]. In the follow- by the user but are parametrizable by the training data. ing section we describe, by following this process, some The trouble is that many of these programming prim- examples of this approach that have been developed in itives are not differentiable and need to be converted recent years. 4 3. Differentiable learning and reasoning exp(score(q, ki)) αi = PT . (4) 3.1. Differentiable attention i0=1 exp(score(q, ki0 )) One of the aforementioned limitations of deep learn- The score function can be computed using a feedfor- ing models is that they do not perform conscious and ward neural network: sequential reasoning, a process that is based on percep- tion and memory through attention. score(q, ki) = Za tanh(Wa[q, ki])), (5) Reasoning is the process of consciously establishing as proposed in [6], where Z and W are matrices to and verifying facts combining attention with new or ex- a a be jointly learned with the rest of the model and [q, k ] isting information. An attention mechanism allows the i is a or concatenation of q and k . Also, brain to focus on one part of the input or memory (im- i in [31] the authors use a cosine similarity measure for age, text, etc), giving less attention to others. content-based attention, namely, Attention mechanisms have provided and will pro- vide a paradigm shift in machine learning. From tradi- score(q, k ) = cos((q, k )) (6) tional large-scale vector transformations to a more con- i i scious process that focuses only on a set of elements, where ((q, ki)) denotes the angle between q and ki. e.g. decomposing a problem into a sequence of atten- Then, differentiable attention can be seen as a sequen- tion based reasoning operations [29]. tial process of reasoning in which the task (query) is guided by a set of elements of the input source (or mem- ory) using attention. The attention process can focus on: (i) Temporal dimensions, e.g. different time steps of a sequence. (ii) Spatial dimensions, e.g. different regions of an im- age. (iii) Different elements of a memory. (iv) Different features or dimensions of an input vec- tor, etc. Figure 3: Attention diagram. Depending on where the process is initiated, we have:

One way to make this attention process differentiable (i) Top-down attention, initiated by the current task. is to make it a convex combination of the input or mem- (ii) Bottom-up, initiated spontaneously by the source ory, where all the steps are differentiable and the com- or memory. bination weights are parametrizable. As in [30], this differentiable attention process is de- 3.1.1. Attention mechanisms in seq2seq models scribed as mapping a query and a set of key-value pairs RNNs (see Figure 4) are a basic component of mod- to an output: ern deep learning architectures, especially of encoder- decoder networks. The following equations define the XT time evolution of an RNN: att(q, s) = αi(q, ki)Vi, (2) i=1 h ih hh ht = f (W xt + W ht−1), (7) where, as seen in figure 3, ki and Vi are the key and the o ho value vectors from the source/memory s, and q is the yt = f (W ht), (8) query vector. αi(q, ki) is the similarity function between Wih, Whh and Who being weight matrices. f h and f o are the query and the corresponding key and is calculated by the hidden and output activation functions while xt, ht applying the : and yt are the network input, hidden state and output. An evolution of RNNs are LSTMs [32], an RNN exp(zi) S o f tmax(zi) = P (3) structure with gated units, i.e. regulators. LSTM are 0 exp(z 0 ) i i composed of a cell, an input gate, an output gate and a to the score function score(q, ki): forget gate, and allow gradients to flow unchanged. The 5 at time t, m is the size of the hidden state and f1 is an RNN (or any of its variants).

(ii) A decoder, where st is the hidden state and whose initial state s0 is initialized with the last hidden state of the encoder hT . It generates the output o sequence Y = (y1, y2, ..., yT 0 ), yt ∈ R (the dimen- sion o depending on the task), with

yt = f2(st−1, yt−1), (10)

Figure 4: Temporal structure of a . where f2 is an RNN (or any of its variants) with an additional softmax layer. memory cell remembers values over arbitrary time inter- Because the encoder compresses all the information vals and the three gates regulate the flow of information of the input sequence in a fixed-length vector (the final into and out of the cell. hidden state hT ), the decoder possibly does not take into An encoder-decoder network maps an input sequence account the first elements of the input sequence. The use to a target one with both sequences of arbitrary length of this fixed-length vector is a limitation to improve the [2]. They have applications ranging from machine performance of the encoder-decoder networks. More- translation to time series prediction. over, the performance of encoder-decoder networks de- grades rapidly as the length of the input sequence in- creases [33]. This occurs in applications such as ma- chine translation and time series predition, where it is necessary to model long time dependencies. The key to solve this problem is to use an atten- tion mechanism. In [6] an extension of the basic encoder-decoder arquitecture was proposed by allowing the model to automatically search and learn which parts of a source sequence are relevant to predict the target element. Instead of encoding the input sequence in a fixed-length vector, it generates a sequence of vectors, choosing the most appropriate subset of these vectors during the decoding process. With the attention mechanism, the encoder is a bidi- −→ rectional RNN [34] with a forward hidden state h = −→ ←− ←− i f1( h i−1, xi) and a backward one hi = f1( h i+1, xi). The encoder state is represented as a simple concatenation Figure 5: An encoder-decoder network. of the two states, More specifically, this mechanism uses an RNN (or −→ ←− h = [h ; h ], (11) any of its variants, an LSTM or a GRU, Gated Recur- i i i rent Unit) to map the input sequence to a fixed-length with i = 1, ..., T. The encoder state includes both the vector, and another RNN (or any of its variants) to de- preceding and following elements of the sequence, thus code the target sequence from that vector (see Figure 5). capturing information from neighbouring inputs. Such a seq2seq model features normally an architecture The decoder has an output composed of:

(i) An encoder which, given an input sequence X = yt = f2(st−1, yt−1, ct) (12) n (x1, x2, ..., xT ) with xt ∈ R , maps xt to 0 for t = 1, ..., T . f2 is an RNN with an additional soft- max layer, and the input is a concatenation of yt−1 with ht = f1(ht−1, xt), (9) the context vector ct, which is a sum of hidden states of m where ht ∈ R is the hidden state of the encoder the input sequence weighted by alignment scores: 6 XT ct = αti hi. (13) i=1

Similar to equation (4), the weight αti of each state hi is calculated by

exp(score(st−1, hi)) αti = PT . (14) i0=1 exp(score(st−1, hi0 )) In this attention mechanism, the query is the state st−1 and the key and the value are the hidden states hi. The score measures how well the input at position i and the output at position t match. αti are the weights that im- plement the attention mechanism, defining how much of each input hidden state should be considered when de- Figure 7: A matrix of alignment scores. ciding the next state st and generating the output yt (see Figure 6). Another variant of attention are end-to-end memory networks [7], which we describe in Section 4.2.2 and are neural networks with a recurrent attention model over an external memory. The model, trained end-to-end, outputs an answer based on a set of inputs x1, x2, ..., xn stored in a memory and a query. Traditional computers are based on the von Neumann architecture which has two basic components: the CPU (Central Processing Unit), which carries out the pro- gram instructions, and the memory, which is accessed by the CPU to perform write/read operations. In con- trast, neural networks follow a hybrid model where synaptic weights perform both processing and memory tasks. Neural networks and deep learning models are good at mapping inputs to outputs but are limited in their abil- Figure 6: An encoder-decoder network with attention. ity to use facts from previous events and store useful information. Differentiable Neural Computers (DNCs) As we have described previously, the score function [8] try to overcome these shortcomings by combining can be parametrized using different alignment models neural networks with an external read-write memory. such as feedforward networks and the cosine similarity. As described in [8], a DNC is a neural network, called An example of a matrix of alignment scores can be the controller (playing the role of a differentiable CPU), seen in Figure 7. This matrix provides interpretability with an external memory, an N × W matrix. The DNC to the model since it allows to know which part (time- uses differentiable attention mechanisms to define dis- step) of the input is more important to the output. tributions (weightings) over the N rows and learn the importance each row has in a read or write operation. 3.2. Other attention mechanisms and differentiable To select the most appropriate memory components neural computers during read/write operations, a weighted sum w(i) is A variant of the attention mechanism is self-attention, used over the memory locations i = 1, ..., N. The atten- in which the attention component relates different posi- tion mechanism is used in three different ways: tions of a single sequence in order to compute a repre- sentation of the sequence. In this way, the keys, values (i) Access content (read or write) based on similarity. and queries come from the same source. The mecha- (ii) Time ordered access (temporal links) to recover nism can connect distant elements of the sequence more the sequences in the order in which they were writ- directly than using RNNs [35]. ten. 7 (iii) Dynamic memory allocation, where the DNC as-   signs and releases memory based on usage per-  X  y = tanh  (w + α H (t))y , (15) centage. j  i j i j i j i i∈inputs  At each time step, the DNC gets an input vector and emits an output vector that is a function of the combina- Hi j(t + 1) = ηyiy j + (1 − η)Hi j(t). (16) tion of the input vector and the memories selected. Then, during the initial training period, wi j and αi j DNCs, by combining the following characteristics, are trained using gradient descent and after this period, have very promising applications in complex tasks that the model keeps learning from ongoing experience. require both perception and reasoning:

(i) The classical perception capability of neural net- 4. Dynamical systems and differentiable program- works. ming (ii) Read and write capabilities based on content sim- ilarity and learned by the model. 4.1. Modeling dynamical systems with neural networks (iii) The use of previous knowledge to plan and reason. Dynamical systems deal with time-evolutionary pro- (iv) End-to-end differentiability of the model. cesses and their corresponding systems of equations. At any given time, a dynamical system has a state that (v) Implementation using software packages with au- can be represented by a point in a state space (mani- tomatic differentiation libraries such as PyTorch, fold). The evolutionary process of the dynamical sys- Tensorflow or similar. tem describes what future states follow from the current state. This process can be deterministic, if its entire fu- 3.3. Meta-plasticity and continuous learning ture is uniquely determined by its current state, or non- The combination of geometric modules (classical deterministic otherwise [38] (e.g., a random dynamical neural networks) with algorithmic ones adds new learn- system [39]). Furthermore, it can be a continuous-time ing capabilities to deep learning models. In the previ- process, represented by differential equations or, as in ous sections we have seen that one way to improve the this paper, a discrete-time process, represented by dif- learning process is by focusing on certain elements of ference equations or maps. Thus, the input or a memory and making this attention differ- entiable. ht = f (ht−1; θ) (17) Another natural way to improve the process of learn- for autonomous discrete-time deterministic dynamical ing is to incorporate differentiable primitives that add systems with parameters θ, and flexibility and adaptability. A source of inspiration is neuromodulators, which furnish the traditional synap- h = f (h , x ; θ) (18) tic transmission with new computational and processing t t−1 t capabilities [36]. for non-autonomous discrete-time deterministic dynam- Unlike the continuous learning capabilities of ani- ical systems driven by an external input xt. mal brains, which allow animals to adapt quickly to Dynamical systems have important applications in the experience, in neural networks, once the training physics, chemistry, economics, engineering, biology is completed, the parameters are fixed and the network and medicine [40]. They are relevant even in day-to-day stops learning. To solve this issue, in [37] a differen- phenomena with great social impact such as tsunami tiable plasticity component is attached to the network warning, earth temperature analysis and financial mar- that helps previously-trained networks adapt to ongoing kets prediction. experience. Dynamical systems that contain a very large number The process to introduce the differentiable plastic of variables interacting with each other in non-trivial component in the network is as follows. The activation ways are sometimes called complex (dynamical) sys- y j of neuron j has a conventional fixed weight wi j and tems [41]. Their behaviour is intrinsically difficult to a plastic component αi jHi j(t), where αi j is a structural model due to the dependencies and interactions between parameter tuned during the training period and Hi j(t) a their parts and they have emergence properties arising plastic component automatically updated as a function from these interactions such as adaptation, evolution, of ongoing inputs and outputs. The equations for the ac- learning, etc. tivation of y j with learning rate η, as described in [37], Here we consider discrete-time, deterministic and are: non-autonomous (i.e., the time evolution depending also 8 on exogenous variables) dynamical systems as well as To learn chaotic dynamics, recurrent radial basis the more general complex systems. Specifically, the dy- function (RBF) networks [48] and evolutionary algo- namical systems of interest range from systems of dif- rithms that generate RNNs have been proposed [49]. ference equations with multiple time delays to systems ”Nonlinear Autoregressive model with exogenous in- with a dynamic (i.e., time-changing) interdependence put” (NARX) [50] and boosted RNNs [51] have been between time steps. Notice that the former ones may applied to predict chaotic time series. be rewritten as higher dimensional systems with time However, a difficulty with RNNs is the vanishing gra- delay 1. dient problem [52]. RNNs are trained by unfolding On the other hand, in recent years deep learning mod- them into deep feedforward networks, creating a new els have been very successful in performing various layer for each time step of the input sequence. When tasks such as image recognition, machine translation, backpropagation computes the gradient by the chain game playing, etc. When the amount of training data rule, this gradient vanishes as the number of time-steps is sufficient and the distribution that generates the real increases. As a result, for long input-output sequences, data is the same as the distribution of the training data, as depicted in Figure 8, RNNs have trouble modelling these models perform extremely well and approximate long-term dependencies, that is, relationships between the input-output relation. elements that are separated by large periods of time. In view of the importance of dynamical systems for modeling physical, biological and social phenomena, there is a growing interest in applying deep learning techniques to dynamical systems. This can be done in different contexts, such as: (i) Modeling dynamical systems with known struc- ture and equations but non-analytical or complex solutions [42]. (ii) Modeling dynamical systems without knowledge of the underlying governing equations [43, 44]. In this regard, let us mention that commercial initia- tives are emerging that combine large amounts of meteorological data with deep learning models to improve weather predictions. Figure 8: Vanishing gradient problem in RNNs. Information sensitiv- (iii) Modeling dynamical systems with partial or noisy ity decays over time forgetting the first input. data [45]. A key aspect in modelling dynamical systems is tem- To overcome this problem, LSTMs were proposed. poral dependence. There are two ways to introduce it LSTMs have an advantage over basic RNNs due to their into a neural network [46]: relative insensitivity to temporal delays and, therefore, are appropriate for modeling and making predictions (i) A classical feedforward neural network with time based on time series whenever there exist temporary de- delayed states in the inputs but perhaps with an pendencies of unknown duration. With the appropriate unnecessary increase in the number of parameters. number of hidden units and activation functions [10], (ii) A recurrent neural network (RNN) which, as LSTMs can model and identify any non-linear dynami- shown in Equations (7) and (8), has a temporal re- cal system of the form: currence that makes it appropriate for modelling discrete dynamical systems of the form given in Equations (17) and (18). ht = f (xt, ..., xt−T , ht−1, ..., ht−T ), (19)

Thus, RNNs, specially designed for sequence mod- yt = g(ht), (20) elling [47], seem the ideal candidates to model, analyze and predict dynamical systems in the broad sense used f and g are the state and output functions while xt, ht in this tutorial. The temporal recurrence of RNNs, the- and yt are the system input, state and output. oretically, allows to model and identify dynamical sys- LSTMs have succeeded in various applications to dy- tems described with equations with any temporal depen- namical systems such as model identification and time dence. series prediction [9, 10, 11]. 9 An also remarkable application of the LSTM has applying this mechanism to dynamical systems model- been machine translation [2, 53], using the encoder- ing or prediction, it is necessary to decide the following decoder architecture described in Section 3.1.1. aspects: However, as we have seen, the decoder possibly does not take into account the first elements of the input se- (i) In which phase or phases of the model should the quence because the encoder compresses all the informa- attention mechanism be introduced? tion of the input sequence in a fixed-length vector. Then, (ii) What dimension is the mechanism going to focus the performance of encoder-decoder networks degrades on? Temporal, spatial, etc. rapidly as the length of input sequence increases and (iii) What parts of the system will correspond to the this can be a problem in time series analysis, where pre- query, the key and the value? dictions are based upon a long segment of the series. Furthermore, as depicted in Figure 9, a complex One option, which is also quite illustrative, is to use dynamic may feature interdependencies between time a dual-stage attention, an encoder with input attention steps that vary with time. In this situation, the equation and a decoder with temporal attention, as pointed out in that defines the temporal evolution may change at each [54]. t ∈ 1, ..., T. For these dynamical systems, adding an Here we describe this option, in which the first stage attention module like the one described in Equation 13 extracts the relevant input features and the second se- can help model such time-changing interdependencies. lects the relevant time steps of the model. In many dy- namical systems there are long term dependencies be- tween time steps and these dependencies can be dy- namic, i.e., time-changing. In these cases, attention mechanisms learn to focus on the most relevant parts of the system input or state. n X = (x1, x2, ..., xT ) with xt ∈ R represents the input sequence. T is the length of the time interval and n the number of input features or dimensions. At each time 1 2 n step t, xt = (xt , xt , ..., xt ).

Encoder with input attention Figure 9: Temporal interdependencies in a dynamical system. The encoder, given an input sequence X, maps ut to

h = f (h , u ), (21) 4.2. Improving dynamical systems with differentiable t 1 t−1 t m programming where ht ∈ R is the hidden state of the encoder at Deep learning models together with graphic proces- time t, m is the size of the hidden state and f1 is an sors and large amounts of data have improved the mod- RNN (or any of its variants). xt is replaced by ut, which eling of dynamical systems but this has some limitations adaptively selects the relevant input features with such as those mentioned in the previous section. The 1 1 2 2 n n combination of neural networks with new differentiable ut = (αt xt , αt xt , ..., αt xt ). (22) algorithmic modules is expected to overcome some of αk is the attention weight measuring the importance those shortcomings and offer new opportunities and ap- t of the k input feature at time t and is computed by plications. In the next three subsections we illustrate with exam- k k exp(score(ht−1, x )) ples the kind of applications of differentiable program- αt = , (23) PT exp(score(h , xi)) ming to dynamical systems we have in mind, namely: i=1 t−1 implementations of attention mechanisms, memory net- k k k k where x = (x1, x2, ..., xT ) is the k input feature series works, scientific simulations and modeling in physics. and the score function can be computed using a feed- forward neural network, a cosine similarity measure or 4.2.1. Attention mechanisms in dynamical systems other similarity functions. In the previous sections we have described the atten- Then, this first attention stage extracts the relevant in- tion mechanism, which allows a task to be guided by a put features, as seen in Figure 10 with the corresponding set of elements of the input or memory source. When query, keys and values. 10 Figure 11: Diagram of the input attention mechanism. Figure 10: Diagram of the input attention mechanism. Despite the theoretical advantages and some achieve- Decoder with temporal attention ments, further studies are needed to verify the benefits Similar to the attention decoder described in Section of the attention mechanism over traditional networks. 3.1.1, the decoder has an output 4.2.2. Memory networks yt = f2(st−1, yt−1, ct) (24) Memory networks allow long-term dependencies in 0 for t = 1, ..., T . f2 is an RNN (or any of its variants) sequential data to be learned thanks to an external mem- with an additional linear or softmax layer, and the in- ory component. Instead of taking into account only the most recent states, memory networks consider the entire put is a concatenation of yt−1 with the context vector ct, which is a sum of hidden states of the input sequence list of entries or states. weighted by alignment scores: Here we define one possible application of memory networks to dynamical systems, following an approach T based on [7]. We are given a time series of histori- X i ct = β hi. (25) n t cal data n1, ..., nT 0 with ni ∈ R and the input series i=1 n x1, ..., xT with xt ∈ R the current input, which is the i The weight βt of each state hi is computed using the query. similarity function, score(st−1, hi), and applying a soft- The set {ni} are converted into memory vectors {mi} max function, as described in Section 3.1.1. and output vectors {ci} of dimension d. The query xt is This second attention stage selects the relevant time also transformed to obtain a internal state ut of dimen- steps, as shown in Figure 11 with the corresponding sion d. These transformations correspond to a linear query, keys and values. transformation: Ani = mi, Bni = ci, Cxt = ut, being A, B, C parameterizable matrices. Further remarks A match between ut and each memory vector mi is In [54], the authors define this dual-stage attention computed by taking the inner product followed by a RNN and show that the model outperforms a classical softmax function: model in time series prediction. i T In [55], a comparison is made between LSTMs and pt = S o f tmax(ut mi). (26) attention mechanisms for financial time series forecast- The final vector from the memory, o , is a weighted ing. It is shown there that an LSTM with attention per- t sum over the transformed inputs {c }: form better than stand-alone LSTMs. i

A temporal attention layer is used in [56] to select rel- X i evant information and to provide model interpretability, ot = pt ci. (27) an essential feature to understand deep learning models. j Interpretability is further studied in detail in [57], con- To generate the final prediction yt, a linear layer is cluding that attention weights partially reflect the im- applied to the sum of the output vector ot and the trans- pact of the input elements on model prediction. formed input ut and to the previous output yt−1: 11 In this way, the model can continuously learn be- 1 2 yt = W (ot + ut) + W yt−1 (28) cause the plastic component is updated by neural activity. This model is differentiable end-to-end by learning i DiffTaichi, a differentiable programming language the matrices (the final matrices W ant the three transfor- for building differentiable physical simulations, mation matrices A, B and C) to minimize the prediction is proposed in [62], integrating a neural network error. controller with a physical simulation module. In [58] the authors propose a similar model based on A differentiable physics engine is presented in memory networks with a memory component, three en- [63]. The system simulates rigid body dynamics coders and an autoregressive component for multivari- and can be integrated in an end-to-end differen- ate time-series forecasting. Compared to non-memory tiable deep learning model for learning the physi- RNN models, their model is better at modeling and cap- cal parameters. turing long-term dependencies and, moreover, it is in- (ii) Differentiable ODE solvers. terpretable. As described in [60], an ODE can be embedded Taking advantage of the highlighted capabilities of into a deep learning model. For example, the Euler Differentiable Neural Computers (DNCs), an enhanced method takes in the function and the ini- DNC for electroencephalogram (EEG) data analysis is tial values and outputs the approximated solution. proposed in [59]. By replacing the LSTM network con- The derivative function could be a neural network. troller with a recurrent convolutional network, the po- This solver is differentiable and can be integrated tential of DNCs in EEG signal processing is convinc- into a lager model that can be optimized using gra- ingly demonstrated. dient descent. In [61] a differentiable model of a trebuchet is de- 4.2.3. Scientific simulation and physical modeling scribed. In a classical trebuchet model, the param- Scientific modeling, as pointed out in [60], has tradi- eters (the mass of the counterweight and the angle tionally employed three approaches: of release) are fed into an ODE solver that cal- culates the distance, which is compared with the (i) Direct modeling, if the exact function that relates target distance. input and output is known. In the extended model, a neural network is intro- (ii) Using a machine learning model. As we have duced. The network takes two inputs, the target mentioned, neural networks are universal approx- distance and the current wind speed, and outputs imators. the trebuchet parameters, which are fed into the (iii) Using a differential equation if some structure of simulator to calculate the distance. This distance the problem is known. For example, if the rate of is compared with the target distance and the er- change of the unknown function is a function of ror is back-propagated through the entire model to the physical variables. optimize the parameters of the network. Then, the Machine learning models have to learn the input- neural network is optimized so that the model can output transformation from scratch and need a lot of achieve any target distance. Using this extended data. One way to make them more efficient is to com- model is faster than optimizing only the trebuchet. bine them with a differentiable component suited to a This type of applications shows how combin- specific problem. This component allows specific prior ing differentiable ODE solvers and deep learning knowledge to be incorporated into deep learning models models allows to incorporate previous structure to and can be a differentiable physical model or a differen- the problem and makes the learning process more tiable ODE (ordinary differential equation) solver. efficient. We may conclude that combining scientific com- (i) Differentiable physical models. puting and differentiable components will open Differentiable plasticity, as described in Section new avenues in the coming years. 3.3, can be applied to deep learning models of dy- namical systems in order to help them adapt to on- 5. Conclusions and future directions going data and experience. As done in [37], the plasticity component de- Differentiable programming is the use of new differ- scribed in Equations 15 and 16, can be introduced entiable components beyond classical neural networks. in some layers of the deep learning architecture. This generalization of deep learning allows to have data 12 parametrizable architectures instead of pre-fixed ones [8] A. Graves, G. Wayne, M. Reynolds, T. Harley, I. Danihelka, and new learning capabilities such as reasoning, atten- A. Grabska-Barwinska, S. G. Colmenarejo, E. Grefenstette, tion and memory. T. Ramalho, J. Agapiou, A. P. Badia, K. M. Hermann, Y. Zwols, G. Ostrovski, A. Cain, H. King, C. Summerfield, P. Blunsom, The first models created under this new paradigm, K. Kavukcuoglu, D. Hassabis, Hybrid computing using a neu- such as attention mechanisms, differentiable neural ral network with dynamic external memory, Nature 538 (2016) computers and memory networks, are already having a 471–476. [9] Z. Wang, D. Xiao, F. Fang, R. Govindan, C. Pain, Y. Guo, Model great impact on natural language processing. identification of reduced order fluid dynamics systems using These new models and differentiable programming deep learning, International Journal for Numerical Methods in are also beginning to improve machine learning appli- Fluids 86. doi:10.1002/fld.4416. cations to dynamical systems. As we have seen, these [10] Y.Wang, A new concept using lstm neural networks for dynamic system identification, 2017, pp. 5324–5329. doi:10.23919/ models improve the capabilities of RNNs and LSTMs ACC.2017.7963782. in identification, modeling and prediction of dynamical [11] Y. Li, H. Cao, Prediction for tourism flow based on lstm neu- systems. They even add a necessary feature in machine ral network, Procedia Computer Science 129 (2018) 277–283. learning such as interpretability. doi:10.1016/j.procs.2018.03.076. [12] O. Yadan, K. Adams, Y. Taigman, M. Ranzato, Multi-gpu train- However, this is an emerging field and further re- ing of convnets, CoRR abs/1312.5853. search is needed in several directions. To mention a few: [13] A. Graves, M. Liwicki, S. Fernandez,´ R. Bertolami, H. Bunke, J. Schmidhuber, A novel connectionist system for unconstrained (i) More comparative studies between attention , IEEE Transactions on Pattern Analysis and Machine Intelligence 31 (2009) 855–868. mechanisms and LSTMs in predicting dynamical [14] A. Sherstinsky, Fundamentals of recurrent neural network systems. (rnn) and long short-term memory (lstm) network, ArXiv (ii) Use of self-attention and its possible applications abs/1808.03314. [15] Y. Lecun, L. Bottou, Y. Bengio, P. Haffner, Gradient-based to dynamical systems. learning applied to document recognition, Proceedings of the (iii) As with RNNs, a theoretical analysis (e.g., in the IEEE 86 (1998) 2278 – 2324. doi:10.1109/5.726791. framework of dynamical systems) of attention and [16] R. Yamashita, M. Nishio, R. K. G. Do, K. Togashi, Convolu- tional neural networks: an overview and application in radiol- memory networks. ogy, in: Insights into imaging, 2018. (iv) Clear guidelines so that scientists without ad- [17] Y. Bengio, R. Ducharme, P. Vincent, C. Janvin, A neural vanced knowledge of machine learning can use probabilistic language model, J. Mach. Learn. Res. 3 (2003) 1137–1155. new differentiable models in computational sim- URL http://dl.acm.org/citation.cfm?id=944919. ulations. 944966 [18] T. Mikolov, I. Sutskever, K. Chen, G. Corrado, J. Dean, Distributed representations of words and phrases and their Acknowledgments. This work was financially sup- compositionality, in: Proceedings of the 26th International ported by the Spanish Ministry of Science, Inno- Conference on Neural Information Processing Systems - vation and Universities, grant MTM2016-74921-P Volume 2, NIPS’13, Curran Associates Inc., USA, 2013, pp. 3111–3119. (AEI/FEDER, EU). URL http://dl.acm.org/citation.cfm?id=2999792. 2999959 [19] H. W. Lin, M. Tegmark, Why does deep and cheap learning References work so well?, Journal of Statistical Physics doi:10.1007/ s10955-017-1836-5. [1] Y. LeCun, Y. Bengio, G. Hinton, Deep learning, Nature 521 [20] R. Shwartz-Ziv, N. Tishby, Opening the black box of deep neural networks via information, ArXiv abs/1703.00810. (2015) 436–44. doi:10.1038/nature14539. [2] I. Sutskever, O. Vinyals, Q. V. Le, Sequence to sequence learn- [21] P. Hohenecker, T. Lukasiewicz, Ontology reasoning with deep ing with neural networks, in: NIPS, 2014. neural networks, ArXiv abs/1808.07980. [3] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, [22] F. Wang, Backpropagation with continuation callbacks : Foun- A. Huang, A. Guez, T. Hubert, L. R. Baker, M. Lai, A. Bolton, dations for efficient and expressive differentiable programming, Y. Chen, T. P. Lillicrap, F. F. C. Hui, L. Sifre, G. van den Driess- NIPS’18, 2018. che, T. Graepel, D. Hassabis, Mastering the game of go without [23] L. Deng, Y. Liu, A Joint Introduction to Natural Language Pro- human knowledge, Nature 550 (2017) 354–359. cessing and to Deep Learning, Springer Singapore, Singapore, [4] G. Marcus, Deep learning: A critical appraisal, ArXiv 2018, pp. 1–22. abs/1801.00631. [24] Y. Goldberg, Neural network methods for natural lan- [5] I. Goodfellow, Y. Bengio, A. Courville, Deep Learning, MIT guage processing, Synthesis Lectures on Human Lan- guage Technologies 10 (2017) 1–309. Press, 2016, http://www.deeplearningbook.org. doi:10.2200/ [6] D. Bahdanau, K. Cho, Y. Bengio, Neural machine translation by S00762ED1V01Y201703HLT037. jointly learning to align and translate, ArXiv 1409. [25] A. G. Baydin, B. A. Pearlmutter, A. A. Radul, J. M. Siskind, [7] S. Sukhbaatar, A. Szlam, J. Weston, R. Fergus, End-to-end Automatic differentiation in machine learning: a survey, Journal memory networks, in: NIPS, 2015. of Machine Learning Research 18 (153) (2018) 1–43. 13 URL http://jmlr.org/papers/v18/17-468.html dynamical systems using neural networks, IEEE transactions on [26] F. Wang, X. Wu, G. M. Essertel, J. M. Decker, T. Rompf, De- neural networks 1 1 (1990) 4–27. mystifying differentiable programming: Shift/reset the penulti- [47] B. Chang, M. Chen, E. Haber, E. H. Chi, AntisymmetricRNN: mate backpropagator, ArXiv abs/1803.10228. A dynamical system view on recurrent neural networks, in: In- [27] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, ternational Conference on Learning Representations, 2019. Z. Lin, A. Desmaison, L. Antiga, A. Lerer, Automatic differen- URL https://openreview.net/forum?id=ryxepo0cFX tiation in , in: NIPS-W, 2017. [48] T. Miyoshi, H. Ichihashi, S. Okamoto, T. Hayakawa, Learning [28] F. Yang, Z. Yang, W. W. Cohen, Differentiable learning of chaotic dynamics in recurrent rbf network, 1995, pp. 588 – 593 logical rules for knowledge base reasoning (2017) 2316–2325. vol.1. doi:10.1109/ICNN.1995.488245. URL http://dl.acm.org/citation.cfm?id=3294771. [49] Y. Sato, S. Nagaya, Evolutionary algorithms that generate re- 3294992 current neural networks for learning chaos dynamics, in: Pro- [29] D. A. Hudson, C. D. Manning, Compositional attention net- ceedings of IEEE International Conference on Evolutionary works for machine reasoning, in: Proceedings of the Interna- Computation, 1996, pp. 144–149. doi:10.1109/ICEC.1996. tional Conference on Learning Representations (ICLR), 2018. 542350. [30] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. [50] E. Diaconescu, The use of narx neural networks to predict Gomez, L. Kaiser, I. Polosukhin, Attention is all you need, in: chaotic time series, WSEAS Transactions on Computer Re- NIPS, 2017. search 3. [31] A. Graves, G. Wayne, I. Danihelka, Neural turing machines, [51] M. Assaad, R. Bon, H. Cardot, Predicting chaotic time series by ArXiv abs/1410.5401. boosted recurrent neural networks, Vol. 4233, 2006, pp. 831– [32] S. Hochreiter, J. Schmidhuber, Long short-term memory, Neural 840. doi:10.1007/11893257\_92. computation 9 (1997) 1735–80. doi:10.1162/neco.1997. [52] Y. Bengio, P. Simard, P. Frasconi, Learning long-term depen- 9.8.1735. dencies with gradient descent is difficult, IEEE transactions on [33] K. Cho, B. van Merrienboer,¨ D. Bahdanau, Y. Bengio, On the neural networks / a publication of the IEEE Neural Networks properties of neural machine translation: Encoder–decoder ap- Council 5 (1994) 157–66. doi:10.1109/72.279181. proaches, in: Proceedings of SSST-8, Eighth Workshop on Syn- [53] K. Cho, B. van Merrinboer, C. Gulcehre, F. Bougares, tax, Semantics and Structure in Statistical Translation, Associa- H. Schwenk, Y. Bengio, Learning phrase representations us- tion for Computational Linguistics, Doha, Qatar, 2014, pp. 103– ing rnn encoder-decoder for statistical machine translationdoi: 111. doi:10.3115/v1/W14-4012. 10.3115/v1/D14-1179. URL https://www.aclweb.org/anthology/W14-4012 [54] Y. Qin, D. Song, H. Cheng, W. Cheng, G. Jiang, G. W. Cottrell, [34] A. Graves, N. Jaitly, A. rahman Mohamed, Hybrid speech A dual-stage attention-based recurrent neural network for time recognition with deep bidirectional lstm, 2013 IEEE Workshop series prediction, ArXiv abs/1704.02971. on Automatic and Understanding (2013) [55] T. Hollis, A. Viscardi, S. E. Yi, A comparison of lstms and at- 273–278. tention mechanisms for forecasting financial time series, ArXiv [35] G. Tang, M. Muller,¨ A. Rios, R. Sennrich, Why self-attention? a abs/1812.07699. targeted evaluation of neural machine translation architectures, [56] P. Vinayavekhin, S. Chaudhury, A. Munawar, D. J. Agravante, in: EMNLP, 2018. G. D. Magistris, D. Kimura, R. Tachibana, Focusing on what [36] A. Hernandez, J. M. Amigo,´ Multilayer adaptive networks in is relevant: Time-series learning and understanding using atten- neuronal processing, The European Physical Journal Special tion, 2018 24th International Conference on Topics 227 (2018) 1039–1049. (ICPR) (2018) 2624–2629. [37] T. Miconi, K. O. Stanley, J. Clune, Differentiable plastic- [57] S. Serrano, N. A. Smith, Is attention interpretable?, in: ACL, ity: training plastic neural networks with backpropagation, in: 2019. ICML, 2018. [58] Y.-Y. Chang, F.-Y. Sun, Y.-H. Wu, S. de Lin, A memory-network [38] G. Layek, An Introduction to Dynamical Systems and Chaos, based solution for multivariate time-series forecasting, ArXiv 2015. doi:10.1007/978-81-322-2556-0. abs/1809.02105. [39] L. Arnold, Random Dynamical Systems, 2003. [59] Y. Ming, D. Pelusi, C.-N. Fang, M. Prasad, Y.-K. Wang, D. Wu, [40] T. Jackson, A. Radunskaya, Applications of Dynamical Sys- C.-T. Lin, Eeg data analysis with stacked differentiable neu- tems in Biology and Medicine, Vol. 158, 2015. doi:10.1007/ ral computers, Neural Computing and Applications doi:10. 978-1-4939-2782-1. 1007/s00521-018-3879-1. [41] C. Gros, Complex and adaptive dynamical systems. A primer. [60] C. Rackauckas, M. Innes, Y. Ma, J. Bettencourt, L. White, 3rd ed, Vol. 1, 2008. doi:10.1063/1.3177233. V.Dixit, Diffeqflux.jl - a julia library for neural differential equa- [42] S. Pan, K. Duraisamy, Long-time predictive modeling of nonlin- tions, ArXiv abs/1902.02376. ear dynamical systems using neural networks, Complexity 2018 [61] M. Innes, A. Edelman, K. Fischer, C. Rackauckus, E. Saba, (2018) 4801012:1–4801012:26. V. Shah, W. Tebbutt, Zygote: A differentiable programming sys- [43] P. Dben, P. Bauer, Challenges and design choices for global tem to bridge machine learning and scientific computing, ArXiv weather and climate models based on machine learning, Geo- abs/1907.07587. scientific Model Development 11 (2018) 3999–4009. doi: [62] Y. Hu, L. Anderson, T.-M. Li, Q. Sun, N. Carr, J. Ragan-Kelley, 10.5194/gmd-11-3999-2018. F. Durand, Difftaichi: Differentiable programming for physical [44] K. Chakraborty, K. G. Mehrotra, C. K. Mohan, S. Ranka, Fore- simulation, ArXiv abs/1910.00935. casting the behavior of multivariate time series using neural net- [63] F. d. A. Belbute-Peres, K. A. Smith, K. R. Allen, J. B. works, Neural Networks 5 (1992) 961–970. Tenenbaum, J. Z. Kolter, End-to-end differentiable physics for [45] K. Yeo, I. Melnyk, Deep learning algorithm for data-driven sim- learning and control, in: Proceedings of the 32Nd Interna- ulation of noisy dynamical system, Journal of Computational tional Conference on Neural Information Processing Systems, Physics 376 (2019) 1212 – 1231. doi:https://doi.org/10. NIPS’18, Curran Associates Inc., USA, 2018, pp. 7178–7189. 1016/j.jcp.2018.10.024. URL http://dl.acm.org/citation.cfm?id=3327757. [46] K. S. Narendra, K. Parthasarathy, Identification and control of 3327820

14