Differentiable Programming and Its Applications to Dynamical Systems

Differentiable programming and its applications to dynamical systems Adrian´ Hernandez´ and Jose´ M. Amigo´∗ Centro de InvestigaciónOperativa, Universidad Miguel Hernández,Avenida de la Universidad s/n, 03202 Elche, Spain Abstract Differentiable programming is the combination of classical neural networks modules with algorithmic ones in an end- to-end differentiable model. These new models, that use automatic differentiation to calculate gradients, have new learning capabilities (reasoning, attention and memory). In this tutorial, aimed at researchers in nonlinear systems with prior knowledge of deep learning, we present this new programming paradigm, describe some of its new features such as attention mechanisms, and highlight the benefits they bring. Then, we analyse the uses and limitations of traditional deep learning models in the modeling and prediction of dynamical systems. Here, a dynamical system is meant to be a set of state variables that evolve in time under general internal and external interactions. Finally, we review the advantages and applications of differentiable programming to dynamical systems. Keywords: Deep learning, differentiable programming, dynamical systems, attention, recurrent neural networks 1. Introduction To keep our exposition concise, this tutorial is aimed at researchers in nonlinear systems with prior knowl- The increase in computing capabilities together with edge of deep learning; see [5] for an excellent intro- new deep learning models has led to great advances in duction to the concepts and methods of deep learning. several machine learning tasks [1, 2, 3]. Therefore, this tutorial focuses right away on the lim- Deep learning architectures such as Recurrent Neural itations of traditional deep learning and the advantages Networks (RNNs) and Convolutional Neural Networks of differential programming, with special attention to its (CNNs), as well as the use of distributed representa- application to dynamical systems. By a dynamical sys- tions in natural language processing, have allowed to tem we mean here and hereafter a set of state variables take into account the symmetries and the structure of that evolve in time under the influence of internal and the problem to be solved. possibly external inputs. However, a major criticism of deep learning remains, Examples of differentiable programming techniques namely, that it only performs perception, mapping in- that have been successfully developed in recent years puts to outputs [4]. include A new direction to more general and flexible mod- (i) attention mechanisms [6], which allow the model els is differentiable programming, that is, the combina- to automatically search and learn which parts of a tion of geometric modules (traditional neural networks) source sequence are relevant to predict the target ele- with more algorithmic modules in an end-to-end differ- ment, arXiv:1912.08168v2 [math.DS] 2 May 2020 entiable model. As a result, differentiable programming (ii) self-attention, is a dynamic computational graph composed of differ- (iii) end-to-end Memory Networks [7], and entiable functions that provides not only perception but also reasoning, attention and memory. To efficiently cal- (iv) Differentiable Neural Computers (DNCs) [8], culate derivatives, this approach uses automatic differ- which are neural networks (controllers) with an exter- entiation, an algorithmic technique similar to backprop- nal read-write memory. agation and implemented in modern software packages As expected, in recent years there has been a grow- such as PyTorch, Julia, etc. ing interest in applying deep learning techniques to dynamical systems. In this regard, RNNs and Long Short- Term Memories (LSTMs), specially designed for se- ∗Corresponding author quence modelling and temporal dependence, have been Preprint submitted to Physica D May 5, 2020 successful in various applications to dynamical systems prediction, game playing and more) [1, 2, 3]. Inter- such as model identification and time series prediction estingly, deep learning models and architectures have [9, 10, 11]. evolved to take into account the structure of the prob- The performance of theses models (e.g. encoder- lem to be resolved. decoder networks), however, degrades rapidly as the Deep learning is a part of machine learning that length of the input sequence increases and they are not is based on neural networks and uses multiple layers, able to capture the dynamic (i.e., time-changing) inter- where each layer extracts higher level features from the dependence between time steps. The combination of input. RNNs are a special class of neural networks neural networks with new differentiable modules could where outputs from previous steps are fed as inputs to overcome some of those problems and offer new oppor- the current step [13, 14]. This recurrence makes them tunities and applications. appropriate for modelling dynamic processes and sys- Among the potential applications of differentiable tems. programming to dynamical systems let us mention CNNs are neural networks that alternate convolu- (i) attention mechanisms to select the relevant time tional and pooling layers to implement translational in- steps and inputs, variance [15]. They learn spatial hierarchies of features (ii) memory networks to store historical data from dy- through backpropagation by using these building layers. namical systems and selectively use it for modelling and CNNs are being applied successfully to computer vision prediction, and and image processing [16]. (iii) the use of differentiable components in scientific Especially important is the use of distributed rep- computing. resentations as inputs to natural language processing Despite some achievements, more work is still needed pipelines. With this technique, the words of the vocab- to verify the benefits of these models over traditional ulary are mapped to an element of a vector space with a networks. much lower dimensionality [17, 18]. This word embed- Thanks to software libraries that facilitate auto- ding is able to keep, in the learned vector space, some matic differentiation, differentiable programming ex- of the syntactic and semantic relationships presented in tends deep learning models with new capabilities (rea- the original data. soning, memory, attention, etc.) and the models can be Let us recall that, in a feedforward neural network efficiently coded and implemented. (FNN) composed of multiple layers, the output (without In the following sections of this tutorial we introduce the bias term) at layer l, see Figure 1, is defined as differentiable programming and explain in detail why it is an extension of deep learning (Section 2). We de- xl+1 = σ(Wl xl); (1) scribe some models based on this new approach such as attention mechanisms (Section 3.1), memory networks Wl being the weight matrix at layer l. σ is the activation and differentiable neural computers (Section 3.2), and function and xl+1, the output vector at layer l and the continuous learning (Section 3.3). Then we review the input vector at layer l + 1. The weight matrices for the use of deep learning in dynamical systems and their lim- different layers are the parameters of the model. itations (Section 4.1). And, finally, we present the new Learning is the mechanism by which the parame- opportunities that differentiable programming can bring ters of a neural network are adapted to the environment to the modelling, simulation and prediction of dynami- in the training process. This is an optimization prob- cal systems (Section 4.2). The conclusions and outlook lem which has been addressed by using gradient-based are summarized in Section 5. methods, in which given a cost function f : Rn ! R, ∗ the algorithm finds local minima w = arg minw f (w) updating each layer parameter wi j with the rule wi j := 2. From deep learning to differentiable program- wi j − ηrwi j f (w), η > 0 being the learning rate. ming In addition to regarding neural networks as universal approximators, there is no sound theoretical explanation In recent years, we have seen major advances in the for a good performance of deep learning. Several theo- field of machine learning. The combination of deep retical frameworks have been proposed: neural networks with the computational capabilities of Graphics Processing Units (GPUs) [12] has improved (i) As pointed out in [19], the class of functions of the performance of several tasks (image recognition, practical interest can be approximated with expo- machine translation, language modelling, time series nentially fewer parameters than the generic ones. 2 (i) Programs are directed acyclic graphs. (ii) Graph nodes are mathematical functions or variables and the edges correspond to the flow of in- termediate values between the nodes. (iii) n is the number of nodes and l the number of input variables of the graph, with 1 ≤ l < n. vi for i 2 f1; :::; ng is the variable associated with node i. (iv) E is the set of edges in the graph. For each (i; j) 2 E we have i < j, therefore the graph is topologically ordered. (v) fi for i 2 f(l+1); :::; ng is the differentiable function computed by node i in the graph. αi for i 2 f(l + 1); :::; ng contains all input values for node i. (vi) The forward algorithm or pass, given input vari- Figure 1: Multilayer neural network. ables v1; :::; vl calculates vi = fi(αi) for i = f(l + 1); :::; ng. Symmetry, locality and compositionality proper- (vii) The graph is dynamically constructed and com- ties make it possible to have simpler neural net- posed of parametrizable functions that are differ- works. entiable and whose parameters are learned from (ii) From the point of view of information theory [20], data. an explanation has been put forward based on how Then, neural networks are just a class of these differ- much information each layer of the neural network entiable programs composed of classical blocks (feed- retains and how this information varies with the forward, recurrent neural networks, etc.) and new ones training and testing process. such as differentiable branching, attention, memories, Although deep learning can implicitly implement etc.

Differentiable Programming and Its Applications to Dynamical Systems

Training Autoencoders by Alternating Minimization

Deep Learning Based Computer Generated Face Identification Using

Matrix Calculus

Differentiability in Several Variables: Summary of Basic Concepts

Lipschitz Recurrent Neural Networks

Neural Networks

Jointly Clustering with K-Means and Learning Representations

Differentiability of the Convolution

The Simple Essence of Automatic Differentiation

Expressivity of Deep Neural Networks∗

Differentiable Programming for Piecewise Polynomial

GAN and What It Is?