Learning Equations for Extrapolation and Control
Subham S. Sahoo 1  Christoph H. Lampert 2  Georg Martius 3

arXiv:1806.07259v1 [cs.LG] 19 Jun 2018

Abstract

We present an approach to identify concise equations from data using a shallow neural network approach. In contrast to ordinary black-box regression, this approach allows understanding functional relations and generalizing them from observed data to unseen parts of the parameter space. We show how to extend the class of learnable equations for a recently proposed equation learning network to include divisions, and we improve the learning and model selection strategy to be useful for challenging real-world data. For systems governed by analytical expressions, our method can in many cases identify the true underlying equation and extrapolate to unseen domains. We demonstrate its effectiveness by experiments on a cart-pendulum system, where only 2 random rollouts are required to learn the forward dynamics and successfully achieve the swing-up task.

1. Introduction

In machine learning, models are typically treated as black-box function approximators that are only judged by their ability to predict correctly for unseen data (from the same distribution). In contrast, in the natural sciences, one searches for interpretable models that provide a deeper understanding of the system of interest and allow formulating hypotheses about unseen situations. The latter is only possible if the true underlying functional relationship behind the data has been identified. Therefore, when scientists construct models, they do not only minimize a training error but also impose constraints based on prior knowledge: models should be plausible, i.e. consist of components that have physical expressions in the real world, and they should be interpretable, which typically means that they consist of only a small number of interacting units.

Machine learning research has only very recently started to look into related techniques. As a first work, Martius & Lampert (2016) recently proposed EQL, a neural network architecture for identifying functional relations between observed inputs and outputs. Their network represents only plausible functions through a specific choice of activation functions, and it prefers simple over complex solutions by imposing sparsity regularization. However, EQL has two significant shortcomings: first, it is not able to represent divisions, which severely limits the physical systems it can be applied to, and second, its model selection procedure is unreliable in identifying the true functional relation out of multiple plausible candidates.

In this paper, we propose an improved network for equation learning, EQL÷, that overcomes the limitations of the earlier work. In particular, our main contributions are:
1. we propose a network architecture that can handle divisions, as well as techniques to keep training stable,
2. we improve model/instance selection to be more effective in identifying the right network/equation,
3. we demonstrate how to reliably control a dynamical robotic system by learning its forward dynamics equations from very few random trials.

The following section describes the equation learning method by Martius & Lampert (2016) and introduces our improvements. Afterwards, we discuss its relation to other prior work. In Section 4 we present experimental results on identifying equations, and in Section 5 we show its application to robot control. We close with a discussion and outlook.

2. Identifying equations with a network

We consider a regression problem, where the data originates from a system that can be described by an (unknown) analytical function, φ : R^n → R^m.
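As a toy illustration of this setup, the following sketch generates noisy observations from a known analytical function; the specific φ and noise level are illustrative choices, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def phi(x):
    # Hypothetical ground-truth system phi: R^2 -> R^1, chosen for
    # illustration from EQL-style building blocks (sin, cos, product).
    return np.sin(x[:, 0]) * x[:, 1] + np.cos(x[:, 1])

# Observed data {(x_1, y_1), ..., (x_N, y_N)}: inputs drawn from a
# restricted training region, targets y = phi(x) + xi with additive
# zero-mean noise xi.
X = rng.uniform(-1.0, 1.0, size=(1000, 2))
y = phi(X) + rng.normal(0.0, 0.01, size=1000)

# Extrapolation is judged on inputs outside the training region.
X_test = rng.uniform(-2.0, 2.0, size=(200, 2))
```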
A typical example could be a system of ordinary differential equations that describes the dynamics of a robot, or the predator-prey equations of an ecosystem. The observed data, {(x_1, y_1), ..., (x_N, y_N)}, is assumed to originate from y = φ(x) + ξ with additive zero-mean noise ξ. Since φ is unknown, we model the input-output relationship with a function ψ : R^n → R^m and aim to find an instance that minimizes the empirical error on the training set as well as on future data, potentially from a different part of the feature space. For example, we might want to learn the robot dynamics only in a part of the feature space where we know it is safe to operate, while later it should also be possible to make predictions for movements into unvisited areas.

1 Indian Institute of Technology, Kharagpur, India  2 IST Austria, Klosterneuburg, Austria  3 Max Planck Institute for Intelligent Systems, Tübingen, Germany. Correspondence to: Georg Martius <[email protected]>.

Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80, 2018. Copyright 2018 by the author(s).

Figure 1. Network architecture of the proposed improved Equation Learner EQL÷ for 3 layers (L = 3) and one neuron per type (u = 3, v = 1). The new division operations are placed in the final layer; see Martius & Lampert (2016) for the original model.

2.1. Equation Learner

Before introducing EQL÷, we first recapitulate the working principles of the previously proposed Equation Learner (EQL) network. It uses a multi-layer feed-forward network with units representing the building blocks of algebraic expressions. Instead of homogeneous hidden units, each unit has a specific function, e.g. identity, cosine or multiplication, see Fig. 1. Complex functions are implemented by alternating linear transformations, z^{(l)} = W^{(l)} y^{(l-1)} + w_o^{(l)} in layer l, with the application of the base functions. There are u unary functions f_1, ..., f_u, f_i ∈ {identity, sin, cos}, which receive the respective components z_1, ..., z_u. The v binary functions g_1, ..., g_v receive the remaining components z_{u+1}, ..., z_{u+2v} as input in pairs of two. In EQL these are multiplication units that compute the product of their two input values: g_j(a, b) := a · b. The outputs of the unary and binary units are concatenated to form the output y^{(l)} of layer l. The last layer computes the regression values by a linear read-out

  y^{(L)} := W^{(L)} y^{(L-1)} + w_o^{(L)}.    (1)

For a more detailed discussion of the architecture, see (Martius & Lampert, 2016).

2.2. Introducing division units

The EQL architecture has some immediate shortcomings. In particular, it cannot model divisions, which are, however, common in the equations governing physical systems. We therefore propose a new architecture, EQL÷, that includes division units, which calculate a/b. Note that this is a non-trivial step, because any division creates a pole at b → 0 with an abrupt change in convexity and a diverging function value and derivative. Such a divergence is a serious problem for gradient-based optimization methods.

To overcome the divergence problem, we first notice that from any real system we cannot encounter data at the pole itself, because natural quantities do not diverge. This implies that a single branch of the hyperbola 1/b with b > 0 suffices as a basis function. As a further simplification, we use divisions only in the output layer. Finally, in order to prevent problems during optimization, we introduce a curriculum approach for optimization, progressing from a strongly regularized version of division to the unregularized one.

Regularized Division: Instead of EQL's Eq. (1), the last layer of the EQL÷ is

  y^{(L)} := ( h^θ(z_1^{(L)}, z_2^{(L)}), ..., h^θ(z_{2m-1}^{(L)}, z_{2m}^{(L)}) ),    (2)

where h^θ(a, b) is the division-activation function given by

  h^θ(a, b) := { a/b  if b > θ
               { 0    otherwise,    (3)

where θ ≥ 0 is a threshold, see Fig. 2. Note that using h^θ = 0 as the value when the denominator is below θ (forbidden values of b) sets the gradient to zero, avoiding misleading parameter updates. So the discontinuity plays no role in practice.

Penalty term: To steer the network away from negative values of the denominator, we add a cost term to our objective that penalizes "forbidden" inputs to each division unit:

  p^θ(b) := max(θ - b, 0),    (4)

where θ is the threshold used in Eq. (3) and b is the denominator, see Fig. 2. The global penalty term is then

  P^θ = Σ_{i=1}^{N} Σ_{j=1}^{n} p^θ(z_{2j}^{(L)}(x_i)),    (5)

where z_{2j}^{(L)}(x_i) is the denominator of division unit j for input x_i, see Eq. (2).

Figure 2. Regularized division function h^θ(a, b) (shown for θ = 0.1 and θ = 0.5) and the associated penalty term p^θ(b). The penalty increases linearly for values b < θ outside the desired input range.

The objective is Lasso-like (Tibshirani, 1996),

  L = (1/N) Σ_{i=1}^{N} ||ψ(x_i) - y_i||^2 + λ Σ_{l=1}^{L} ||W^{(l)}||_1 + P^θ,    (8)

that is, a linear combination of L2 loss and L1 regularization, extended by the penalty term for small and negative denominators, see Eq. (4). Note that P_bound (Eq. 6) is only used in the penalty epochs. For training, we apply a stochastic gradient descent algorithm with mini-batches and Adam (Kingma & Ba, 2015) for calculating the updates. The choice of Adam is not critical, as we observed that standard stochastic gradient descent also works, though it might take longer.
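A compact NumPy sketch of Eqs. (3)-(5) and the objective of Eq. (8); the network ψ itself is abstracted away as a given prediction array, and the threshold value is an illustrative choice:

```python
import numpy as np

def h_div(a, b, theta):
    # Regularized division activation, Eq. (3): a/b where b > theta,
    # 0 otherwise. The inner where guards the actual division so no
    # warning is raised for forbidden denominators.
    return np.where(b > theta, a / np.where(b > theta, b, 1.0), 0.0)

def p_theta(b, theta):
    # Penalty for "forbidden" denominators, Eq. (4): grows linearly
    # as b falls below theta, zero above it.
    return np.maximum(theta - b, 0.0)

def objective(pred, y, weights, denoms, lam, theta):
    # Lasso-like objective, Eq. (8): mean L2 loss, L1 sparsity on the
    # weight matrices, and the summed denominator penalty P^theta (Eq. 5).
    l2 = np.mean(np.sum((pred - y) ** 2, axis=1))
    l1 = lam * sum(np.abs(W).sum() for W in weights)
    P = sum(p_theta(b, theta).sum() for b in denoms)
    return l2 + l1 + P
```

For instance, with θ = 0.1, h_div yields 1.0/0.5 = 2.0 for a valid denominator, and 0 (with zero gradient) once the denominator falls below the threshold.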
Regularization Phases: We follow the same regularization scheme as proposed in Martius & Lampert (2016).

Penalty Epochs: While Eq. (5) prevents negative values in