
A Differential-form Pullback Programming Language for Higher-order Reverse-mode Automatic Differentiation

Carol Mak and Luke Ong

arXiv:2002.08241v1 [cs.PL] 19 Feb 2020

Abstract

Building on the observation that reverse-mode automatic differentiation (AD) — a generalisation of backpropagation — can naturally be expressed as pullbacks of differential 1-forms, we design a simple higher-order programming language with a first-class differential operator, and present a reduction strategy which exactly simulates reverse-mode AD. We justify our reduction strategy by interpreting our language in any differential λ-category that satisfies the Hahn-Banach Separation Theorem, and show that the reduction strategy precisely captures reverse-mode AD in a truly higher-order setting.

1 Introduction

Automatic differentiation (AD) [34] is widely considered the most efficient and accurate algorithm for computing derivatives, thanks largely to the chain rule. There are two modes of AD:

• Forward-mode AD evaluates the chain rule from inputs to outputs; it has time complexity that scales with the number of inputs, and constant space complexity.
• Reverse-mode AD — a generalisation of backpropagation — evaluates the chain rule (in dual form) from outputs to inputs; it has time complexity that scales with the number of outputs, and space complexity that scales with the number of intermediate variables.

In machine learning applications such as neural networks, the number of input parameters is usually considerably larger than the number of outputs. For this reason, reverse-mode AD has been the preferred method of differentiation, especially in deep learning applications. (See Baydin et al. [5] for an excellent survey of AD.)

The only downside of reverse-mode AD is its rather involved definition, which has led to a variety of complicated implementations in neural networks. On the one hand, TensorFlow [1] and Theano [3] employ the define-and-run approach, where the model is constructed as a computational graph before execution. On the other hand, PyTorch [25] and Autograd [20] employ the define-by-run approach, where the computational graph is constructed dynamically during the execution.

Can we replace the traditional graphical representation of reverse-mode AD by a simple yet expressive framework? Indeed, there have been calls from the neural network community for the development of differentiable programming [14, 19, 24], based on a higher-order functional language with a built-in differential operator that returns the derivative of a given program via reverse-mode AD. Such a language would free the programmer from implementational details of differentiation. Programmers would be able to concentrate on the construction of machine learning models, and train them by calling the built-in differential operator on the cost function of their models. The goal of this work is to present a simple higher-order programming language with an explicit differential operator, such that its reduction semantics is exactly reverse-mode AD, in a truly higher-order manner.

The syntax of our language is inspired by Ehrhard and Regnier [15]'s differential λ-calculus, which is an extension of the simply-typed λ-calculus with a differential operator that mimics standard symbolic differentiation (but not reverse-mode AD). Their definition of differentiation via a linear substitution provides a good foundation for our language.

The reduction strategy of our language uses differential λ-category [11] (the model of the differential λ-calculus) as a guide. Differential λ-category is a Cartesian closed differential category [9], and hence enjoys the fundamental properties of derivatives and behaves well with exponentials (curry).

Contributions. Our starting point (Section 2.2) is the observation that the computation of reverse-mode AD can naturally be expressed as a transformation of pullbacks of differential 1-forms. We argue that this viewpoint is essential for understanding reverse-mode AD in a functional setting. Standard reverse-mode AD (as presented in [4, 5]) is only defined in Euclidean spaces.

We present (in Section 3) a simple higher-order programming language, extending the simply-typed λ-calculus [12] with an explicit differential operator called the pullback, (Ω λx.P) · S, which serves as a reverse-mode AD simulator. Using differential λ-category [11] as a guide, we design a reduction strategy for our language so that the reduction of the application (Ω λx.P) · (λx.e_p^*) S mimics reverse-mode AD in computing the p-th row of the Jacobian matrix (derivative) of the function λx.P at the point S, where e_p is the column vector with 1 at the p-th position and 0 everywhere else. Moreover, we show how our reduction semantics can be adapted to a continuation passing style evaluation (Section 3.5).

Owing to the higher-order nature of our language, standard differential calculus is not enough to model our language and hence cannot justify our reductions. Our final contribution (in Section 4) is to show that any differential λ-category [11] that satisfies the Hahn-Banach Separation Theorem is a model of our language (Theorem 4.6). Our reduction semantics is faithful to reverse-mode AD, in that it is exactly reverse-mode AD when restricted to first-order; moreover, we can perform reverse-mode AD on any higher-order abstraction, which may contain higher-order terms, duals, pullbacks, and free variables as subterms (Corollary 4.8).

Finally, we discuss related work in Section 5, and conclusions and future directions in Section 6. Throughout this paper, we will point to the attached Appendix for additional content. All proofs are in Appendix E, unless stated otherwise.
2 Reverse-mode Automatic Differentiation

We introduce forward- and reverse-mode automatic differentiation (AD), highlighting their respective benefits in practice. Then we explain how reverse-mode AD can naturally be expressed as the pullback of differential 1-forms. (The examples used to illustrate these methods are collated in Figure 4.)

2.1 Forward- and Reverse-mode AD

Recall that the Jacobian matrix of a smooth real-valued function f : R^n → R^m at x_0 ∈ R^n is

$$
J(f)(x_0) :=
\begin{pmatrix}
\left.\frac{\partial f_1}{\partial z_1}\right|_{x_0} & \left.\frac{\partial f_1}{\partial z_2}\right|_{x_0} & \cdots & \left.\frac{\partial f_1}{\partial z_n}\right|_{x_0} \\
\left.\frac{\partial f_2}{\partial z_1}\right|_{x_0} & \left.\frac{\partial f_2}{\partial z_2}\right|_{x_0} & \cdots & \left.\frac{\partial f_2}{\partial z_n}\right|_{x_0} \\
\vdots & \vdots & \ddots & \vdots \\
\left.\frac{\partial f_m}{\partial z_1}\right|_{x_0} & \left.\frac{\partial f_m}{\partial z_2}\right|_{x_0} & \cdots & \left.\frac{\partial f_m}{\partial z_n}\right|_{x_0}
\end{pmatrix}
$$

where f_j := π_j ∘ f : R^n → R. We call the function J : C^∞(R^n, R^m) → C^∞(R^n, L(R^n, R^m)) the Jacobian; J(f) the Jacobian of f; J(f)(x) the Jacobian of f at x; J(f)(x)(v) the Jacobian of f at x along v ∈ R^n; and λx.J(f)(x)(v) the Jacobian of f along v. (Here C^∞(A, B) is the set of all smooth functions from A to B, and L(A, B) is the set of all linear functions from A to B, for Euclidean spaces A and B.)

Symbolic Differentiation. Derivatives are standardly computed using symbolic differentiation: first compute ∂f_j/∂z_i for all i, j using rules (e.g. the product and chain rules), then substitute x_0 for z to obtain J(f)(x_0).

For example, to compute the Jacobian of f : ⟨x, y⟩ ↦ (x + 1)²(2x + y²)² at ⟨1, 3⟩ by symbolic differentiation, first compute ∂f/∂x = 2(x + 1)(2x + y²)(2x + y² + 2(x + 1)) and ∂f/∂y = 2(x + 1)(2x + y²)(2y(x + 1)). Then, substitute 1 for x and 3 for y to obtain J(f)(⟨1, 3⟩) = (660  528).

Symbolic differentiation is accurate but inefficient. Notice that the term (x + 1) appears twice in ∂f/∂x, and so (1 + 1) is evaluated twice when computing ∂f/∂x at ⟨1, 3⟩ (because for h : ⟨x, y⟩ ↦ (x + 1)(2x + y²), both h(⟨x, y⟩) and ∂h/∂x contain the term (x + 1), and the product rule tells us to calculate them separately). This duplication is a cause of the so-called expression swell problem, resulting in exponential time complexity.
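For readers who want to check these numbers, the following Python sketch performs exactly this compute-then-substitute procedure for f. SymPy is our choice of tool for the illustration and the variable names are ours; neither is part of the paper.

```python
# A minimal sketch of symbolic differentiation followed by substitution,
# for f(x, y) = (x + 1)^2 (2x + y^2)^2.  SymPy is our choice of tool here,
# not something used in the paper.
import sympy as sp

x, y = sp.symbols('x y')
f = (x + 1)**2 * (2*x + y**2)**2

# Build the full symbolic partial derivatives first (this is where
# expression swell arises: subexpressions such as (x + 1) are duplicated).
df_dx = sp.diff(f, x)
df_dy = sp.diff(f, y)

# Only now substitute the point <1, 3> to obtain J(f)(<1, 3>).
point = {x: 1, y: 3}
print([df_dx.subs(point), df_dy.subs(point)])  # [660, 528]
```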
Automatic Differentiation. Automatic differentiation (AD) avoids this problem by a simple divide-and-conquer approach: first arrange f as a composite of elementary functions g_1, ..., g_k (i.e. f = g_k ∘ ··· ∘ g_1), where "elementary" means easily differentiable; then compute the Jacobian of each of these elementary functions; and finally combine them via the chain rule to yield the desired Jacobian of f.

Forward-mode AD. Recall the chain rule:

$$
J(f)(x_0) = J(g_k)(x_{k-1}) \times \cdots \times J(g_2)(x_1) \times J(g_1)(x_0)
$$

for f = g_k ∘ ··· ∘ g_1, where x_i := g_i(x_{i−1}). Forward-mode AD computes the Jacobian matrix J(f)(x_0) by calculating α_i := J(g_i)(x_{i−1}) × α_{i−1} and x_i := g_i(x_{i−1}), with α_0 := I (the identity matrix) and the given x_0. Then α_k = J(f)(x_0) is the Jacobian of f at x_0. This computation can neatly be presented as an iteration of the ⟨·|·⟩-reduction

$$
\langle x \mid \alpha \rangle \;\xrightarrow{\;g\;}\; \langle g(x) \mid J(g)(x) \times \alpha \rangle
$$

for g = g_1, ..., g_k, starting from the pair ⟨x_0 | I⟩. Besides being easy to implement, forward-mode AD computes the new pair from the current pair ⟨x | α⟩, requiring no additional memory.

To compute the Jacobian of f : ⟨x, y⟩ ↦ (x + 1)²(2x + y²)² at ⟨1, 3⟩ by forward-mode AD, first decompose f into elementary functions as $R^2 \xrightarrow{\;g\;} R^2 \xrightarrow{\;*\;} R \xrightarrow{\;(-)^2\;} R$, where g(⟨x, y⟩) := ⟨x + 1, 2x + y²⟩. Then, starting from ⟨⟨1, 3⟩ | I⟩, iterate the ⟨·|·⟩-reduction:

$$
\langle \langle 1, 3 \rangle \mid I \rangle
\;\xrightarrow{\;g\;}\;
\Big\langle \langle 2, 11 \rangle \;\Big|\; \begin{pmatrix} 1 & 0 \\ 2 & 6 \end{pmatrix} \Big\rangle
\;\xrightarrow{\;*\;}\;
\big\langle 22 \;\big|\; \begin{pmatrix} 15 & 12 \end{pmatrix} \big\rangle
\;\xrightarrow{\;(-)^2\;}\;
\big\langle 484 \;\big|\; \begin{pmatrix} 660 & 528 \end{pmatrix} \big\rangle
$$

giving J(f)(⟨1, 3⟩) = (660  528), in agreement with the symbolic computation above (a code sketch of this iteration is given below).

Whenever this is the case (i.e. when the number of outputs is smaller than the number of inputs), reverse-mode AD is more efficient than forward-mode.
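The ⟨·|·⟩-iteration above can be transcribed directly into code. The sketch below uses plain Python with NumPy; the decomposition into g, multiplication and squaring follows the running example, while the helper names (g1, J_g1, and so on) are our own and not from the paper.

```python
# A minimal sketch of the forward-mode <x | alpha> iteration for the running
# example f(x, y) = (x + 1)^2 (2x + y^2)^2.  NumPy and the helper names are
# our own choices, not part of the paper.
import numpy as np

def g1(v):                      # g: <x, y> |-> <x + 1, 2x + y^2>
    x, y = v
    return np.array([x + 1.0, 2.0 * x + y ** 2])

def J_g1(v):                    # J(g)(<x, y>)
    _, y = v
    return np.array([[1.0, 0.0],
                     [2.0, 2.0 * y]])

def g2(v):                      # *: <a, b> |-> a * b
    a, b = v
    return np.array([a * b])

def J_g2(v):                    # J(*)(<a, b>) = (b  a)
    a, b = v
    return np.array([[b, a]])

def g3(v):                      # (-)^2: c |-> c^2
    return np.array([v[0] ** 2])

def J_g3(v):                    # J((-)^2)(c) = (2c)
    return np.array([[2.0 * v[0]]])

# Iterate <x | alpha> -> <g(x) | J(g)(x) x alpha>, starting from <x0 | I>.
x, alpha = np.array([1.0, 3.0]), np.eye(2)
for g, J_g in [(g1, J_g1), (g2, J_g2), (g3, J_g3)]:
    # The right-hand side is evaluated before the assignment, so J_g sees
    # the current x, exactly as in the reduction rule.
    x, alpha = g(x), J_g(x) @ alpha

print(x, alpha)  # [484.] [[660. 528.]]
```

Note that each step only needs the current pair ⟨x | α⟩, matching the constant-memory behaviour of forward-mode AD described above.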