
arXiv:2002.08241v1 [cs.PL] 19 Feb 2020

A Differential-form Pullback Programming Language for Higher-order Reverse-mode Automatic Differentiation

Carol Mak and Luke Ong

Abstract. Building on the observation that reverse-mode automatic differentiation (AD), a generalisation of backpropagation, can naturally be expressed as pullbacks of differential 1-forms, we design a simple higher-order programming language with a first-class differential operator, and present a reduction strategy which exactly simulates reverse-mode AD. We justify our reduction strategy by interpreting our language in any differential λ-category that satisfies the Hahn-Banach Separation Theorem, and show that the reduction strategy precisely captures reverse-mode AD in a truly higher-order setting.

1 Introduction

Automatic differentiation (AD) [34] is widely considered the most efficient and accurate algorithm for computing derivatives, thanks largely to the chain rule. There are two modes of AD:

• Forward-mode AD evaluates the chain rule from inputs to outputs; it has time complexity that scales with the number of inputs, and constant space complexity.
• Reverse-mode AD, a generalisation of backpropagation, evaluates the chain rule (in dual form) from outputs to inputs; it has time complexity that scales with the number of outputs, and space complexity that scales with the number of intermediate variables.

In machine learning applications such as neural networks, the number of input parameters is usually considerably larger than the number of outputs. For this reason, reverse-mode AD has been the preferred method of differentiation, especially in deep learning applications. (See Baydin et al. [5] for an excellent survey of AD.)

The only downside of reverse-mode AD is its rather involved definition, which has led to a variety of complicated implementations in neural networks. On the one hand, TensorFlow [1] and Theano [3] employ the define-and-run approach, where the model is constructed as a computational graph before execution. On the other hand, PyTorch [25] and Autograd [20] employ the define-by-run approach, where the computational graph is constructed dynamically during the execution.

Can we replace the traditional graphical representation by a simple yet expressive framework? Indeed, there have been calls from the neural network community [14, 19, 24] for the development of differentiable programming, based on a higher-order functional language with a built-in differential operator that returns the derivative of a given program of the language via reverse-mode AD. Such a language would free the programmer from the implementational details of differentiation. Programmers would be able to concentrate on the construction of their machine learning models, and train them by calling the built-in differential operator on the cost models.

The goal of this work is to present a simple higher-order programming language with an explicit differential operator, such that its reduction semantics is exactly reverse-mode AD, in a truly higher-order manner.

The syntax of our language is inspired by Ehrhard and Regnier [15]'s differential λ-calculus, which is an extension of the simply-typed λ-calculus with a differential operator that mimics standard symbolic differentiation (but not reverse-mode AD). Their definition of differentiation via a linear substitution provides a good foundation for our language.

Our starting point (Section 2.2) is the observation that the computation of reverse-mode AD can naturally be expressed as pullbacks of differential 1-forms. We argue that this viewpoint is essential for understanding reverse-mode AD in a functional setting. Standard reverse-mode AD (as presented in [4, 5]) is only defined in Euclidean spaces.

The reduction strategy of our language uses differential λ-category [11] (the model of differential λ-calculus) as a guide. Differential λ-category is a Cartesian closed differential category [9], and hence enjoys the fundamental properties of derivatives, and behaves well with exponentials (curry).

Contributions. We present (in Section 3) a simple higher-order programming language, extending the simply-typed λ-calculus [12] with an explicit differential operator called the pullback, (Ω λx.P) · S, which serves as a reverse-mode AD simulator.

Using differential λ-category [11] as a guide, we design a reduction strategy for our language so that the reduction of the application ((Ω λx.P) · (λx.ep*)) S mimics reverse-mode AD in computing the p-th row of the Jacobian matrix (derivative) of the function λx.P at the point S, where ep is the column vector with 1 at the p-th position and 0 everywhere else. Moreover, we show how our reduction semantics can be adapted to a continuation passing style evaluation (Section 3.5).

Owing to the higher-order nature of our language, standard differential calculus is not enough to model our language and hence cannot justify our reductions. Our final contribution (in Section 4) is to show that any differential λ-category [11] that satisfies the Hahn-Banach Separation Theorem is a model of our language (Theorem 4.6). Our reduction semantics is faithful to reverse-mode AD, in that it is exactly reverse-mode AD when restricted to first-order; moreover we can perform reverse-mode AD on any higher-order abstraction, which may contain higher-order terms, duals, pullbacks, and free variables as subterms (Corollary 4.8).

Finally, we discuss related work in Section 5, and conclusion and future directions in Section 6.

Throughout this paper, we will point to the attached Appendix for additional content. All proofs are in Appendix E, unless stated otherwise.

2 Reverse-mode Automatic Differentiation

We introduce forward- and reverse-mode automatic differentiation (AD), highlighting their respective benefits in practice. Then we explain how reverse-mode AD can naturally be expressed as the pullback of differential 1-forms. (The examples used to illustrate these methods are collated in Figure 4.)

2.1 Forward- and Reverse-mode AD

Recall that the Jacobian matrix of a smooth real-valued function f : Rⁿ → Rᵐ at x0 ∈ Rⁿ is the m-by-n matrix

    J(f)(x0) := ( ∂f1/∂z1|x0  ∂f1/∂z2|x0  ···  ∂f1/∂zn|x0
                  ∂f2/∂z1|x0  ∂f2/∂z2|x0  ···  ∂f2/∂zn|x0
                      ⋮           ⋮        ⋱       ⋮
                  ∂fm/∂z1|x0  ∂fm/∂z2|x0  ···  ∂fm/∂zn|x0 )

where fj := πj ∘ f : Rⁿ → R. We call the function J : C∞(Rⁿ, Rᵐ) → C∞(Rⁿ, L(Rⁿ, Rᵐ)) the Jacobian;¹ J(f) the Jacobian of f; J(f)(x) the Jacobian of f at x; J(f)(x)(v) the Jacobian of f at x along v ∈ Rⁿ; and λx.J(f)(x)(v) the Jacobian of f along v.

¹C∞(A, B) is the set of all smooth functions from A to B, and L(A, B) is the set of all linear functions from A to B, for Euclidean spaces A and B.

Symbolic Differentiation. Numerical derivatives are standardly computed using symbolic differentiation: first compute ∂fj/∂zi for all i, j using rules (e.g. product and chain rules), then substitute x0 for z to obtain J(f)(x0).

For example, to compute the Jacobian of f : ⟨x, y⟩ ↦ ((x+1)(2x+y²))² at ⟨1, 3⟩ by symbolic differentiation, first compute ∂f/∂x = 2(x+1)(2x+y²)(2x+y²+2(x+1)) and ∂f/∂y = 2(x+1)(2x+y²)(2y(x+1)). Then substitute 1 for x and 3 for y to obtain J(f)(⟨1, 3⟩) = [660 528].

Symbolic differentiation is accurate but inefficient. Notice that the term (x+1) appears twice in ∂f/∂x, and (1+1) is evaluated twice in ∂f/∂x (because for h : ⟨x,y⟩ ↦ (x+1)(2x+y²), both h(⟨x,y⟩) and ∂h/∂x contain the term (x+1), and the product rule tells us to calculate them separately). This duplication is a cause of the so-called expression swell problem, resulting in exponential time-complexity.

Automatic Differentiation. Automatic differentiation (AD) avoids this problem by a simple divide-and-conquer approach: first arrange f as a composite of elementary² functions g1, ..., gk (i.e. f = gk ∘ ··· ∘ g1), then compute the Jacobian of each of these elementary functions, and finally combine them via the chain rule to yield the desired Jacobian of f.

²in the sense of being easily differentiable

Forward-mode AD. Recall the chain rule:

    J(f)(x0) = J(gk)(x_{k−1}) × ··· × J(g2)(x1) × J(g1)(x0)

for f = gk ∘ ··· ∘ g1, where xi := gi(x_{i−1}). Forward-mode AD computes the Jacobian matrix J(f)(x0) by calculating αi := J(gi)(x_{i−1}) × α_{i−1} and xi := gi(x_{i−1}), with α0 := I (identity matrix) and x0. Then αk = J(f)(x0) is the Jacobian of f at x0. This computation can neatly be presented as an iteration of the ⟨·|·⟩-reduction, ⟨x | α⟩ →g ⟨g(x) | J(g)(x) × α⟩, for g = g1, ..., gk, starting from the pair ⟨x0 | I⟩. Besides being easy to implement, forward-mode AD computes the new pair from the current pair ⟨x | α⟩, requiring no additional memory.
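To make the ⟨x | α⟩ iteration concrete, here is a minimal sketch in Python/NumPy (our own illustration, not the paper's language), run on the example below using its decomposition f = (−)² ∘ (∗) ∘ g with g(⟨x,y⟩) = ⟨x+1, 2x+y²⟩:

    import numpy as np

    # Each stage is a pair (g, Jg) of an elementary function and its Jacobian.
    steps = [
        (lambda v: np.array([v[0] + 1, 2 * v[0] + v[1] ** 2]),   # g
         lambda v: np.array([[1.0, 0.0], [2.0, 2 * v[1]]])),     # J(g)
        (lambda v: np.array([v[0] * v[1]]),                      # mult
         lambda v: np.array([[v[1], v[0]]])),                    # J(mult)
        (lambda v: np.array([v[0] ** 2]),                        # pow2
         lambda v: np.array([[2 * v[0]]])),                      # J(pow2)
    ]

    x, alpha = np.array([1.0, 3.0]), np.eye(2)   # start from <x0 | I>
    for g, Jg in steps:
        # <x | alpha> -> <g(x) | J(g)(x) x alpha>; both g and Jg see the old x
        x, alpha = g(x), Jg(x) @ alpha

    print(x, alpha)   # [484.]  [[660. 528.]], i.e. J(f)(<1,3>)

Starting from α0 = I, the loop yields α = [660 528], matching the Jacobian computed symbolically above.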

To compute the Jacobian of f : ⟨x,y⟩ ↦ ((x+1)(2x+y²))² at ⟨1,3⟩ by forward-mode AD, first decompose f into elementary functions as R² →g R² →∗ R →(−)² R, where g(⟨x,y⟩) := ⟨x+1, 2x+y²⟩. Then, starting from ⟨⟨1,3⟩ | I⟩, iterate the ⟨·|·⟩-reduction

    ⟨⟨1,3⟩ | (1 0; 0 1)⟩ →g ⟨⟨2,11⟩ | (1 0; 2 6)⟩ →∗ ⟨22 | (15 12)⟩ →(−)² ⟨484 | (660 528)⟩

yielding [660 528] as the Jacobian of f at ⟨1,3⟩. Notice that (1+1) is only evaluated once, even though its result is used in various calculations.

In practice, because storing the intermediate matrices αi can be expensive, the matrix J(f)(x0) is computed column-by-column, by simply changing the starting pair from ⟨x0 | I⟩ to ⟨x0 | ep⟩, where ep ∈ Rⁿ is the column vector with 1 at the p-th position and 0 everywhere else. Then the computation becomes a reduction of a vector-vector pair, and αk = J(f)(x0) × ep is the p-th column of the Jacobian matrix J(f)(x0). Since J(f)(x0) is an m-by-n matrix, n runs are required to compute the whole Jacobian matrix.

For example, if we start from ⟨⟨1,3⟩ | (1, 0)ᵀ⟩, the reduction

    ⟨⟨1,3⟩ | (1, 0)ᵀ⟩ →g ⟨⟨2,11⟩ | (1, 2)ᵀ⟩ →∗ ⟨22 | 15⟩ →(−)² ⟨484 | 660⟩

gives us the first column of the Jacobian matrix J(f)(⟨1,3⟩).

Reverse-mode AD. By contrast, reverse-mode AD computes the dual of the Jacobian matrix, (J(f)(x0))*, using the chain rule in dual (transpose) form

    (J(f)(x0))* = (J(g1)(x0))* × ··· × (J(gk)(x_{k−1}))*

as follows: first compute xi := gi(x_{i−1}) for i = 1, ..., k−1 (Forward Phase); then compute βi := (J(gi)(x_{i−1}))* × β_{i+1} for i = k, ..., 1, with β_{k+1} := I (Reverse Phase).

For example, the reverse-mode AD computation on f is as follows.

    Forward Phase: ⟨1,3⟩ →g ⟨2,11⟩ →∗ 22 →(−)² 484
    Reverse Phase: (660, 528)ᵀ ←g (484, 88)ᵀ ←∗ 44 ←(−)² I

In practice, like forward-mode AD, the matrix (J(f)(x0))* is computed column-by-column, by simply setting β_{k+1} := πp, where πp ∈ L(Rᵐ, R) is the p-th projection. Thus a run (comprising Forward and Reverse Phase) computes (J(f)(x0))*(πp), the p-th row of the Jacobian of f at x0. It follows that m runs are required to compute the m-by-n Jacobian matrix.

In many machine learning (e.g. deep learning) problems, the functions f : Rⁿ → Rᵐ we need to differentiate have many more inputs than outputs, in the sense that n ≫ m. Whenever this is the case, reverse-mode AD is more efficient than forward-mode.

Remark 2.1. Unlike forward-mode AD, we cannot interleave the iteration of xi and the computation of βi. In fact, according to Hoffmann [18], nobody knows how to do reverse-mode AD using pairs ⟨·|·⟩, as employed by forward-mode AD to great effect. In other words, reverse-mode AD does not seem presentable as an in-place algorithm.

2.2 Geometric Perspective of Reverse-mode AD

Reverse-mode AD can naturally be expressed using pullbacks and differential 1-forms, as alluded to by Betancourt [7] and discussed in [26].

Let E := Rⁿ and F := Rᵐ. A differential 1-form of E is a smooth map ω ∈ C∞(E, L(E, R)). Denote the set of all differential 1-forms of E as ΩE, e.g. λx.πp ∈ Ω Rᵐ. (Henceforth, by 1-form, we mean differential 1-form.) The pullback of a 1-form ω ∈ ΩF along a smooth map f : E → F is the 1-form Ω(f)(ω) ∈ ΩE where

    Ω(f)(ω) : E → L(E, R),  x ↦ (J(f)(x))*(ω(f x)).

Notice the result of an iteration of reverse-mode AD, (J(f)(x0))*(πp), can be expressed as Ω(f)(λx.πp)(x0), which can be expanded to (Ω(g1) ∘ ··· ∘ Ω(gk))(λx.πp)(x0). Hence reverse-mode AD can be expressed as: first iterate the reduction of 1-forms, ω →g Ω(g)(ω), for g = gk, ..., g1, starting from the 1-form λx.πp; then compute ω0(x0), which yields the p-th row of J(f)(x0).

Returning to our example:

    Ω(f)(λx.1*)(⟨1,3⟩)
    = (Ω(g) ∘ Ω(∗) ∘ Ω((−)²))(λx.1*)(⟨1,3⟩)
    = (J(g)(⟨1,3⟩))* ((Ω(∗) ∘ Ω((−)²))(λx.1*)(⟨2,11⟩))
    = (J(g)(⟨1,3⟩))* ((J(∗)(⟨2,11⟩))* (Ω((−)²)(λx.1*)(22)))
    = (J(g)(⟨1,3⟩))* ((J(∗)(⟨2,11⟩))* ((J((−)²)(22))* ((λx.1*)(484))))
    = (J(g)(⟨1,3⟩))* ((J(∗)(⟨2,11⟩))* (44))
    = (J(g)(⟨1,3⟩))* ((484, 88))
    = (660, 528)

which is the Jacobian J(f)(⟨1,3⟩).

The pullback-of-1-forms perspective gives us a way to perform reverse-mode AD beyond Euclidean spaces (for example on the function sum : List(R) → R, which returns the sum of the elements of a list); and it shapes our language and reduction presented in Section 3. (Example 3.2 shows how sum can be defined in our language and Appendix A.2 shows how reverse-mode AD can be performed on sum.)
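The iterated-pullback reading can likewise be sketched in Python (ours, not the paper's language). Note that this naive functional reading recomputes the forward values at every layer, whereas actual reverse-mode AD caches the Forward Phase (cf. Remark 2.1):

    import numpy as np

    # A 1-form maps a point to a row covector; the pullback along g is
    # Omega(g)(omega) = lambda x: omega(g(x)) @ J(g)(x).
    steps = [  # (g, J(g)) for f = pow2 . mult . g, as above
        (lambda v: np.array([v[0] + 1, 2 * v[0] + v[1] ** 2]),
         lambda v: np.array([[1.0, 0.0], [2.0, 2 * v[1]]])),
        (lambda v: np.array([v[0] * v[1]]),
         lambda v: np.array([[v[1], v[0]]])),
        (lambda v: np.array([v[0] ** 2]),
         lambda v: np.array([[2 * v[0]]])),
    ]

    def pullback(g, Jg, omega):
        return lambda x: omega(g(x)) @ Jg(x)

    omega = lambda x: np.array([[1.0]])    # the constant 1-form x |-> pi_1
    for g, Jg in reversed(steps):          # builds Omega(g1)(... Omega(gk)(omega))
        omega = pullback(g, Jg, omega)

    print(omega(np.array([1.0, 3.0])))     # [[660. 528.]], i.e. (J(f)(<1,3>))*(pi_1)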


Remark 2.2. Pullbacks can be generalised to arbitrary p-forms, using essentially the same approach. However the pullbacks of general p-forms no longer resemble reverse-mode AD as it is commonly understood.

3 A Differential-form Pullback Programming Language

3.1 Syntax

Figure 1 presents the grammar of simple terms S and pullback terms P, and Figure 2 presents the type system. While the definition of simple terms is relatively standard (except for the new constructs, which will be discussed later), the definition of pullback terms P as sums of simple terms is not.

    Simple terms   S ::= x | λx.S | S P | πi(S) | ⟨S, S⟩ | r | f(P)
                       | J f · S | (λx.S)* · S | (Ω (λx.P)) · S | r*
    Pullback terms P ::= 0 | S | S + P

Figure 1. Grammar of simple terms S and pullback terms P. Assume a collection V of variables (typically x, y, z, ω), and a collection F (typically f, g, h) of easily-differentiable real-valued functions, in the sense that the Jacobian of f, J(f), can be called by the language; r and r range over R and Rⁿ respectively.

3.1.1 Sum and Linearity

The idea of sum is important since it specifies the "linear positions" in a simple term, just as it specifies the algebraic notion of linearity in mathematics. For example, x(y + z) is a S term but (x + y)z is not. This is because (x + y)z is the same as xz + yz, but x(y + z) cannot be so decomposed. Hence in S P, S is in a linear position but P is not. Similarly, in mathematics (f1 + f2)(x1) = f1(x1) + f2(x1) but in general f1(x1 + x2) ≠ f1(x1) + f1(x2), for smooth functions f1, f2 and points x1, x2. Hence the function f in an application f(x) is in a linear position while the argument x is not.

Formally we define the set lin(S) of linear variables in a simple term S by: y ∈ lin(S) if, and only if, y is in a linear position in S.

    lin(x) := {x}
    lin(λx.S) := lin(S) \ {x}
    lin(S P) := lin(S) \ FV(P)
    lin(πi(S)) := lin(S)
    lin(⟨S1, S2⟩) := lin(S1) ∩ lin(S2)
    lin(J f · S) := lin(S)
    lin((λx.S1)* · S2) := (lin(S1) \ FV(S2)) ∪ (lin(S2) \ FV(S1))
    lin(S) := ∅ otherwise.

For example, lin(x z (y z)) = {x}.

3.1.2 Dual Type, Jacobian, Dual Map and Pullback

Any term of the dual type σ* is considered a linear functional of σ. For example, ep* has the dual type Rⁿ*; the term ep* mimics the linear functional πp ∈ L(Rⁿ, R).

The Jacobian J f · S is considered as the Jacobian of f along S, which is a smooth function. For example, let f : Rᵐ → Rⁿ be "easily differentiable"; then J f · v mimics the Jacobian along v, i.e. the function λx.J(f)(x)(v).

The dual map (λx.S1)* · S2 is considered the dual of the linear functional S2 along the function λx.S1, where x ∈ lin(S1). For example, let r ∈ Rᵐ. The dual map (λv.(J f · v) r)* · ep* mimics (J(f)(r))*(πp) ∈ L(Rᵐ, R), which is the dual of πp along the Jacobian J(f)(r).

The pullback (Ω λx.P) · S is considered the pullback of the 1-form S along the function λx.P. For example, (Ω λx.f(x)) · (λx.ep*) mimics Ω(f)(λx.πp) ∈ Ω(Rᵐ), which is the pullback of the 1-form λx.πp along f.

Hence, to perform reverse-mode AD on a term λx.P at P′ with respect to ω, we consider the term ((Ω λx.P) · ω) P′.

3.1.3 Notations

We use syntactic sugar to ease writing. For n ≥ 1 and z a fresh variable:

    Rⁿ⁺¹ ≡ Rⁿ × R                     Ωσ ≡ σ ⇒ σ*
    (r1, ..., rn)ᵀ ≡ ⟨r1, ..., rn⟩     ⟨P1, P2, P3⟩ ≡ ⟨⟨P1, P2⟩, P3⟩
    Sπi ≡ πi(S)                        let x = t in s ≡ (λx.s) t
    Ω r ≡ λx.r*                        λ⟨x,y⟩.S ≡ λz.S[zπ1/x][zπ2/y]

Capture-free substitution is applied recursively, e.g. ((λx.S1)* · S2)[P′/z] ≡ (λx.S1[P′/z])* · (S2[P′/z]) and ((Ω λx.P) · S)[P′/z] ≡ (Ω (λx.P[P′/z])) · (S[P′/z]). We treat 0 as the unit of our sum terms, i.e. 0 ≡ 0 + 0, S ≡ 0 + S and S ≡ S + 0; and consider + as an associative and commutative operator. We also define S[S1 + S2/y] ≡ S[S1/y] + S[S2/y] if and only if y ∈ lin(S). For example, (S1 + S2) P ≡ S1 P + S2 P.

We finish this subsection with some examples that can be expressed in this language.

Example 3.1. Consider the running example of computing the Jacobian of f : ⟨x,y⟩ ↦ ((x+1)(2x+y²))² at ⟨1,3⟩. Assume g(⟨x,y⟩) := ⟨x+1, 2x+y²⟩, and that mult and pow2 are in the set of easily differentiable functions, i.e. g, mult, pow2 ∈ F. The function f can be presented by the term {⟨x,y⟩ : R²} ⊢ pow2(mult(g(⟨x,y⟩))) : R. More interestingly, the Jacobian of f at ⟨1,3⟩, i.e. J(f)(⟨1,3⟩), can be presented by the term

    ⊢ ((Ω λ⟨x,y⟩.pow2(mult(g(⟨x,y⟩)))) · (Ω 1)) ⟨1,3⟩ : R²*.

This is the application of the pullback Ω(f)(λx.1*) to the point ⟨1,3⟩, which we saw in Subsection 2.2 is the Jacobian of f at ⟨1,3⟩.
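As an aside, the lin(·) computation above is entirely mechanical; here is a small Python sketch (ours, not the paper's implementation) over a fragment of Figure 1's grammar, encoded with hypothetical dataclass constructors:

    from dataclasses import dataclass

    @dataclass
    class Var:  name: str                # x
    @dataclass
    class Lam:  x: str; body: object     # λx.S
    @dataclass
    class App:  fn: object; arg: object  # S P
    @dataclass
    class Pair: fst: object; snd: object # ⟨S, S⟩

    def fv(s):
        if isinstance(s, Var): return {s.name}
        if isinstance(s, Lam): return fv(s.body) - {s.x}
        if isinstance(s, App): return fv(s.fn) | fv(s.arg)
        if isinstance(s, Pair): return fv(s.fst) | fv(s.snd)

    def lin(s):
        if isinstance(s, Var): return {s.name}
        if isinstance(s, Lam): return lin(s.body) - {s.x}
        if isinstance(s, App): return lin(s.fn) - fv(s.arg)   # arguments are not linear
        if isinstance(s, Pair): return lin(s.fst) & lin(s.snd)

    # lin(x z (y z)) = {'x'}, as in the example above:
    print(lin(App(App(Var("x"), Var("z")), App(Var("y"), Var("z")))))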

Types:  σ, τ ::= R | σ1 × σ2 | σ1 ⇒ σ2 | σ*

Typing rules (writing Γ, x : σ for Γ ∪ {x : σ}):

    Γ ⊢ 0 : σ
    Γ ⊢ S : σ and Γ ⊢ P : σ            ⟹  Γ ⊢ S + P : σ
    Γ, x : σ ⊢ x : σ
    Γ, x : σ ⊢ S : τ                    ⟹  Γ ⊢ λx.S : σ ⇒ τ
    Γ ⊢ S : σ ⇒ τ and Γ ⊢ P : σ        ⟹  Γ ⊢ S P : τ
    Γ ⊢ S : σ1 × σ2                     ⟹  Γ ⊢ πi(S) : σi
    Γ ⊢ S1 : σ1 and Γ ⊢ S2 : σ2        ⟹  Γ ⊢ ⟨S1, S2⟩ : σ1 × σ2
    r ∈ R                               ⟹  Γ ⊢ r : R
    Γ ⊢ P : Rⁿ                          ⟹  Γ ⊢ f(P) : Rᵐ
    Γ ⊢ S : Rⁿ                          ⟹  Γ ⊢ J f · S : Rⁿ ⇒ Rᵐ
    r ∈ Rⁿ                              ⟹  Γ ⊢ r* : Rⁿ*
    Γ, x : σ ⊢ S1 : τ, Γ ⊢ S2 : τ*, x ∈ lin(S1)  ⟹  Γ ⊢ (λx.S1)* · S2 : σ*
    Γ, x : σ ⊢ P : τ and Γ ⊢ S : Ωτ    ⟹  Γ ⊢ (Ω λx.P) · S : Ωσ

Figure 2. The types and typing rules for DPPL, where Ωσ ≡ σ ⇒ σ* and f : Rⁿ → Rᵐ is easily differentiable, i.e. f ∈ F.

Example 3.2. Consider the function that takes a list of real numbers and returns the sum of the elements of the list. Using the standard Church encoding of List, i.e. List(X) ≡ (X → D → D) → (D → D), and [x1, x2, ..., xn] ≡ λf d.f xn (... (f x2 (f x1 d))) for some dummy type D, sum : List(R) → R is defined to be λl.l (λxy.x + y) 0. Hence the Jacobian of sum at the list [7, −1] can be expressed as

    {ω : Ω(List(R))} ⊢ ((Ω (sum)) · ω) [7, −1] : R*.

Now the question is how we could perform reverse-mode AD on this term. Recall that the result of a reverse-mode AD run on a function f : Rⁿ → Rᵐ at x ∈ Rⁿ, i.e. the p-th row of the Jacobian matrix of f at x, can be expressed as Ω(f)(λx.πp)(x), which is (J(f)(x))*((λx.πp)(f x)) = (J(f)(x))* × πp.

In the rest of this Section, we consider how the term ((Ω λy.P′) · ω) P, which mimics Ω(f)(ω)(x), can be reduced. To avoid expression swell, we first perform A-reduction, P′ →A* L, which decomposes a term into a series of "smaller" terms, as explained in Subsection 3.2. Then, we reduce ((Ω λy.L) · ω) P by induction on L, as explained in Subsection 3.3. Lastly, we complete our reduction strategy in Subsection 3.4.

We use the term in Example 3.1 as a running example in our reduction strategy to illustrate that this reduction is faithful to reverse-mode AD (in that it is exactly reverse-mode AD when restricted to first-order). The reduction of the term in Example 3.2 is given in Appendix A.2; it illustrates how reverse-mode AD can be performed on a higher-order function.

3.2 Divide: Administrative Reduction

We use the administrative reduction (A-reduction) of Sabry and Felleisen [28] to decompose a pullback term P into a let series L of elementary terms, i.e.

    P →A* let x1 = E1; ... ; xn = En in xn,

where elementary terms E and let series L are defined as

    E ::= 0 | z1 + z2 | z | λx.L | z1 z2 | zπi | ⟨z1, z2⟩ | r | f(z)
        | J f · z | (λx.L)* · z | (Ω λx.L) · z | r*
    L ::= let z = E in L | let z = E in z

Note that elementary terms E should be "fine enough" to avoid expression swell. The complete set of A-reductions on P can be found in Appendix B. We write →A* for the reflexive and transitive closure of →A.

Example 3.3. We decompose the term considered in Example 3.1, pow2(mult(g(⟨x,y⟩))), via administrative reduction:

    pow2(mult(g(⟨x,y⟩)))  →A*  let z1 = ⟨x,y⟩;
                                   z2 = g(z1);
                                   z3 = mult(z2);
                                   z4 = pow2(z3) in z4.

This is reminiscent of the decomposition of f into R² →g R² →∗ R →(−)² R before performing AD.
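For intuition, A-reduction here plays the role of the A-normal-form (ANF) pass familiar from compilers, flattening nested applications into a Wengert-style list of elementary bindings. A minimal Python sketch (ours; the tuple-based AST is hypothetical, not the paper's syntax):

    from itertools import count

    fresh = (f"z{i}" for i in count(1))

    def anf(expr, bindings):
        """expr is a nested tuple, e.g. ("pow2", ("mult", ...)); returns the
        variable naming expr, extending bindings in evaluation order."""
        if isinstance(expr, str):
            return expr                   # already a variable
        op, *args = expr
        arg_vars = [anf(a, bindings) for a in args]
        z = next(fresh)
        bindings.append((z, op, arg_vars))
        return z

    lets = []
    result = anf(("pow2", ("mult", ("g", ("pair", "x", "y")))), lets)
    # lets == [('z1', 'pair', ['x', 'y']), ('z2', 'g', ['z1']),
    #          ('z3', 'mult', ['z2']), ('z4', 'pow2', ['z3'])]; result == 'z4'

The resulting binding list is exactly the let series of Example 3.3.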

Let Series:
(7)  (Ω (λy.let x = E in x)) · ω  →  (Ω λy.E) · ω
(8)  (Ω (λy.let x = E in L)) · ω  →  (Ω λy.⟨y, E⟩) · ((Ω λ⟨y,x⟩.L) · ω)

Constant Functions:
(9)  ((Ω λy.E) · ω) V  →  0,   for y ∉ FV(E)

Linear Functions:
(10a) ((Ω λy.z + yπj) · ω) V  →  (λv.vπj)* · (ω (z + Vπj))
(10b) ((Ω λy.yπi + yπj) · ω) V  →  (λv.vπi + vπj)* · (ω (Vπi + Vπj))
(11)  ((Ω λy.y) · ω) V  →  (λv.v)* · (ω V)
(12)  ((Ω λy.yπi) · ω) V  →  (λv.vπi)* · (ω Vπi)
(13a) ((Ω λy.⟨yπi, z⟩) · ω) V  →  (λv.⟨vπi, 0⟩)* · (ω ⟨Vπi, z⟩)
(13b) ((Ω λy.⟨z, yπj⟩) · ω) V  →  (λv.⟨0, vπj⟩)* · (ω ⟨z, Vπj⟩)
(13c) ((Ω λy.⟨yπi, yπj⟩) · ω) V  →  (λv.⟨vπi, vπj⟩)* · (ω ⟨Vπi, Vπj⟩)
(14)  ((Ω λy.J f · yπi) · ω) V  →  (λv.J f · vπi)* · (ω (J f · Vπi))

Function Symbols:
(15)  ((Ω λy.f(yπi)) · ω) V  →  (λv.(J f · vπi) Vπi)* · (ω (f(Vπi)))

Dual Maps:
(16a) ((Ω λy.(λx.L)* · yπi) · ω) V  →  (λv.(λx.L)* · vπi)* · (ω ((λx.L)* · Vπi)),   if y ∉ FV(λx.L)
(16b) if ((Ω λy.L) · ω′) V →* (λv.S)* · (ω′ V′) and y ∉ FV(z), then
      ((Ω λy.(λx.L)* · z) · ω) V  →  (λv.(λx.S)* · z)* · (ω ((λx.L[V/y])* · z))
(16c) if ((Ω λy.L) · ω′) V →* (λv.S)* · (ω′ V′), then
      ((Ω λy.(λx.L)* · yπi) · ω) V  →  (λv.(λx.L[V/y])* · vπi + (λx.S)* · Vπi)* · (ω ((λx.L[V/y])* · Vπi))

Pullback Terms:
(17)  if ((Ω λx.L) · z) a →* (λv.S)* · (z L[a/x]), then
      ((Ω λy.(Ω λx.L) · z) · ω) V  →  ((Ω λy.λa.(λv.S)* · (z L[a/x])) · ω) V

Abstraction:
(18)  if ((Ω λy.L) · ω′) V →* (λv.S)* · (ω′ L[V/y]) and x ∉ FV(V), then
      ((Ω λy.λx.L) · ω) V  →  (λv.λx.S)* · (ω (λx.L[V/y]))

Application:
(19a) ((Ω λy.yπi z) · ω) V  →  (λv.vπi z)* · (ω (Vπi z))
(19b) if ((Ω λz.P′) · ω′) Vπj →* (λv′.S′)* · (ω′ (P′[Vπj/z])) and Vπi ≡ λz.P′, then
      ((Ω λy.yπi yπj) · ω) V  →  (λv.vπi Vπj + S′[vπj/v′])* · (ω (Vπi Vπj))
(19c) if ((Ω λz.V′) · ω′) Vπj →* 0 and Vπi ≡ λz.V′, then
      ((Ω λy.yπi yπj) · ω) V  →  (λv.vπi Vπj)* · (ω (Vπi Vπj))

Pair:
(20a) if ((Ω λy.E) · ω′) V →* (λv.S)* · (ω′ (E[V/y])) and y ∈ FV(E), then
      ((Ω λy.⟨y, E⟩) · ω) V  →  (λv.⟨v, S⟩)* · (ω ⟨V, E[V/y]⟩)
(20b) ((Ω λy.⟨y, E⟩) · ω) V  →  (λv.⟨v, 0⟩)* · (ω ⟨V, E⟩),   for y ∉ FV(E)

Figure 3. Pullback Reductions

3.3 Conquer: Pullback Reduction

3.3.1 Let Series

After decomposing P′ to a let series L of elementary terms via A-reductions in (Ω λy.P′) · ω, we reduce (Ω λy.L) · ω by induction on L as shown in Figure 3 (Let Series). Reduction 7 is the base case and Reduction 8 expresses the contra-variant property of pullbacks.

Example 3.4. Take (Ω λ⟨x,y⟩.pow2(mult(g(⟨x,y⟩)))) · ω, discussed in Example 3.1, which when applied to the point ⟨1,3⟩ is the Jacobian J(f)(⟨1,3⟩) where f(⟨x,y⟩) := ((x+1)(2x+y²))². In Example 3.3, we showed that pow2(mult(g(⟨x,y⟩))) is A-reduced to a let series L. Now via Reductions 7 and 8, (Ω λ⟨x,y⟩.L) · ω is reduced to a series of pullbacks along elementary terms:

    (Ω λ⟨x,y⟩. let z1 = ⟨x,y⟩; z2 = g(z1); z3 = mult(z2); z4 = pow2(z3) in z4) · ω
    →*  (Ω λ⟨x,y⟩.⟨⟨x,y⟩, ⟨x,y⟩⟩) ·
        ((Ω λ⟨⟨x,y⟩, z1⟩.⟨⟨x,y⟩, z1, g(z1)⟩) ·
        ((Ω λ⟨⟨x,y⟩, z1, z2⟩.⟨⟨x,y⟩, z1, z2, mult(z2)⟩) ·
        ((Ω λ⟨⟨x,y⟩, z1, z2, z3⟩.pow2(z3)) · ω)))

Via A-reductions and Reductions 7 and 8, (Ω λy.P′) · ω is reduced to a series of pullbacks along elementary terms, (Ω λy.E1) · (... ((Ω λy.En) · ω)). Now we define the reduction of a pullback along an elementary term when applied to a value³, i.e. ((Ω λy.E) · ω) V.

Recall that the pullback of a 1-form ω ∈ Ω(F) along a smooth map f : E → F is defined to be Ω(f)(ω) : x ↦ (J(f)(x))*(ω(f x)). Hence we have the following pullback reduction

    ((Ω λy.E) · ω) V  →  (λv.S)* · (ω (E[V/y]))

of the application ((Ω λy.E) · ω) V, which mimics the pullback of a variable ω along an abstraction λy.E at a term V. But how should one define the simple term S in (λv.S)* · (ω (E[V/y])) so that λv.S mimics the Jacobian of f at x, i.e. J(f)(x)? We do so by induction on the elementary terms E, as shown in Figure 3, Reductions 9-20.

³A value is a normal form of the reduction strategy. Its definition will be made precise in the next subsection.

Remark 3.5. For readers familiar with differential λ-calculus [15], S is the result of substituting a linear occurrence of y by v, and then substituting all free occurrences of y by V in the term E. Our approach is different from differential λ-calculus in that we define a reduction strategy instead of a substitution. A comprehensive comparison between our language and differential λ-calculus is given in Section 5.

3.3.2 Constant Functions

If y is not a free variable in E, then λy.E mimics a constant function. The Jacobian of a constant function is 0, hence we reduce ((Ω λy.E) · ω) V to (λv.0)* · (ω (E[V/y])), which is the sugar for 0, as shown in Figure 3 (Constant Functions), Reduction 9. The redexes ((Ω λy.0) · ω) V, ((Ω λy.r) · ω) V and ((Ω λy.r*) · ω) V all reduce to 0. Henceforth, we assume y ∈ FV(E).

3.3.3 Linear Functions

We consider the redexes where y ∈ lin(E). Then λy.E mimics a linear function, whose Jacobian is itself. Hence ((Ω λy.E) · ω) V is reduced to (λv.S)* · (ω (E[V/y])), where S is the result of substituting y by v in E. Figure 3 (Linear Functions), Reductions 10-14, shows how they are reduced.

3.3.4 Smooth Functions

Now consider the redexes where y might not be a linear variable in E. All reductions are shown in Figure 3.

Function Symbols. Let f be "easily differentiable". Then λy.f(yπi) mimics f ∘ πi, whose Jacobian at x is J(f)(πi(x)) ∘ πi. Hence the Jacobian of λy.f(yπi) at V is λv.(J f · vπi) Vπi, and ((Ω λy.f(yπi)) · ω) V is reduced to (λv.(J f · vπi) Vπi)* · (ω (f(Vπi))), as shown in Reduction 15.

Dual Maps. Consider the Jacobian of λy.(λx.L)* · z at V. It is easy to see that the result varies depending on where the variable y is located in the dual map (λx.L)* · z. We consider three cases.

First, if y ∉ FV(λx.L), we must have z ≡ yπi. Then y is a linear variable in (λx.L)* · yπi and so the Jacobian of λy.(λx.L)* · yπi at V is λv.(λx.L)* · vπi. Hence we have Reduction 16a.

Second, say y ∉ FV(z). Since dual and abstraction are both linear operations, and y is only free in L, the Jacobian of λy.(λx.L)* · z at V should be λv.(λx.S′)* · z, where λv.S′ is the Jacobian of λy.L at V. To find the Jacobian of λy.L at V, we reduce ((Ω λy.L) · ω) V to (λv.S′)* · (ω L[V/y]); then λv.S′ is the Jacobian of λy.L at V. The reduction is given in Reduction 16b. Note that this reduction avoids expression swell, as we are reducing the let series L in λy.(λx.L)* · z using our pullback reductions, which do not suffer from expression swell.

Finally, for y ∈ FV(λx.L) ∩ FV(z), the Jacobian of λy.(λx.L)* · z at V is the "sum" of the results we have for the two cases above, i.e. λv.(λx.L)* · vπi + (λx.S)* · yπi, where the remaining free occurrences of y are substituted by V, since the Jacobian of a bilinear function l : X1 × X2 → Y is J(l)(⟨x1, x2⟩)(⟨v1, v2⟩) = l⟨x1, v2⟩ + l⟨v1, x2⟩. Hence we have Reduction 16c.

Pullback Terms. Consider ((Ω λy.(Ω λx.L) · z) · ω) V. Instead of reducing it to some (λv.S)* · (ω (((Ω λx.L) · z)[V/y])) like the others, here we simply reduce ((Ω λx.L) · z) a to (λv.S)* · (z L[a/x]), where a is a fresh variable and z ≠ x, and replace (Ω λx.L) · z by λa.(λv.S)* · (z L[a/x]) in ((Ω λy.(Ω λx.L) · z) · ω) V, as shown in Reduction 17.

Abstraction. Consider the Jacobian of λy.λx.L at V.

We follow the treatment of exponentials in differential λ-category [11], where the (D-curry) rule states that for all f : Y × X → A, D[cur(f)] = cur(D[f] ∘ ⟨π1 × 0X, π2 × IdX⟩), which means J(cur(f))(y) is equal to

    λv.J(cur(f))(y)(v) = λv.λx.J(f⟨−, x⟩)(y)(v).

According to this (D-curry) rule, the Jacobian of λy.λx.L at V should be λv.λx.S, where λv.S is the Jacobian of λy.L at V. Hence, similarly to the dual map case, we first reduce ((Ω λy.L) · ω) V to (λv.S)* · (ω L[V/y]) and obtain the Jacobian of λy.L at V, i.e. λv.S, and then reduce ((Ω λy.λx.L) · ω) V to (λv.λx.S)* · (ω (λx.L[V/y])), as shown in Reduction 18.

Application. Consider the Jacobian of λy.z1 z2 at V. Note that z1 and z2 may or may not contain y as a free variable. Hence, there are two cases.

First, we consider λy.yπi z where z is fresh. Since y ∈ lin(yπi z), λy.yπi z mimics a linear function, and hence its Jacobian at V is λv.vπi z. So ((Ω λy.yπi z) · ω) V is reduced to (λv.vπi z)* · (ω (Vπi z)), as shown in Reduction 19a.

Second, we consider the Jacobian of λy.yπi yπj at V. Now y is not a linear variable in yπi yπj, since it occurs in the argument yπj. As proved in Lemma 4.4 of [21], every differential λ-category satisfies the (D-eval) rule, D[ev ∘ ⟨h, g⟩] = ev ∘ ⟨D[h], g ∘ π2⟩ + D[uncur(h)] ∘ ⟨⟨0, D[g]⟩, ⟨π2, g ∘ π2⟩⟩, which means J(ev ∘ ⟨h, g⟩)(x)(v) is equal to

    J(h)(x)(v)(g(x)) + J(h(x))(g(x))(J(g)(x)(v))

for all h : C → (A ⇒ B) and g : C → A. Hence, the Jacobian of ev ∘ ⟨πi, πj⟩ at x along v, i.e. J(ev ∘ ⟨πi, πj⟩)(x)(v), is πi(v)(πj(x)) + J(πi(x))(πj(x))(πj(v)). So the Jacobian of λy.yπi yπj at V is λv.vπi Vπj + S′[vπj/v′], where λv′.S′ is the Jacobian of Vπi at Vπj. Hence, assuming Vπi ≡ λz.P′, we first reduce ((Ω λz.P′) · ω) Vπj to (λv′.S′)* · (ω (P′[Vπj/z])) and obtain λv′.S′ as the Jacobian of λz.P′ at Vπj. Then we reduce ((Ω λy.yπi yπj) · ω) V to (λv.vπi Vπj + S′[vπj/v′])* · (ω (Vπi Vπj)), as shown in Reduction 19b.

If ((Ω λz.V′) · ω) Vπj reduces to 0, which means λz.V′ ≡ Vπi is a constant function, the Jacobian of λy.yπi yπj at V is just λv.vπi Vπj, and we have Reduction 19c.

Remark 3.6. Doing induction on the elementary terms defined in Subsection 3.2, we can see that there are a few elementary terms E for which ((Ω λy.E) · ω) V is not a redex, namely:

    value 1: ((Ω λy.z yπi) · ω) V, where z is a free variable;
    value 2: ((Ω λy.yπi yπj) · ω) V, where Vπi ≢ λz.P′.

Having these terms as values makes sense intuitively, since they have "inappropriate" values in function positions. Value 1 has a free variable z in a function position. Value 2 substitutes yπi by Vπi, which is a non-abstraction, in a function position.

Pair. Last but not least, we consider the Jacobian of λy.⟨y, E⟩ at V. It is easy to see that the Jacobian is λv.⟨v, S⟩, where λv.S is the Jacobian of λy.E at V, as shown in Reductions 20a and 20b.

Example 3.7. Take our running example. In Examples 3.3 and 3.4 we showed that, via A-reductions and Reductions 7 and 8, (Ω λ⟨x,y⟩.pow2(mult(g(⟨x,y⟩)))) · ω is reduced to

    (Ω λ⟨x,y⟩.⟨⟨x,y⟩, ⟨x,y⟩⟩) ·
    ((Ω λ⟨⟨x,y⟩, z1⟩.⟨⟨x,y⟩, z1, g(z1)⟩) ·
    ((Ω λ⟨⟨x,y⟩, z1, z2⟩.⟨⟨x,y⟩, z1, z2, mult(z2)⟩) ·
    ((Ω λ⟨⟨x,y⟩, z1, z2, z3⟩.pow2(z3)) · ω)))

We show how it can be reduced when applied to ⟨1,3⟩:

    (the term above) ⟨1,3⟩
    →(20a)  (λ⟨v1,v2⟩.⟨⟨v1,v2⟩, ⟨v1,v2⟩⟩)* ·
            (((Ω λ⟨⟨x,y⟩, z1⟩.⟨⟨x,y⟩, z1, g(z1)⟩) · (...)) ⟨⟨1,3⟩, ⟨1,3⟩⟩)
    →(20a, 15, 3)  (λ⟨v1,v2⟩.⟨⟨v1,v2⟩, ⟨v1,v2⟩⟩)* ·
            ((λ⟨⟨v1,v2⟩, v3⟩.⟨⟨v1,v2⟩, v3, (Jg · v3)⟨1,3⟩⟩)* ·
            (((Ω λ⟨⟨x,y⟩, z1, z2⟩.⟨⟨x,y⟩, z1, z2, mult(z2)⟩) · (...)) ⟨⟨1,3⟩, ⟨1,3⟩, ⟨2,11⟩⟩))    (⋆)
    →(20a, 15, 3)  ... ·
            ((λ⟨⟨v1,v2⟩, v3, v4⟩.⟨⟨v1,v2⟩, v3, v4, (Jmult · v4)⟨2,11⟩⟩)* ·
            (((Ω λ⟨⟨x,y⟩, z1, z2, z3⟩.pow2(z3)) · ω) ⟨⟨1,3⟩, ⟨1,3⟩, ⟨2,11⟩, 22⟩))
    →(15, 3)  (λ⟨v1,v2⟩.⟨⟨v1,v2⟩, ⟨v1,v2⟩⟩)* ·
            ((λ⟨⟨v1,v2⟩, v3⟩.⟨⟨v1,v2⟩, v3, (Jg · v3)⟨1,3⟩⟩)* ·
            ((λ⟨⟨v1,v2⟩, v3, v4⟩.⟨⟨v1,v2⟩, v3, v4, (Jmult · v4)⟨2,11⟩⟩)* ·
            ((λ⟨⟨v1,v2⟩, v3, v4, v5⟩.(Jpow2 · v5) 22)* · (ω 484))))

Notice how this is reminiscent of the forward phase of reverse-mode AD performed on f : ⟨x,y⟩ ↦ ((x+1)(2x+y²))² at ⟨1,3⟩, considered in Subsection 2.1.

Moreover, we used the reduction f(r) →(3) f(r) a couple of times in the argument position of an application. This is to avoid expression swell. Note that 1+1 is only evaluated once in (⋆), even though the result is used in various computations. Hence, we must have a call-by-value reduction strategy, as presented below.

3.4 Combine

The reductions in Subsections 3.2 and 3.3 are the most interesting development of the paper. However, they alone are not enough to complete a reduction strategy. In this subsection, we define contexts and redexes so that any non-value term can be reduced.

The definition of context C is the standard call-by-value context, extended with duals and pullbacks. Notice that the context (Ω λy.CA) · S contains an A-context CA, defined in Subsection 3.2. This follows from the idea of reverse-mode AD to decompose a term into elementary terms before differentiating them.

    C ::= [] | C + P | V + C | C P | V C | πi(C) | ⟨C, S⟩ | ⟨V, C⟩
        | f(C) | J f · C | (λx.S)* · C | (λx.C)* · V
        | (Ω λy.CA) · S | (Ω λy.E) · C | (Ω λy.⟨y, E⟩) · C

Our redexes r extend the standard call-by-value redexes with four sets of terms:

    r ::= (λx.S) V | πi(⟨V1, V2⟩) | f(r) | (J f · r) r′
        | (λv.(J f · v) r)* · r′* | (λv1.V1)* · ((λv2.V2)* · V3)
        | (Ω λy.L) · S | ((Ω λy.E) · V1) V2 | ((Ω λy.⟨y, E⟩) · V1) V2

where either V2 ≢ (J f · v2) r or V3 ≢ r′*. A value V is a pullback term that cannot be reduced further, i.e. a term in normal form.

The following standard lemma, which is proved by induction on P, tells us that there is at most one redex to reduce.

Lemma 3.8. Every term P can be expressed either as C[r] for some unique context C and redex r, or as a value V.

Let's look at the reductions of redexes. (1-4) are the standard call-by-value reductions; (5) reduces the dual along a linear map; and (6) is the contra-variant property of dual maps.

    (1) (λx.S) V → S[V/x]
    (2) πi(⟨V1, V2⟩) → Vi
    (3) f(r) → f(r)
    (4) (J f · r) r′ → J(f)(r′)(r)
    (5) (λv.(J f · v) r)* · r′* → ((J(f)(r))*(r′))*
    (6) (λv1.V1)* · ((λv2.V2)* · V3) → (λv1.V2[V1/v2])* · V3

where either V2 ≢ (J f · v2) r or V3 ≢ r′*.

We say C[r] → C[V] if r → V, for all reductions except those with a proof tree, i.e. Reductions 16b, 16c, 17, 18, 19b, 19c and 20a. For those, given a rule with premise r →* V and conclusion r′ → V′, we have C[r′[V1/ω][V2/V]] → C[V′[V1/ω][V2/V]] whenever the corresponding premise instance r[V1/ω][V2/V] →* V[V1/ω][V2/V] holds.

Example 3.9. Consider our running example P ≡ ((Ω λ⟨x,y⟩.pow2(mult(g(⟨x,y⟩)))) · (Ω 1)) ⟨1,3⟩, which represents the Jacobian of f : ⟨x,y⟩ ↦ ((x+1)(2x+y²))² at ⟨1,3⟩, as shown in Example 3.1. Replacing ω by Ω 1 ≡ λx.1* in Examples 3.3, 3.4 and 3.7, P is reduced to

    (λ⟨v1,v2⟩.⟨⟨v1,v2⟩, ⟨v1,v2⟩⟩)* ·
    ((λ⟨⟨v1,v2⟩, v3⟩.⟨⟨v1,v2⟩, v3, (Jg · v3)⟨1,3⟩⟩)* ·
    ((λ⟨⟨v1,v2⟩, v3, v4⟩.⟨⟨v1,v2⟩, v3, v4, (Jmult · v4)⟨2,11⟩⟩)* ·
    ((λ⟨⟨v1,v2⟩, v3, v4, v5⟩.(Jpow2 · v5) 22)* · ((λx.1*) 484))))

Via Reduction 5 and β-reduction (writing 0 for the zero of the appropriate type), P is further reduced step by step:

    →  ... · ((λ⟨⟨v1,v2⟩, v3, v4⟩.⟨⟨v1,v2⟩, v3, v4, (Jmult · v4)⟨2,11⟩⟩)* · ⟨0, 0, 0, 0, 44⟩*)
    →  ... · ((λ⟨⟨v1,v2⟩, v3⟩.⟨⟨v1,v2⟩, v3, (Jg · v3)⟨1,3⟩⟩)* · ⟨0, 0, 0, ⟨484, 88⟩⟩*)
    →  (λ⟨v1,v2⟩.⟨⟨v1,v2⟩, ⟨v1,v2⟩⟩)* · ⟨0, 0, ⟨660, 528⟩⟩*
    →  ⟨660, 528⟩*

Notice how this mimics the reverse phase of reverse-mode AD on f : ⟨x,y⟩ ↦ ((x+1)(2x+y²))² at ⟨1,3⟩, considered in Subsection 2.1.

Examples 3.3, 3.4 and 3.7 demonstrate that our reduction strategy is faithful to reverse-mode AD (in that it is exactly reverse-mode AD when restricted to first-order).

3.5 Continuation-Passing Style

Differential 1-forms ΩE := C∞(E, L(E, R)) are similar to continuations of E with "answer" type R. We can indeed write our reduction in a continuation-passing style (CPS) manner. Let ⟨P | S⟩y ≡ (Ω λy.P) · S; then we can treat ⟨P | S⟩y as a configuration of an element Γ ∪ {y : σ} ⊢ P : τ and a "continuation" Γ ⊢ S : Ωτ. The rules for the redexes ⟨L | S⟩y, (⟨E | V1⟩y) V2 and (⟨⟨y, E⟩ | V1⟩y) V2 can be directly converted from Reductions 7-20. For example, Reduction 8 can be written as ⟨let x = E in L | ω⟩y → ⟨⟨y, E⟩ | ⟨L | ω⟩⟨y,x⟩⟩y.

We prefer to write our language without the explicit mention of CPS, since this paper focuses on the syntactic notion of reverse-mode AD using pullbacks and 1-forms. Also, a 1-form of type σ is more precisely described as an element of the function type Ωσ ≡ σ ⇒ σ*, than of the continuation of σ, i.e. σ ⇒ (σ ⇒ R).

4 Model

We show that any differential λ-category satisfying the Hahn-Banach Separation Theorem can soundly model our language.

4.1 Differential Lambda-Category

Cartesian differential categories [9] aim to axiomatise fundamental properties of the derivative. Indeed, any model of synthetic differential geometry has an associated Cartesian differential category [13].

Cartesian differential category. A category C is a Cartesian differential category if
• every homset C(A, B) is enriched with a commutative monoid (C(A, B), +AB, 0AB) and the additive structure is preserved by composition on the left, i.e. (g + h) ∘ f = g ∘ f + h ∘ f and 0 ∘ f = 0;
• it has products, and projections and pairings of additive maps are additive (a morphism f is additive if it preserves the additive structure of the homset on the right, i.e. f ∘ (g + h) = f ∘ g + f ∘ h and f ∘ 0 = 0);

and it has an operator D[−] : C(A, B) → C(A × A, B) that satisfies the following axioms:

[CD1] D is linear: D[f + g] = D[f] + D[g] and D[0] = 0
[CD2] D is additive in its first coordinate: D[f] ∘ ⟨h + k, v⟩ = D[f] ∘ ⟨h, v⟩ + D[f] ∘ ⟨k, v⟩ and D[f] ∘ ⟨0, v⟩ = 0
[CD3] D behaves with projections: D[Id] = π1, D[π1] = π1 ∘ π1 and D[π2] = π2 ∘ π1
[CD4] D behaves with pairings: D[⟨f, g⟩] = ⟨D[f], D[g]⟩
[CD5] Chain rule: D[g ∘ f] = D[g] ∘ ⟨D[f], f ∘ π2⟩
[CD6] D[f] is linear in its first component: D[D[f]] ∘ ⟨⟨g, 0⟩, ⟨h, k⟩⟩ = D[f] ∘ ⟨g, k⟩
[CD7] Independence of order of partial differentiation: D[D[f]] ∘ ⟨⟨0, h⟩, ⟨g, k⟩⟩ = D[D[f]] ∘ ⟨⟨0, g⟩, ⟨h, k⟩⟩

We call D the Cartesian differential operator of C.

Example 4.1. The category FVect of finite dimensional vector spaces and differentiable functions is a Cartesian differential category, with the Cartesian differential operator D[f]⟨v, x⟩ = J(f)(x)(v).

A Cartesian differential operator does not necessarily behave well with exponentials. Hence, Bucciarelli et al. [11] added the (D-curry) rule and introduced differential λ-categories.

Differential λ-category. A Cartesian differential category is a differential λ-category if
• it is Cartesian closed,
• λ(−) preserves the additive structure, i.e. λ(f + g) = λ(f) + λ(g) and λ(0) = 0,
• D[−] satisfies the (D-curry) rule: for any f : A1 × A2 → B, D[λ(f)] = λ(D[f] ∘ ⟨π1 × 0A2, π2 × IdA2⟩).

Linearity. A morphism f in a differential λ-category is linear if D[f] = f ∘ π1.

Example 4.2. The category Con∞ of convenient vector spaces and smooth maps, considered by [8], is a differential λ-category with the Cartesian differential operator D[f]⟨v, x⟩ := lim_{t→0} (f(x + tv) − f(x))/t, as shown in Lemma E.2.

4.2 Hahn-Banach Separation Theorem

We say a differential λ-category C satisfies the Hahn-Banach Separation Theorem if R is an object in C and, for any object A in C and distinct elements x, y in A, there exists a linear morphism l : A → R that separates x and y, i.e. l(x) ≠ l(y).

Example 4.3. The category Con∞ of convenient vector spaces and smooth maps satisfies the Hahn-Banach Separation Theorem, as shown in Proposition E.3.

4.3 Interpretation

Let C be a differential λ-category that satisfies the Hahn-Banach Separation Theorem. Since C is Cartesian closed, the interpretations of the λ-calculus terms are standard, and hence omitted. The full set of interpretations can be found in Appendix C.

    ⟦R⟧ := R            ⟦σ1 × σ2⟧ := ⟦σ1⟧ × ⟦σ2⟧
    ⟦σ*⟧ := L(⟦σ⟧, R)   ⟦σ1 ⇒ σ2⟧ := C(⟦σ1⟧, ⟦σ2⟧)

where L(⟦σ⟧, R) := {f ∈ C(⟦σ⟧, R) | D[f] = f ∘ π1} is the set of all linear morphisms from ⟦σ⟧ to R.

    ⟦0⟧γ := 0
    ⟦⟨r1, ..., rn⟩*⟧γ := λ⟨v1, ..., vn⟩. Σ_{i=1}^{n} ri vi
    ⟦S + P⟧γ := ⟦S⟧γ + ⟦P⟧γ
    ⟦(λx.S1)* · S2⟧γ := λv.⟦S2⟧γ (⟦S1⟧⟨γ, v⟩)
    ⟦(Ω λx.P) · S⟧γ := λx.λv.⟦S⟧γ (⟦P⟧⟨γ, x⟩) (D[cur(⟦P⟧)γ]⟨v, x⟩)
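To see that this interpretation is precisely the pullback of 1-forms from Subsection 2.2, the following short calculation (ours, under the Con∞ reading of Example 4.2, where D[f]⟨v, x⟩ = J(f)(x)(v)) unfolds the definition, writing f := cur(⟦P⟧)γ and ω := ⟦S⟧γ:

    ⟦(Ω λx.P) · S⟧γ = λx.λv. ω(f x)(D[f]⟨v, x⟩)
                    = λx.λv. ω(f x)(J(f)(x)(v))
                    = λx. (J(f)(x))*(ω(f x))
                    = Ω(f)(ω).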

4.4 Correctness

We verify our definitions of linearity and substitution in Lemma 4.4 and Lemma 4.5 respectively.

Lemma 4.4 (Linearity). Let Γ1 ∪ {x : σ1} ⊢ P1 : τ and Γ2 ⊢ P2 : σ*. Let γ1 ∈ ⟦Γ1⟧ and γ2 ∈ ⟦Γ2⟧. Then,
1. if x ∈ lin(P1), then cur(⟦P1⟧)γ1 is linear, i.e. D[cur(⟦P1⟧)γ1] = (cur(⟦P1⟧)γ1) ∘ π1;
2. ⟦P2⟧γ2 is linear, i.e. D[⟦P2⟧γ2] = (⟦P2⟧γ2) ∘ π1.

Lemma 4.5 (Substitution). ⟦Γ ⊢ S[P/x] : τ⟧ = ⟦Γ ∪ {x : σ} ⊢ S : τ⟧ ∘ ⟨Id⟦Γ⟧, ⟦Γ ⊢ P : σ⟧⟩.

Any differential λ-category satisfying the Hahn-Banach Separation Theorem is a sound model of our language. Note that the Hahn-Banach Separation Theorem is crucial in the proof.

Theorem 4.6 (Correctness of Reductions). Let Γ ⊢ P : σ.
1. P →A P′ implies ⟦P⟧ = ⟦P′⟧.
2. P → P′ implies ⟦P⟧ = ⟦P′⟧.

Proof. The full proof can be found in Appendix E. For 2, we proceed by case analysis on the reductions of pullback terms. Consider Reduction 19b. Let γ ∈ ⟦Γ⟧. By the induction hypothesis, and Vπi ≡ λz.P′, we have ⟦((Ω λz.P′) · ω) Vπj⟧ = ⟦(λv′.S′)* · (ω (P′[Vπj/z]))⟧, which means that for any 1-form φ and any v,

    φ(⟦P′⟧⟨γ, ⟦Vπj⟧γ⟩)(D[cur(⟦P′⟧)γ]⟨v, ⟦Vπj⟧γ⟩) = φ(⟦P′⟧⟨γ, ⟦Vπj⟧γ⟩)(⟦S′⟧⟨γ, v⟩).

Let l be a linear morphism to R; then λx.l is a 1-form, and hence we have l(D[cur(⟦P′⟧)γ]⟨v, ⟦Vπj⟧γ⟩) = l(⟦S′⟧⟨γ, v⟩). By the contrapositive of the Hahn-Banach Separation Theorem, this implies D[cur(⟦P′⟧)γ]⟨v, ⟦Vπj⟧γ⟩ = ⟦S′⟧⟨γ, v⟩.

Note that by (D-eval) in [21], D[ev ∘ ⟨πi, πj⟩]⟨v, x⟩ = πi(v)(πj(x)) + D[πi(x)]⟨πj(v), πj(x)⟩. Hence we have

    ⟦((Ω λy.yπi yπj) · ω) V⟧γ
    = λv.⟦ω⟧γ (⟦yπi yπj⟧⟨γ, ⟦V⟧γ⟩) (D[ev ∘ ⟨πi, πj⟩]⟨v, ⟦V⟧γ⟩)
    = λv.⟦ω⟧γ (⟦Vπi Vπj⟧γ) (vπi(⟦Vπj⟧γ) + D[⟦Vπi⟧γ]⟨vπj, ⟦Vπj⟧γ⟩)
    = λv.⟦ω⟧γ (⟦Vπi Vπj⟧γ) (vπi(⟦Vπj⟧γ) + D[cur(⟦P′⟧)γ]⟨vπj, ⟦Vπj⟧γ⟩)
    = λv.⟦ω⟧γ (⟦Vπi Vπj⟧γ) (vπi(⟦Vπj⟧γ) + ⟦S′⟧⟨γ, vπj⟩)
    = ⟦(λv.vπi Vπj + S′[vπj/v′])* · (ω (Vπi Vπj))⟧γ.  ∎

A simple corollary of Theorem 4.6 is that types are invariant under reductions.

Corollary 4.7 (Subject Reduction). For any pullback terms P and P′ where P → P′: if Γ ⊢ P : σ, then Γ ⊢ P′ : σ.

4.5 Reverse-mode AD

Recall that performing reverse-mode AD on a real-valued function f : Rⁿ → Rᵐ at a point x0 ∈ Rⁿ computes a row of the Jacobian matrix J(f)(x0), i.e. (J(f)(x0))*(πp).

The following corollary tells us that our reduction is faithful to reverse-mode AD (in that it is exactly reverse-mode AD when restricted to first-order) and that we can perform reverse-mode AD on any abstraction, which might contain higher-order terms, duals, pullbacks and free variables.

Corollary 4.8. Let Γ ∪ {y : σ} ⊢ P1 : τ, Γ ⊢ P2 : σ, and γ ∈ ⟦Γ⟧.
1. Let σ ≡ Rⁿ and τ ≡ Rᵐ. If ((Ω λy.P1) · ep*) P2 →* V, then the p-th row of the Jacobian matrix of ⟦P1⟧⟨γ, −⟩ at ⟦P2⟧γ is (⟦V⟧γ)*.
2. Let l be a linear morphism from ⟦τ⟧ to R. If ((Ω λy.P1) · ω) P2 →* (λv.P1′)* · (ω P2′) for some fresh variable ω, then the derivative of l ∘ (⟦P1⟧⟨γ, −⟩) at ⟦P2⟧γ along some v ∈ ⟦σ⟧ is l(⟦P1′⟧⟨γ, λx.l, v⟩), i.e. D[l ∘ (⟦P1⟧⟨γ, −⟩)]⟨v, ⟦P2⟧γ⟩ = l(⟦P1′⟧⟨γ, λx.l, v⟩).

Example 4.9. In Example 3.9, we showed that ((Ω λ⟨x,y⟩.pow2(mult(g(⟨x,y⟩)))) · (Ω 1)) ⟨1,3⟩ →* ⟨660, 528⟩*. Note that (660 528) is exactly the Jacobian matrix of f : ⟨x,y⟩ ↦ ((x+1)(2x+y²))² at ⟨1,3⟩.

5 Related Work

We discuss recent works on calculi and languages that provide differentiation capabilities.

5.1 Differential Lambda-Calculus

The standard bearer is none other than the differential λ-calculus [15], which has inspired the design of our language. The implementation induced by differential λ-calculus is a form of symbolic differentiation, which suffers from expression swell. For this reason, Manzyuk [22] introduced the perturbative λ-calculus, a λ-calculus with a forward-mode AD operator. Our language is complementary to these calculi, in that it implements higher-order reverse-mode AD; moreover, it is call-by-value, which is crucial for reverse-mode AD to avoid expression swell, as illustrated in Example 3.7.

What is the relationship between our language and differential λ-calculus? We can give a precise answer via a compositional translation (−)t to a differential λ-calculus extended with real numbers, function symbols, pairs and projections, defined as follows:

    s, t ::= x | λx.s | s T | Ds · t | πi(s) | ⟨s, t⟩ | r | f(T) | Df · t
    S, T ::= 0 | s | s + T        where r ∈ R, f ∈ F

The major cases of the definition of (−)t are:

    (σ*)t := σ ⇒ R
    (⟨r1, ..., rn⟩*)t := λv. Σ_{i=1}^{n} fi(πi(v))
    (J f · S)t := Df · St

    ((λy.S1)* · S2)t := λv.(S2)t ((λy.(S1)t) v)
    ((Ω λy.P) · S)t := λx.λv. St ((λy.Pt) x) ((D(λy.Pt) · v) x)

for fi := ri × −. (The definitions are provided in full in Appendix D.)

Because differential λ-calculus does not have a linear function type, (S1)t is no longer in a linear position in ((λx.S1)* · S2)t. Though the translation does not preserve linearity, it does preserve reductions and interpretations (Lemma 5.1).

Lemma 5.1. Let P be a term.
1. If P → P′, then there exists a reduct s of P′t such that Pt →* s in LD.
2. ⟦P⟧ = ⟦Pt⟧ in C.

A corollary of Lemma 5.1 (1) is that our reduction strategy is strongly normalizing.

Corollary 5.2 (Strong Normalization). Any reduction sequence from any term is finite, and ends in a value.

5.2 Differentiable Programming Languages

Encouraged by calls [14, 19, 24] from the machine learning community, the development of reverse-mode AD programming languages has been an active research problem. Following Pearlmutter and Siskind [27], these languages usually treat reverse-mode AD as a meta-operator on programs.

First-order. Elliott [16] gives a categorical presentation of reverse-mode AD. Using a functor over Cartesian categories, he presents a neat implementation of reverse-mode AD. As is well-known, conditionals do not behave well with smoothness [6]; nor do loops and recursion. Abadi and Plotkin [2] address this problem via a first-order language with conditionals, recursively defined functions, and a construct for reverse-mode AD. Using real analysis, they prove the coincidence of operational and denotational semantics. To our knowledge, these treatments of reverse-mode AD are restricted to first-order functions.

Towards higher-order. The first work that extends reverse-mode AD to higher orders is by Pearlmutter and Siskind [27]; they use a non-compositional program transformation to implement reverse-mode AD.

Inspired by Wang et al. [32, 33], Brunel et al. [10] study a simply-typed λ-calculus augmented with a notion of linear negation type. Though our dual type may resemble their linear negation, they are actually quite different. In fact, our work can be viewed as providing a positive answer to the last paragraph of [10, Sec. 7], where the authors address the relation between their work and differential lambda-calculus. They describe a "naïve" approach of expressing reverse-mode AD in differential lambda-calculus, in the sense that it suffers from "expression swell", which our approach does not (see Example 3.7). Moreover, Brunel et al. use a program transformation to perform reverse-mode AD, whereas we use a first-class differential operator. Brunel et al. [10] prove correctness for performing reverse-mode AD on real-valued functions (Theorem 5.6, Corollary 5.7 in [10]), whereas we allow any (higher-order) abstraction to be the argument of the pullback term and prove that the result of the reduction of such a pullback term is exactly the derivative of the abstraction (Corollary 4.8).

Building on Elliott [16]'s categorical presentation of reverse-mode AD, and Pearlmutter and Siskind [27]'s idea of differentiating higher-order functions, Vytiniotis et al. [31] developed an implementation of a simply-typed differentiable programming language.

However, all these treatments are not purely higher-order, in the sense that their differential operator can only compute the derivative of an "end to end" first-order program (which may be constructed using higher-order functions), but not the derivative of a higher-order function.

As far as we know, our work gives the first implementation of reverse-mode AD in a higher-order programming language that directly computes the derivative of higher-order functions using reverse-mode AD (Corollary 4.8 (2)).

6 Conclusion and Future Directions

After outlining the mathematical foundation of reverse-mode AD as the pullback of differential 1-forms (Section 2.2), we presented a simple higher-order programming language with an explicit differential operator, (Ω (λx.P)) · S (Subsection 3.1), and a call-by-value reduction strategy to divide (A-reductions in Subsection 3.2), conquer (pullback reductions in Subsection 3.3) and combine (Subsection 3.4) the term ((Ω (λx.P)) · ω) S, such that its reduction exactly mimics reverse-mode AD. Examples are given to illustrate that our reduction is faithful to reverse-mode AD. Moreover, we showed how our reduction can be adapted to a CPS evaluation (Subsection 3.5).

We showed (in Section 4) that any differential λ-category that satisfies the Hahn-Banach Separation Theorem is a sound model of our language (Theorem 4.6), and how our reduction precisely captures the notion of reverse-mode AD, in both first-order and higher-order settings (Corollary 4.8).

Future Directions. An interesting direction is to extend our language with probability, so that it can serve as a compiler intermediate representation for "deep" probabilistic frameworks such as Edward [29] and Pyro [30]. Inference algorithms that require the computation of gradients, such as Hamiltonian Monte Carlo and variational inference, on which Edward and Pyro rely, can be expressed in such a language, allowing us to prove correctness.

References
[1] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek Gordon Murray, Benoit Steiner, Paul A. Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2016. TensorFlow: A System for Large-Scale Machine Learning. In 12th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2016, Savannah, GA, USA, November 2-4, 2016. 265-283. https://www.usenix.org/conference/osdi16/technical-sessions/presentation/abadi
[2] Martín Abadi and Gordon D. Plotkin. 2020. A simple differentiable programming language. PACMPL 4 (2020), 38:1-38:28. https://doi.org/10.1145/3371106
[3] Rami Al-Rfou, Guillaume Alain, Amjad Almahairi, et al. 2016. Theano: A Python framework for fast computation of mathematical expressions. CoRR abs/1605.02688 (2016). arXiv:1605.02688 http://arxiv.org/abs/1605.02688
[4] F. Bauer. 1974. Computational Graphs and Rounding Error. SIAM J. Numer. Anal. 11, 1 (1974), 87-96. https://doi.org/10.1137/0711010
[5] Atilim Gunes Baydin, Barak A. Pearlmutter, Alexey Andreyevich Radul, and Jeffrey Mark Siskind. 2017. Automatic Differentiation in Machine Learning: a Survey. J. Mach. Learn. Res. 18 (2017), 153:1-153:43. http://jmlr.org/papers/v18/17-468.html
[6] Thomas Beck and Herbert Fischer. 1994. The if-problem in automatic differentiation. J. Comput. Appl. Math. 50, 1 (1994), 119-131. https://doi.org/10.1016/0377-0427(94)90294-1
[7] Michael Betancourt. 2018. A geometric theory of higher-order automatic differentiation. arXiv preprint arXiv:1812.11592 (2018).
[8] Richard Blute, Thomas Ehrhard, and Christine Tasson. 2010. A convenient differential category. CoRR abs/1006.3140 (2010). arXiv:1006.3140 http://arxiv.org/abs/1006.3140
[9] Richard F. Blute, J. Robin B. Cockett, and Robert A. G. Seely. 2009. Cartesian differential categories. Theory and Applications of Categories 22, 23 (2009), 622-672.
[10] Alois Brunel, Damiano Mazza, and Michele Pagani. 2019. Backpropagation in the Simply Typed Lambda-calculus with Linear Negation. CoRR abs/1909.13768 (2019). arXiv:1909.13768 http://arxiv.org/abs/1909.13768
[11] Antonio Bucciarelli, Thomas Ehrhard, and Giulio Manzonetto. 2010. Categorical Models for Simply Typed Resource Calculi. Electr. Notes Theor. Comput. Sci. 265 (2010), 213-230. https://doi.org/10.1016/j.entcs.2010.08.013
[12] Alonzo Church. 1965. The Calculi of Lambda-Conversion. New York: Kraus Reprint Corporation.
[13] J. Robin B. Cockett and Geoff S. H. Cruttwell. 2014. Differential Structure, Tangent Structure, and SDG. Applied Categorical Structures 22, 2 (2014), 331-417. https://doi.org/10.1007/s10485-013-9312-0
[14] David Dalrymple. 2016. 2016: What do you consider the most interesting recent [scientific] news? What makes it important? https://www.edge.org/response-detail/26794. (2016). Accessed: 2020-01-07.
[15] Thomas Ehrhard and Laurent Regnier. 2003. The differential lambda-calculus. Theor. Comput. Sci. 309, 1-3 (2003), 1-41. https://doi.org/10.1016/S0304-3975(03)00392-X
[16] Conal Elliott. 2018. The simple essence of automatic differentiation. PACMPL 2, ICFP (2018), 70:1-70:29. https://doi.org/10.1145/3236765
[17] Alfred Frölicher and Andreas Kriegl. 1988. Linear spaces and differentiation theory. Chichester: Wiley.
[18] Philipp H. W. Hoffmann. 2016. A Hitchhiker's Guide to Automatic Differentiation. Numerical Algorithms 72, 3 (01 Jul 2016), 775-811. https://doi.org/10.1007/s11075-015-0067-6
[19] Yann LeCun. 2018. Deep Learning est mort. Vive Differentiable Programming! https://www.facebook.com/yann.lecun/posts/10155003011462143. (2018). Accessed: 2020-01-07.
[20] Dougal Maclaurin, David Duvenaud, and Ryan P. Adams. 2015. Autograd: Effortless Gradients in Numpy. Presented in AutoML Workshop, ICML.
[21] Giulio Manzonetto. 2012. What is a categorical model of the differential and the resource λ-calculi? Mathematical Structures in Computer Science 22, 3 (2012), 451-520. https://doi.org/10.1017/S0960129511000594
[22] Oleksandr Manzyuk. 2012. A Simply Typed λ-Calculus of Forward Automatic Differentiation. Electr. Notes Theor. Comput. Sci. 286 (2012), 257-272. https://doi.org/10.1016/j.entcs.2012.08.017


[23] Peter W. Michor and Andreas Kriegl. 1997. The convenient setting of global analysis. Providence, R.I.: American Mathematical Society.
[24] Christopher Olah. 2015. Neural Networks, Types, and Functional Programming. http://colah.github.io/posts/2015-09-NN-Types-FP/. (2015). Accessed: 2020-01-07.
[25] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Yang, Zach DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An Imperative Style, High-Performance Deep Learning Library. CoRR abs/1912.01703 (2019). arXiv:1912.01703 http://arxiv.org/abs/1912.01703
[26] Barak A. Pearlmutter. 2019. A Nuts-and-Bolts Differential Geometric Perspective on Automatic Differentiation. Presented in Languages for Inference Workshop, Cascais, Portugal.
[27] Barak A. Pearlmutter and Jeffrey Mark Siskind. 2008. Reverse-mode AD in a functional framework: Lambda the ultimate backpropagator. ACM Trans. Program. Lang. Syst. 30, 2 (2008), 7:1–7:36. https://doi.org/10.1145/1330017.1330018
[28] Amr Sabry and Matthias Felleisen. 1992. Reasoning About Programs in Continuation-Passing Style. In Proceedings of the Conference on Lisp and Functional Programming, LFP 1992, San Francisco, California, USA, 22-24 June 1992. ACM, 288–298. https://doi.org/10.1145/141471.141563
[29] Dustin Tran, Matthew D. Hoffman, Rif A. Saurous, Eugene Brevdo, Kevin Murphy, and David M. Blei. 2017. Deep Probabilistic Programming. CoRR abs/1701.03757 (2017). arXiv:1701.03757 http://arxiv.org/abs/1701.03757
[30] Uber. 2017. Pyro. (2017). Retrieved Nov 2018 from http://pyro.ai/
[31] Dimitrios Vytiniotis, Dan Belov, Richard Wei, Gordon Plotkin, and Martin Abadi. 2019. The Differentiable Curry. Presented in Program Transformations for Machine Learning Workshop, NeurIPS, Vancouver, Canada.
[32] Fei Wang, James M. Decker, Xilun Wu, Grégory M. Essertel, and Tiark Rompf. 2018. Backpropagation with Callbacks: Foundations for Efficient and Expressive Differentiable Programming. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, 3-8 December 2018, Montréal, Canada, Samy Bengio, Hanna M. Wallach, Hugo Larochelle, Kristen Grauman, Nicolò Cesa-Bianchi, and Roman Garnett (Eds.). 10201–10212. http://papers.nips.cc/book/advances-in-neural-information-processing-systems-31-2018
[33] Fei Wang, Daniel Zheng, James M. Decker, Xilun Wu, Grégory M. Essertel, and Tiark Rompf. 2019. Demystifying differentiable programming: shift/reset the penultimate backpropagator. PACMPL 3, ICFP (2019), 96:1–96:31. https://doi.org/10.1145/3341700
[34] R. E. Wengert. 1964. A simple automatic derivative evaluation program. Commun. ACM 7, 8 (1964), 463–464. https://doi.org/10.1145/355586.364791


Appendix

A Examples

A.1 Simple Example
We focus on how to compute the derivative of f : ⟨x,y⟩ ↦ ((x+1)(2x+y²))² at ⟨1,3⟩ by different modes of AD. First f is decomposed into elementary functions as

  R² --g--> R² --*--> R --(−)²--> R,   where g(⟨x,y⟩) := ⟨x+1, 2x+y²⟩.

Figure 4 summarizes the iterations of the different modes of AD. Now we show how Section 3 tells us how to perform reverse-mode AD on f.

Term  Assuming g, mult, pow2 ∈ F, we can define the following term in the language:

  ⊢ (Ω λ⟨x,y⟩.pow2(mult(g(⟨x,y⟩)))) · (Ω 1*) ⟨1,3⟩ : R*.

This term is the application of the pullback Ω(f)(λx.1*) to the point ⟨1,3⟩, which is exactly the Jacobian of f at ⟨1,3⟩.

Administrative Reduction  We decompose the term pow2(mult(g(⟨x,y⟩))), via administrative reduction, into a let series of elementary terms:

  pow2(mult(g(⟨x,y⟩))) −→A L ≡ let z1 = ⟨x,y⟩; z2 = g(z1); z3 = mult(z2); z4 = pow2(z3) in z4.

This is reminiscent of the decomposition of f into R² --g--> R² --*--> R --(−)²--> R before performing AD.
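To make the decomposition concrete, here is a minimal Python sketch (our own illustration, not part of the paper's calculus) of the elementary functions and the forward evaluation of f at ⟨1,3⟩:

```python
# A sketch (ours) of the decomposition f = pow2 . mult . g at <1,3>.
def g(x, y):
    return (x + 1, 2 * x + y * y)   # g<x,y> = <x+1, 2x+y^2>

def mult(a, b):
    return a * b                    # mult<a,b> = a*b

def pow2(z):
    return z * z                    # pow2(z) = z^2

a, b = g(1.0, 3.0)                  # <2, 11>
z = mult(a, b)                      # 22
print(pow2(z))                      # 484.0 = f<1,3>
```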

Splitting the Omega  Now, via Reductions 7 and 8, (Ω λ⟨x,y⟩.L) · ω is reduced to a series of pullbacks along elementary terms:

  (Ω λ⟨x,y⟩.⟨⟨x,y⟩, ⟨x,y⟩⟩) ·
    ((Ω λ⟨⟨x,y⟩, z1⟩.⟨⟨x,y⟩, z1, g(z1)⟩) ·
      ((Ω λ⟨⟨x,y⟩, z1, z2⟩.⟨⟨x,y⟩, z1, z2, mult(z2)⟩) ·
        ((Ω λ⟨⟨x,y⟩, z1, z2, z3⟩.pow2(z3)) · ω)))

Pullback Reduction  We showed that, via A-reductions and Reductions 7 and 8, (Ω λ⟨x,y⟩.pow2(mult(g(⟨x,y⟩)))) · ω is reduced to the series of pullbacks above. We now show how this series is reduced when applied to ⟨1,3⟩. Each Ω-pullback in turn is applied to the forward value computed so far, and reduces to a dual map recording the Jacobian of its elementary function at that value:

  ((Ω λ⟨x,y⟩.⟨⟨x,y⟩, ⟨x,y⟩⟩) · ⋯ · ((Ω λ⟨⟨x,y⟩, z1, z2, z3⟩.pow2(z3)) · ω)) ⟨1,3⟩
  −→* (λ⟨v1,v2⟩.⟨⟨v1,v2⟩, ⟨v1,v2⟩⟩)* ·
        (((Ω λ⟨⟨x,y⟩, z1⟩.⟨⟨x,y⟩, z1, g(z1)⟩) · ⋯ · ω) ⟨⟨1,3⟩, ⟨1,3⟩⟩)
  −→* (λ⟨v1,v2⟩.⟨⟨v1,v2⟩, ⟨v1,v2⟩⟩)* ·
        (λ⟨⟨v1,v2⟩, v3⟩.⟨⟨v1,v2⟩, v3, (J g · v3) ⟨1,3⟩⟩)* ·
          (((Ω λ⟨⟨x,y⟩, z1, z2⟩.⟨⟨x,y⟩, z1, z2, mult(z2)⟩) · ⋯ · ω) ⟨⟨1,3⟩, ⟨1,3⟩, ⟨2,11⟩⟩)
  −→* (λ⟨v1,v2⟩.⟨⟨v1,v2⟩, ⟨v1,v2⟩⟩)* ·
        (λ⟨⟨v1,v2⟩, v3⟩.⟨⟨v1,v2⟩, v3, (J g · v3) ⟨1,3⟩⟩)* ·
          (λ⟨⟨v1,v2⟩, v3, v4⟩.⟨⟨v1,v2⟩, v3, v4, (J mult · v4) ⟨2,11⟩⟩)* ·
            (λ⟨⟨v1,v2⟩, v3, v4, v5⟩.(J pow2 · v5) 22)* · (ω 484)    (⋆)

Notice how this is reminiscent of the forward phase of reverse-mode AD performed on f : ⟨x,y⟩ ↦ ((x+1)(2x+y²))² at ⟨1,3⟩ considered in Figure 4: the intermediate values ⟨2,11⟩, 22 and 484 are computed left to right, while the Jacobians of g, mult and pow2 at those values are recorded for the reverse phase. Moreover, we used the reduction f(r) −→ f(r), which evaluates a function symbol applied to numerals, a couple of times in the argument position of an application. This is to avoid expression swell: note that 1 + 1 is only evaluated once in (⋆), even when the result is used in various computations.
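The evaluate-once behaviour of a let-binding has a direct Python analogue (our illustration only); a side-effect counter makes the difference from naive re-evaluation visible:

```python
# Our illustration of why let-bindings avoid expression swell: the bound
# expression is evaluated once, however many times its value is used.
count = 0
def plus_one(x):
    global count
    count += 1
    return x + 1

plus_one(1.0) * plus_one(1.0)   # without sharing
print(count)                    # 2 evaluations

count = 0
z1 = plus_one(1.0)              # let z1 = 1 + 1 in z1 * z1
z1 * z1                         # with sharing
print(count)                    # 1 evaluation
```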

Naïve Forward Mode:
  ⟨⟨1,3⟩ | [[1,0],[0,1]]⟩ --g--> ⟨⟨2,11⟩ | [[1,0],[2,6]]⟩ --*--> ⟨22 | [15 12]⟩ --(−)²--> ⟨484 | [660 528]⟩

Forward Mode:
  ⟨⟨1,3⟩ | [1; 0]⟩ --g--> ⟨⟨2,11⟩ | [1; 2]⟩ --*--> ⟨22 | [15]⟩ --(−)²--> ⟨484 | [660]⟩

Reverse Mode:
  Forward Phase: ⟨1,3⟩ --g--> ⟨2,11⟩ --*--> 22 --(−)²--> 484
  Reverse Phase: [660; 528] <--g-- [484; 88] <--*-- [44] <--(−)²-- [1]

Pullback:
  (Ω(g) ∘ Ω(∗) ∘ Ω((−)²)) (λx.[1]) (⟨1,3⟩)
  = (J(g)(⟨1,3⟩))* ((Ω(∗) ∘ Ω((−)²)) (λx.[1]) (⟨2,11⟩))
  = (J(g)(⟨1,3⟩))* (J(∗)(⟨2,11⟩))* ((Ω((−)²)) (λx.[1]) (22))
  = (J(g)(⟨1,3⟩))* (J(∗)(⟨2,11⟩))* (J((−)²)(22))* ((λx.[1]) (484))
  = (J(g)(⟨1,3⟩))* (J(∗)(⟨2,11⟩))* [44]
  = (J(g)(⟨1,3⟩))* [484; 88]
  = [660; 528]

Figure 4. Different modes of automatic differentiation performed on the function f : ⟨x,y⟩ ↦ ((x+1)(2x+y²))² at ⟨1,3⟩, after f is decomposed into elementary functions: R² --g--> R² --*--> R --(−)²--> R, where g(⟨x,y⟩) := ⟨x+1, 2x+y²⟩.
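As a numerical cross-check of Figure 4, the naïve forward-mode row can be replayed with explicit Jacobian matrices in numpy (a sketch of ours, outside the calculus):

```python
import numpy as np

# Push full Jacobians through the chain rule, left to right.
x, y = 1.0, 3.0
Jg    = np.array([[1.0, 0.0],
                  [2.0, 2 * y]])      # J g at <1,3>     = [[1,0],[2,6]]
a, b  = x + 1, 2 * x + y * y          # g<1,3> = <2,11>
Jmult = np.array([[b, a]])            # J mult at <2,11> = [11 2]
Jpow2 = np.array([[2 * a * b]])       # J pow2 at 22     = [44]

print(Jpow2 @ Jmult @ Jg)             # [[660. 528.]]
```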

Combine  Replacing ω by Ω 1* ≡ λx.1*, we have shown so far that (Ω λ⟨x,y⟩.pow2(mult(g(⟨x,y⟩)))) · (Ω 1*) ⟨1,3⟩ is reduced to the series of dual maps in (⋆), with (ω 484) reduced to 1*. Now, via Reduction 5 and β-reduction, the dual maps are applied from right to left: 1* is pulled back through (J pow2 · −) 22 to [44], through (J mult · −) ⟨2,11⟩ to [484; 88], and finally through (J g · −) ⟨1,3⟩, so that the whole term reduces to

  −→* [660; 528].

Notice how this mimics the reverse phase of reverse-mode AD on f : ⟨x,y⟩ ↦ ((x+1)(2x+y²))² at ⟨1,3⟩ considered in Figure 4.
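The reverse phase itself is just the transposed Jacobians applied right to left; a numpy sketch (ours, under the same Jacobians as above):

```python
import numpy as np

Jg    = np.array([[1.0, 0.0], [2.0, 6.0]])   # at <1,3>
Jmult = np.array([[11.0, 2.0]])              # at <2,11>
Jpow2 = np.array([[44.0]])                   # at 22

# Pull the 1-form [1] back through the transposed Jacobians.
v = np.array([1.0])
v = Jpow2.T @ v      # [44]
v = Jmult.T @ v      # [484  88]
v = Jg.T @ v         # [660 528]
print(v)
```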

A.2 Sum Example
Consider the function that takes a list of real numbers and returns the sum of the elements of the list. We show how Section 3 tells us how to perform reverse-mode AD on such a higher-order function.

Term  Using the standard Church encoding of lists, i.e.

  List(X) ≡ (X → D → D) → (D → D)
  [x1, x2, ..., xn] ≡ λf d. f xn (... (f x2 (f x1 d)))

for some dummy type D, sum : List(R) → R can be expressed in our language described in Section 3 as λl. l (λxy.x+y) 0. Hence the derivative of sum at the list [7, −1] can be expressed as

  {ω : Ω(List(R))} ⊢ (Ω(sum)) · ω [7, −1] : R*.

Administrative Reduction  We first decompose the body of the sum : List(R) → R term, considered in Example 3.2, i.e. l (λxy.x+y) 0, via the administrative reduction described in Subsection 3.2:

  l (λxy.x+y) 0
  −→A (let z1' = l in z1') (λxy.let z2' = x+y in z2') (let z3' = 0 in z3')
  −→A (let z1 = l; z2 = λxy.(let z2' = x+y in z2'); z3 = z1 z2 in z3) (let z3' = 0 in z3')
  −→A let z1 = l; z2 = λxy.(let z2' = x+y in z2'); z3 = z1 z2; z4 = 0; z5 = z3 z4 in z5

Splitting the Omega  After the A-reductions, where l (λxy.x+y) 0 is A-reduced to a let series, we reduce (Ω (λl. l (λxy.x+y) 0)) · ω via Reductions 7 and 8:

  (Ω (λl. l (λxy.x+y) 0)) · ω
  −→A (Ω λl. let z1 = l; z2 = λxy.(let z2' = x+y in z2'); z3 = z1 z2; z4 = 0; z5 = z3 z4 in z5) · ω
  −→* (Ω λl.⟨l, l⟩) ·
        ((Ω λ⟨l, z1⟩.⟨l, z1, λxy.L⟩) ·
          ((Ω λ⟨l, z1, z2⟩.⟨l, z1, z2, z1 z2⟩) ·
            ((Ω λ⟨l, z1, z2, z3⟩.⟨l, z1, z2, z3, 0⟩) ·
              ((Ω λ⟨l, z1, z2, z3, z4⟩.z3 z4) · ω))))

Pullback Reduction  First, Figure 5 shows that (Ω [7, −1]) · ω' (λxy.L) is reduced to

  (λv. v −1 (+(⟨7,d⟩)) + (J+(⟨−1,−⟩) · (v 7 d)) (+⟨7,d⟩))* · ω'.
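The Church encoding is directly runnable; a small Python rendering (ours) of lists-as-folds and of sum:

```python
# A Church-encoded list is its own fold (our Python rendering):
# [x1,...,xn] = lambda f, d: f(xn, ... f(x2, f(x1, d)) ...)
def church(xs):
    def lst(f, d):
        acc = d
        for x in xs:
            acc = f(x, acc)
        return acc
    return lst

sum_ = lambda l: l(lambda x, y: x + y, 0.0)
print(sum_(church([7.0, -1.0])))   # 6.0
```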

  (Ω [7, −1]) · ω' (λxy.L) ≡ (Ω λf d. f −1 (f 7 d)) · ω' (λxy.L)
  −→A (Ω λf d. let z1 = f; z2 = −1; z3 = z1 z2; z4 = f; z5 = 7; z6 = z4 z5; z7 = d; z8 = z6 z7; z9 = z3 z8 in z9) · ω' (λxy.L)

Splitting this Ω along the let series (Reductions 7 and 8) into nine pullbacks along elementary terms, and reducing each pullback applied to ⟨λxy.L, d⟩ to a dual map, yields

  −→* (λv.⟨v, v⟩)* ·
        (λ⟨v, v1⟩.⟨v, v1, 0⟩)* ·
        (λ⟨v, v1, v2⟩.⟨v, v1, v2, v1 −1 + λy.(J+ · ⟨v2, 0⟩) ⟨−1, y⟩⟩)* ·
        (λ⟨v, v1, v2, v3⟩.⟨v, v1, v2, v3, v⟩)* ·
        (λ⟨v, v1, v2, v3, v4⟩.⟨v, v1, v2, v3, v4, 0⟩)* ·
        (λ⟨v, v1, v2, v3, v4, v5⟩.⟨v, v1, v2, v3, v4, v5, v4 7 + λy.(J+ · ⟨v5, 0⟩) ⟨7, y⟩⟩)* ·
        (λ⟨v, v1, v2, v3, v4, v5, v6⟩.⟨v, v1, v2, v3, v4, v5, v6, 0⟩)* ·
        (λ⟨v, v1, ..., v7⟩.⟨v, v1, ..., v7, v6 d + (J+(⟨7,−⟩) · v7) d⟩)* ·
        (λ⟨v, v1, ..., v8⟩.⟨v, v1, ..., v8, v3 (+(⟨7,d⟩)) + (J+(⟨−1,−⟩) · v8) (+(⟨7,d⟩))⟩)* · (ω' A)
  −→*A (λv. v −1 (+(⟨7,d⟩)) + (J+(⟨−1,−⟩) · (v 7 d)) (+⟨7,d⟩))* · ω'

where A ≡ ⟨λxy.L, λxy.L, −1, λy.+(⟨−1,y⟩), λxy.L, 7, λy.+(⟨7,y⟩), d, +(⟨7,d⟩)⟩.

Figure 5. Reduction of (Ω [7, −1]) · ω' (λxy.L)
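Since sum is linear, its derivative at any list is the all-ones covector; a central finite-difference check in Python (our sketch, reusing church and sum_ from above) confirms this at [7, −1]:

```python
# Finite-difference gradient check (ours) against the Church-encoded sum.
def grad_fd(f, xs, h=1e-6):
    grads = []
    for i in range(len(xs)):
        up, dn = xs[:], xs[:]
        up[i] += h
        dn[i] -= h
        grads.append((f(up) - f(dn)) / (2 * h))
    return grads

f = lambda xs: sum_(church(xs))      # church and sum_ as sketched above
print(grad_fd(f, [7.0, -1.0]))       # approximately [1.0, 1.0]
```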

Then, we reduce (Ω (sum)) · ω [7, −1] as follows.

  ((Ω λl.⟨l, l⟩) ·
    (Ω λ⟨l, z1⟩.⟨l, z1, λxy.L⟩) ·
    (Ω λ⟨l, z1, z2⟩.⟨l, z1, z2, z1 z2⟩) ·
    (Ω λ⟨l, z1, z2, z3⟩.⟨l, z1, z2, z3, 0⟩) ·
    (Ω λ⟨l, z1, z2, z3, z4⟩.z3 z4) · ω) [7, −1]
  −→* (λv.⟨v, v⟩)* ·
        (λ⟨v, v1⟩.⟨v, v1, 0⟩)* ·
        (λ⟨v, v1, v2⟩.⟨v, v1, v2, v1 (λxy.L) + v2 −1 (+(⟨7,d⟩)) + (J+(⟨−1,−⟩) · (v2 7 d)) (+⟨7,d⟩)⟩)* ·
        (λ⟨v, v1, v2, v3⟩.⟨v, v1, v2, v3, 0⟩)* ·
        (λ⟨v, v1, v2, v3, v4⟩.⟨v, v1, v2, v3, v4, v3 0 + (J+(⟨−1, +(⟨7,−⟩)⟩) · v4) 0⟩)* · (ω B)
  −→* (λv. v (λxy.L) 0)* · (ω B)

where B ≡ ⟨[7, −1], [7, −1], λxy.L, λd.+(⟨−1, +(⟨7,d⟩)⟩), 0, 6⟩.

Hence, λv. v (λxy.L) 0 is the derivative of sum ≡ λl. l (λxy.L) 0 at [7, −1]. This sequence of reductions tells us how the derivative of sum at [7, −1] can be computed using reverse-mode AD.

B Administrative Reduction
Elementary terms E, let series L, A-contexts CA and A-redexes rA are defined as follows.

  E ::= 0 | z1 + z2 | z | λx.L | z1 z2 | πi(z) | ⟨z1, z2⟩ | r | f(z) | J f · z | (λx.L)* · z | (Ω λx.L) · z | r*
  L ::= let z = E in L | let z = E in z
  CA ::= [] | CA + P | L + CA | λz.CA | CA P | L CA | πi(CA) | ⟨CA, S⟩ | ⟨L, CA⟩ | f(CA) | J f · CA | (λx.CA)* · S
         | (λx.L)* · CA | (Ω (λx.CA)) · S | (Ω (λx.L)) · CA
  rA ::= 0 | L1 + L2 | x | λz.L | L1 L2 | πi(L) | ⟨L1, L2⟩ | r | f(L) | J f · L | (λx.L1)* · L2 | (Ω λx.L1) · L2 | r*

Lemma B.1. Every pullback term P can be expressed as either CA[rA], for some unique A-context CA and A-redex rA, or a let series of elementary terms L.

An A-redex rA is reduced to a let series L as follows.

  0 −→A let x1 = 0 in x1
  L1 + L2 −→A let x1 = L1; x2 = L2; x3 = x1 + x2 in x3
  x −→A let x1 = x in x1
  λz.L −→A let x1 = λz.L in x1
  L1 L2 −→A let x1 = L1; x2 = L2; x3 = x1 x2 in x3
  πi(L) −→A let x1 = L; x2 = πi(x1) in x2
  ⟨L1, L2⟩ −→A let x1 = L1; x2 = L2; x3 = ⟨x1, x2⟩ in x3
  r −→A let x1 = r in x1
  f(L) −→A let x1 = L; x2 = f(x1) in x2
  J f · L −→A let x1 = L; x2 = J f · x1 in x2
  (λx.L1)* · L2 −→A let x1 = L2; x2 = (λx.L1)* · x1 in x2
  (Ω λx.L1) · L2 −→A let x1 = L2; x2 = (Ω (λx.L1)) · x1 in x2
  r* −→A let x1 = r* in x1

Any pullback term P which can be expressed as CA[rA] can be A-reduced to CA[L], where rA −→A L.
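Administrative reduction is essentially conversion to A-normal form, in the spirit of [28]; a compact Python sketch (ours, for a toy fragment with constants, variables, sums and applications only) shows the shape of the rules:

```python
import itertools
fresh = (f"x{i}" for i in itertools.count(1))

# Toy terms: ("const", c) | ("var", x) | ("add", t1, t2) | ("app", t1, t2).
# anf returns (bindings, atom): a let series ending in a variable.
def anf(t):
    if t[0] in ("const", "var"):
        x = next(fresh)
        return [(x, t)], ("var", x)          # e.g.  0 -->_A let x1 = 0 in x1
    binds1, a1 = anf(t[1])
    binds2, a2 = anf(t[2])
    x = next(fresh)                          # e.g.  L1 + L2 -->_A
    return binds1 + binds2 + [(x, (t[0], a1, a2))], ("var", x)

binds, body = anf(("add", ("const", 0), ("app", ("var", "f"), ("const", 7))))
for x, e in binds:
    print(f"let {x} = {e};")
print("in", body)
```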

C Interpretation

  ⟦Γ ⊢ 0 : σ⟧γ = 0
  ⟦Γ ⊢ S + P : σ⟧γ = ⟦S⟧γ + ⟦P⟧γ
  ⟦Γ ∪ {x : σ} ⊢ x : σ⟧⟨γ, z⟩ = z
  ⟦Γ ⊢ λx.S : σ1 ⇒ σ2⟧γ = cur(⟦S⟧)γ
  ⟦Γ ⊢ S P : σ2⟧γ = (⟦S⟧γ)(⟦P⟧γ)
  ⟦Γ ⊢ πi(S) : σi⟧γ = πi(⟦S⟧γ)
  ⟦Γ ⊢ ⟨S1, S2⟩ : σ1 × σ2⟧γ = ⟨⟦S1⟧γ, ⟦S2⟧γ⟩
  ⟦Γ ⊢ r : R⟧γ = r
  ⟦Γ ⊢ f(P) : Rᵐ⟧γ = f(⟦P⟧γ)
  ⟦Γ ⊢ J f · S : Rᵐ ⇒ Rᵐ⟧γ = cur(D[f])(⟦S⟧γ)
  ⟦Γ ⊢ (λx.S1)* · S2 : σ1*⟧γ = λv.⟦S2⟧γ (⟦S1⟧⟨γ, v⟩)
  ⟦Γ ⊢ (Ω λx.P) · S : Ωσ1⟧γ = λxv.⟦S⟧γ (⟦P⟧⟨γ, x⟩) (D[cur(⟦P⟧)γ]⟨v, x⟩)
  ⟦Γ ⊢ [r1, ..., rn]* : Rⁿ*⟧γ = [v1, ..., vn] ↦ Σⁿᵢ₌₁ ri vi

Translation to Differential Lambda Calculus

  0t := 0        yt := y        rt := r
  (S + P)t := St + Pt
  (λy.S)t := λy.St
  (S P)t := St Pt
  (πi(S))t := πi(St)
  (⟨S1, S2⟩)t := ⟨(S1)t, (S2)t⟩
  (f(P))t := f(Pt)
  (J f · S)t := Df · (S)t
  ([r1, ..., rn]*)t := λv.Σⁿᵢ₌₁ fi(πi(v)),  where fi := ri × (−)
  ((λy.S1)* · S2)t := λv.(S2)t ((λy.(S1)t) v)
  ((Ω λy.P) · S)t := λxv.St ((λy.Pt) x) ((D(λy.Pt) · v) x)

D Extended Differential Lambda-Calculus
Differential substitution for the extended differential λ-terms is defined as follows:

  ∂πi(s)/∂x · T ≡ πi(∂s/∂x · T)
  ∂⟨s1, s2⟩/∂x · T ≡ ⟨∂s1/∂x · T, ∂s2/∂x · T⟩
  ∂r/∂x · T ≡ 0
  ∂f(s)/∂x · T ≡ (Df · (∂s/∂x · T)) s
  ∂(Df · s)/∂x · T ≡ Df · (∂s/∂x · T)

Consider the term f(s). There are no linear occurrences of x in f. Hence, we ignore f and perform differential substitution on s directly, obtaining (Df · (∂s/∂x · T)) s.

We can interpret the extended differential λ-calculus in a differential λ-category, which is the categorical semantics of the differential λ-calculus. Hence, what is left to show is the interpretations of the extended terms:

  ⟦r⟧ = λγ.r
  ⟦f(s)⟧ = f ∘ ⟦s⟧
  ⟦Df · s⟧ = λγx.D[f]⟨⟦s⟧γ, x⟩
  ⟦πi(s)⟧ = πi ∘ ⟦s⟧
  ⟦⟨s1, s2⟩⟧ = ⟨⟦s1⟧, ⟦s2⟧⟩

E Proofs
Proposition E.1. The derivative of any constant morphism f in a differential λ-category is 0, i.e. D[f] = 0.

Proof. A constant morphism f : A → B that maps all of A to b ∈ B can be written as f = (λz.b) ∘ 0, where 0 : A → B and λz.b : B → B. So, by [CD1,2,5], we have D[f] = D[(λz.b) ∘ 0] = D[λz.b] ∘ ⟨D[0], 0 ∘ π2⟩ = D[λz.b] ∘ ⟨0, 0 ∘ π2⟩ = 0. □

Lemma E.2. Con∞ is a differential λ-category with the differential operator D[f]⟨v, x⟩ := J(f)(x)(v) = lim_{t→0} (f(x + tv) − f(x))/t.

Proof. [17, 23] have shown that Con∞ is Cartesian closed, and [8] have shown that Con∞ is a Cartesian differential category. What is left to show is that λ(−) preserves the additive structure and that D[−] satisfies the (D-curry) rule, i.e. D[λ(f)] = λ(D[f] ∘ ⟨π1 × 0, π2 × Id⟩).

We first show that λ(−) is additive, i.e. λ(f + g) = λ(f) + λ(g) and λ(0) = 0. Note that, for f, g, 0 : A × B → C, a ∈ A and b ∈ B, λ(f + g)(a)(b) = (f + g)⟨a, b⟩ = f⟨a, b⟩ + g⟨a, b⟩ = λ(f)(a)(b) + λ(g)(a)(b), and λ(0)(a)(b) = 0⟨a, b⟩ = 0 = 0(a)(b).

Now we show that D[−] satisfies the (D-curry) rule. Let f : A × B → C, v, x ∈ A and b ∈ B. Then

  D[λ(f)]⟨v, x⟩ b = lim_{t→0} (λ(f)(x + vt) b − λ(f)(x) b)/t
  = lim_{t→0} (f⟨x + vt, b⟩ − f⟨x, b⟩)/t
  = lim_{t→0} (f(⟨x, b⟩ + t⟨v, 0⟩) − f⟨x, b⟩)/t
  = D[f]⟨⟨v, 0⟩, ⟨x, b⟩⟩
  = (D[f] ∘ ⟨π1 × 0, π2 × Id⟩)⟨⟨v, x⟩, b⟩
  = λ(D[f] ∘ ⟨π1 × 0, π2 × Id⟩)⟨v, x⟩ b. □
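The limit defining D[f] in Lemma E.2 is easy to probe numerically; a Python sketch (ours) compares the difference quotient with the exact directional derivative of a smooth map:

```python
import numpy as np

# Probing D[f]<v,x> = lim_{t->0} (f(x + t v) - f(x)) / t  (our check only).
f  = lambda p: np.array([p[0] * p[1], np.sin(p[0])])
Jf = lambda p: np.array([[p[1], p[0]],
                         [np.cos(p[0]), 0.0]])   # exact Jacobian of f

x = np.array([1.0, 3.0])
v = np.array([0.5, -2.0])
for t in (1e-1, 1e-3, 1e-5):
    print(t, (f(x + t * v) - f(x)) / t)          # -> J(f)(x)(v) as t -> 0
print("exact:", Jf(x) @ v)
```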

Proposition E.3. Let E be a convenient vector space and x, y ∈ E be distinct elements of E. Then there exists a bornological linear map l : E → R that separates x and y, i.e. l(x) ≠ l(y).

Proof. This follows from the fact that a convenient vector space is separated. x ≠ y implies that x − y ≠ 0. Hence, by separation, there is a bornological linear map l : E → R such that l(x − y) ≠ 0. Since l is linear, we have l(x) − l(y) ≠ 0, which implies l(x) ≠ l(y). □

Lemma 4.4 (Linearity). Let Γ1 ∪ {x : σ1} ⊢ P1 : τ and Γ2 ⊢ P2 : σ*. Let γ1 ∈ ⟦Γ1⟧ and γ2 ∈ ⟦Γ2⟧. Then
1. if x ∈ lin(P1), then cur(⟦P1⟧)γ1 is linear, i.e. D[cur(⟦P1⟧)γ1] = (cur(⟦P1⟧)γ1) ∘ π1;
2. ⟦P2⟧γ2 is linear, i.e. D[⟦P2⟧γ2] = (⟦P2⟧γ2) ∘ π1.

Proof. Induction on the structure of P, using the following two induction hypotheses.
IH.1 If Γ1 ∪ {x : σ1} ⊢ P : τ and x ∈ lin(P), then for any γ1 ∈ ⟦Γ1⟧, cur(⟦P⟧)γ1 is linear, i.e. D[cur(⟦P⟧)γ1] = (cur(⟦P⟧)γ1) ∘ π1.
IH.2 If Γ2 ⊢ P : σ*, then for any γ2 ∈ ⟦Γ2⟧, ⟦P⟧γ2 is linear, i.e. D[⟦P⟧γ2] = (⟦P⟧γ2) ∘ π1.

(var) Say P ≡ x.
(1) If Γ1 ∪ {x : σ1} ⊢ x : σ1 and x ∈ lin(x), then D[cur(⟦x⟧)γ1] = D[Id] = π1 = Id ∘ π1 = (cur(⟦x⟧)γ1) ∘ π1.
(2) If Γ2 ⊢ x : σ*, then Γ2 = Γ3 ∪ {x : σ*}, so for any ⟨γ3, z⟩ ∈ ⟦Γ2⟧, z is linear and D[⟦x⟧⟨γ3, z⟩] = D[z] = z ∘ π1 = (⟦x⟧⟨γ3, z⟩) ∘ π1.

(dual) Say P ≡ (λx.S1)* · S2.
(1) Let Γ1 ∪ {x : σ1} ⊢ (λx.S1)* · S2 : τ and x ∈ lin((λx.S1)* · S2) := (lin(S1) \ FV(S2)) ∪ (lin(S2) \ FV(S1)). Let γ1 ∈ ⟦Γ1⟧ and write g : ⟨x, z⟩ ↦ ⟦S1⟧⟨γ1, x, z⟩. Since ⟦S2⟧⟨γ1, x⟩ is of a dual type, by IH.2,

  D[cur(⟦(λx.S1)* · S2⟧)γ1]⟨v, x⟩ = λz.( D[⟦S2⟧⟨γ1, −⟩]⟨v, x⟩ (g⟨x, z⟩) + ⟦S2⟧⟨γ1, x⟩ (D[g⟨−, z⟩]⟨v, x⟩) ).

Note that x can only be in either lin(S1) \ FV(S2) or lin(S2) \ FV(S1), but not both. Say x ∈ lin(S1) \ FV(S2). Then, by Proposition E.1, the first summand vanishes, and by IH.1,

  D[cur(⟦(λx.S1)* · S2⟧)γ1]⟨v, x⟩ = λz.⟦S2⟧⟨γ1, x⟩ (D[⟦S1⟧⟨γ1, −, z⟩]⟨v, x⟩) = λz.⟦S2⟧⟨γ1, v⟩ (⟦S1⟧⟨γ1, v, z⟩) = ⟦(λx.S1)* · S2⟧⟨γ1, v⟩ = ((cur(⟦(λx.S1)* · S2⟧)γ1) ∘ π1)⟨v, x⟩.

(2) Let Γ2 ⊢ (λx.S1)* · S2 : σ* and γ2 ∈ ⟦Γ2⟧. Then, by IH.1 and IH.2,

  D[⟦(λx.S1)* · S2⟧γ2] = D[(⟦S2⟧γ2) ∘ cur(⟦S1⟧)γ2] = D[⟦S2⟧γ2] ∘ ⟨D[cur(⟦S1⟧)γ2], (cur(⟦S1⟧)γ2) ∘ π2⟩ = (⟦S2⟧γ2) ∘ (cur(⟦S1⟧)γ2) ∘ π1 = (⟦(λx.S1)* · S2⟧γ2) ∘ π1.

All other cases are straightforward inductive proofs. □

Lemma 4.5 (Substitution). ⟦Γ ⊢ S[P/x] : τ⟧ = ⟦Γ ∪ {x : σ} ⊢ S : τ⟧ ∘ ⟨Id⟦Γ⟧, ⟦Γ ⊢ P : σ⟧⟩.

Proof. The only interesting cases are dual and pullback maps.
(dual) ((λx.S1)* · S2)[P'/y] ≡ (λx.S1[P'/y])* · S2[P'/y], and

  ⟦((λx.S1)* · S2)[P'/y]⟧γ = ⟦(λx.S1[P'/y])* · S2[P'/y]⟧γ = ⟦S2[P'/y]⟧γ ∘ cur(⟦S1[P'/y]⟧)γ = λx.⟦S2⟧⟨γ, ⟦P'⟧γ⟩ (⟦S1⟧⟨γ, ⟦P'⟧γ, x⟩)   (by IH)
  = ⟦(λx.S1)* · S2⟧⟨γ, ⟦P'⟧γ⟩.

(pb) ((Ω λx.P) · S)[P'/y] ≡ (Ω λx.P[P'/y]) · S[P'/y], and

  ⟦((Ω λx.P) · S)[P'/y]⟧γ = ⟦(Ω λx.P[P'/y]) · S[P'/y]⟧γ = λxv.⟦S⟧⟨γ, ⟦P'⟧γ⟩ (⟦P⟧⟨γ, x, ⟦P'⟧γ⟩) (D[⟦P⟧⟨γ, −, ⟦P'⟧γ⟩]⟨v, x⟩)   (by IH)
  = ⟦(Ω λx.P) · S⟧⟨γ, ⟦P'⟧γ⟩. □

Theorem 4.6 (Correctness of Reductions). Let Γ ⊢ P : σ.
1. P −→A P' implies ⟦P⟧ = ⟦P'⟧.
2. P −→ P' implies ⟦P⟧ = ⟦P'⟧.

Proof. 1. Easy induction on −→A.
2. Case analysis on the reductions of pullback terms. Let γ ∈ ⟦Γ⟧.
(1–4) ⟦(λx.S) V⟧ = ⟦S[V/x]⟧, ⟦πi(⟨V1, V2⟩)⟧ = ⟦Vi⟧, and the reductions for f(r) and J f applied to numerals are easily verified using the Substitution Lemma 4.5.

(5) Let J(f)(r) = [aij], for i = 1, ..., m and j = 1, ..., n, and r' = [r'1, ..., r'm]. Then

  ⟦(λv.J(f)(r)(v))* · r'*⟧γ = (J(f)(r))* (λv.Σᵐᵢ₌₁ r'i vi) = λv.Σᵐᵢ₌₁ Σⁿⱼ₌₁ r'i aij vj = λv.Σⁿⱼ₌₁ (Σᵐᵢ₌₁ r'i aij) vj = λv.Σⁿⱼ₌₁ ((J(f)(r))ᵀ × r')j vj = ⟦((J(f)(r))ᵀ × r')*⟧γ.

(6) Say Γ ∪ {v2 : σ2} ⊢ V2 : τ, and let Γ ∪ {v1 : σ1, v2 : σ2} ⊢ V2' : τ, where v1 is not a free variable in V2. Then

  ⟦(λv1.V1)* · ((λv2.V2)* · V3)⟧γ = ((cur(⟦V2⟧)γ) ∘ (cur(⟦V1⟧)γ))* (⟦V3⟧γ) = (v1 ↦ ⟦V2⟧⟨γ, ⟦V1⟧⟨γ, v1⟩⟩)* (⟦V3⟧γ) = (v1 ↦ ⟦V2'⟧⟨⟨γ, v1⟩, ⟦V1⟧⟨γ, v1⟩⟩)* (⟦V3⟧γ) = (v1 ↦ (⟦V2'⟧ ∘ ⟨Id, ⟦V1⟧⟩)⟨γ, v1⟩)* (⟦V3⟧γ) = (v1 ↦ ⟦V2'[V1/v2]⟧⟨γ, v1⟩)* (⟦V3⟧γ) = ⟦(λv1.V2'[V1/v2])* · V3⟧γ.

(7) Using the Substitution Lemma 4.5, ⟦(Ω (λy.let x = E in x)) · ω⟧ = ⟦(Ω (λy.E)) · ω⟧ follows immediately from ⟦λy.let x = E in x⟧ = cur(⟦let x = E in x⟧) = cur(⟦E⟧) = ⟦λy.E⟧.

(8) Consider (Ω (λy.let x = E in L)) · ω −→ (Ω (λy.⟨y, E⟩)) · ((Ω (λz.L̂)) · ω), where Γ ∪ {z : σ1 × σ2} ⊢ L̂ ≡ L[π1(z)/y][π2(z)/x] : τ. Then

  ⟦(Ω (λy.let x = E in L)) · ω⟧γ = Ω(cur(⟦let x = E in L⟧)γ)(⟦ω⟧γ) = Ω(cur(⟦L⟧ ∘ ⟨Id, ⟦E⟧⟩)γ)(⟦ω⟧γ) = Ω(s1 ↦ ⟦L⟧⟨⟨γ, s1⟩, ⟦E⟧⟨γ, s1⟩⟩)(⟦ω⟧γ) = Ω(s1 ↦ ⟦L̂⟧⟨γ, ⟨s1, ⟦E⟧⟨γ, s1⟩⟩⟩)(⟦ω⟧γ) = Ω(cur(⟦L̂⟧)γ ∘ ⟨Id⟦σ1⟧, cur(⟦E⟧)γ⟩)(⟦ω⟧γ) = Ω(⟨Id⟦σ1⟧, cur(⟦E⟧)γ⟩)(Ω(cur(⟦L̂⟧)γ)(⟦ω⟧γ)) = Ω(cur(⟦⟨y, E⟩⟧)γ)(⟦(Ω λz.L̂) · ω⟧γ) = ⟦(Ω λy.⟨y, E⟩) · ((Ω λz.L̂) · ω)⟧γ.

(9) Say y is not free in E and (Ω (λy.E)) · ω V −→ 0. Then

  ⟦((Ω λy.E) · ω) V⟧γ = (λxv.⟦ω⟧γ (⟦E⟧⟨γ, x⟩) (D[cur(⟦E⟧)γ]⟨v, x⟩))(⟦V⟧γ) = (λxv.⟦ω⟧γ (⟦E⟧⟨γ, x⟩) 0)(⟦V⟧γ) = (λxv.0)(⟦V⟧γ) = λv.0 = ⟦0⟧γ,

since cur(⟦E⟧)γ is a constant function and the derivative of any constant function is 0 by Proposition E.1.

(10) We present the proof of (10b), which leads to (10a): (Ω λy.yπi + yπj) · ω V −→ (λv.vπi + vπj)* · (ω (Vπi + Vπj)), and

  ⟦((Ω λy.yπi + yπj) · ω) V⟧γ = (λxv.⟦ω⟧γ (xπi + xπj) (vπi + vπj))(⟦V⟧γ) = λv.⟦ω⟧γ ((⟦V⟧γ)πi + (⟦V⟧γ)πj) (vπi + vπj) = λv.⟦ω (Vπi + Vπj)⟧γ (vπi + vπj) = ⟦(λv.vπi + vπj)* · (ω (Vπi + Vπj))⟧γ.

(11) (Ω λy.y) · ω −→ (λv.v)* · ω:

  ⟦((Ω λy.y) · ω) V⟧γ = (λxv.⟦ω⟧γ (⟦y⟧⟨γ, x⟩) (D[Id]⟨v, x⟩))(⟦V⟧γ) = λv.⟦ω⟧γ (⟦V⟧γ) v = λv.⟦ω V⟧γ v = ⟦(λv.v)* · (ω V)⟧γ.

(12) ((Ω λy.yπi) · ω) V −→ (λv.vπi)* · (ω Vπi):

  ⟦((Ω λy.yπi) · ω) V⟧γ = (λxv.⟦ω⟧γ (⟦yπi⟧⟨γ, x⟩) (D[πi]⟨v, x⟩))(⟦V⟧γ) = λv.⟦ω⟧γ (⟦Vπi⟧γ) vπi = ⟦(λv.vπi)* · (ω Vπi)⟧γ.

(13) We prove (13c), which leads to (13a) and (13b): ((Ω λy.⟨yπi, yπj⟩) · ω) V −→ (λv.⟨vπi, vπj⟩)* · (ω ⟨Vπi, Vπj⟩), and

  ⟦((Ω λy.⟨yπi, yπj⟩) · ω) V⟧γ = (λxv.⟦ω⟧γ ⟨xπi, xπj⟩ ⟨vπi, vπj⟩)(⟦V⟧γ) = λv.⟦ω⟧γ ⟨(⟦V⟧γ)πi, (⟦V⟧γ)πj⟩ ⟨vπi, vπj⟩ = λv.⟦ω ⟨Vπi, Vπj⟩⟧γ ⟨vπi, vπj⟩ = ⟦(λv.⟨vπi, vπj⟩)* · (ω ⟨Vπi, Vπj⟩)⟧γ.

(14) (Ω λy.J f · yπi) · ω V −→ (λv.J f · vπi)* · (ω (J f · Vπi)). By [CD3,4,5,6],

  D[λyz.D[f]⟨yπi, z⟩] = D[cur(D[f]) ∘ πi] = D[cur(D[f])] ∘ (πi × πi) = cur(D[D[f]] ∘ ⟨π1 × 0, π2 × Id⟩) ∘ (πi × πi) = cur(D[f] ∘ (π1 × Id)) ∘ (πi × πi).

Hence

  ⟦((Ω λy.J f · yπi) · ω) V⟧γ = λv.⟦ω⟧γ (λx.D[f]⟨⟦Vπi⟧γ, x⟩) ((cur(D[f] ∘ (π1 × Id)) ∘ (πi × πi))⟨v, ⟦V⟧γ⟩) = λv.⟦ω⟧γ (⟦J f · Vπi⟧γ) (λx.D[f]⟨vπi, x⟩) = ⟦(λv.J f · vπi)* · (ω (J f · Vπi))⟧γ.
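The identity in case (5), that pulling a covector back along J(f)(r) is multiplication by the transposed Jacobian, can be checked numerically; a numpy sketch (ours, with arbitrary test data):

```python
import numpy as np

# Case (5): r' composed with v |-> J v  equals  v |-> (J^T r') . v
rng = np.random.default_rng(0)
J  = rng.standard_normal((3, 4))   # stands in for J(f)(r), an m x n Jacobian
rp = rng.standard_normal(3)        # stands in for the covector r'
v  = rng.standard_normal(4)

lhs = rp @ (J @ v)                 # (lambda v. J(f)(r)(v))* . r'*  at v
rhs = (J.T @ rp) @ v               # ((J(f)(r))^T x r')*            at v
print(np.isclose(lhs, rhs))       # True
```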

(15) ((Ω λy.f(yπi)) · ω) V −→ (λv.(J f · vπi) Vπi)* · (ω (f(Vπi))):

  ⟦((Ω λy.f(yπi)) · ω) V⟧γ = (λxv.⟦ω⟧γ (f(xπi)) (D[f]⟨vπi, xπi⟩))(⟦V⟧γ) = λv.⟦ω⟧γ (f(⟦Vπi⟧γ)) (D[f]⟨vπi, ⟦Vπi⟧γ⟩) = λv.⟦ω⟧γ (f(⟦Vπi⟧γ)) (⟦(J f · vπi) Vπi⟧⟨γ, v⟩) = ⟦(λv.(J f · vπi) Vπi)* · (ω (f(Vπi)))⟧γ.

(16) We prove the most complicated case, (16c), which leads to (16a) and (16b). By IH, ⟦(Ω λy.L) · ω V⟧ = ⟦(λv.S)* · (ω V')⟧ implies that, for any 1-form ϕ, any γ and any x, v,

  ϕ (⟦L⟧⟨γ, ⟦V⟧γ, x⟩) (D[⟦L⟧⟨γ, −, x⟩]⟨v, ⟦V⟧γ⟩) = ϕ (⟦V'⟧γ) (⟦S⟧⟨γ, x, v⟩),

so by the Hahn–Banach Theorem, D[⟦L⟧⟨γ, −, x⟩]⟨v, ⟦V⟧γ⟩ = ⟦S⟧⟨γ, x, v⟩. Since Vπi is of a dual type, by Lemma 4.4(2), D[⟦Vπi⟧γ] = (⟦Vπi⟧γ) ∘ π1. Writing g : ⟨y, z⟩ ↦ ⟦L⟧⟨γ, y, z⟩, a computation with the [CD] axioms gives

  D[cur(⟦(λz.L)* · yπi⟧)γ]⟨v, ⟦V⟧γ⟩ = D[λy.λz.yπi (⟦L⟧⟨γ, y, z⟩)]⟨v, ⟦V⟧γ⟩ = λz.( vπi (g⟨⟦V⟧γ, z⟩) + (⟦Vπi⟧γ) (D[⟦L⟧⟨γ, −, z⟩]⟨v, ⟦V⟧γ⟩) ) = λz.( vπi (⟦L⟧⟨γ, ⟦V⟧γ, z⟩) + (⟦Vπi⟧γ) (⟦S⟧⟨γ, z, v⟩) ) = ⟦(λz.L[V/y])* · vπi + (λz.S)* · Vπi⟧⟨γ, v⟩.

Hence

  ⟦(Ω λy.(λz.L)* · yπi) · ω V⟧γ = λv.⟦ω⟧γ (⟦(λz.L)* · yπi⟧⟨γ, ⟦V⟧γ⟩) (D[cur(⟦(λz.L)* · yπi⟧)γ]⟨v, ⟦V⟧γ⟩) = λv.⟦ω⟧γ (⟦(λz.L[V/y])* · Vπi⟧γ) (⟦(λz.L[V/y])* · vπi + (λz.S)* · Vπi⟧⟨γ, v⟩) = ⟦(λv.(λz.S)* · Vπi + (λz.L[V/y])* · vπi)* · (ω ((λz.L[V/y])* · Vπi))⟧γ.

(17) (Ω λy.(Ω λx.L) · z) · ω V −→ (Ω λy.λa.(λv.S)* · (z L[a/x])) · ω V, if ((Ω λx.L) · z) a −→* (λv.S)* · (z L[a/x]) for a fresh variable a. By IH, ⟦((Ω λx.L) · z) a⟧ = ⟦(λv.S)* · (z L[a/x])⟧ implies that, for any ϕ, γ, y, a, v,

  ϕ (⟦L⟧⟨γ, a, y⟩) (D[⟦L⟧⟨γ, −, y⟩]⟨v, a⟩) = ϕ (⟦L⟧⟨γ, a, y⟩) (⟦S⟧⟨γ, y, a, v⟩),

so by the Hahn–Banach Theorem, D[⟦L⟧⟨γ, −, y⟩]⟨v, a⟩ = ⟦S⟧⟨γ, y, a, v⟩. Hence

  ⟦(Ω λx.L) · z⟧⟨γ, y⟩ = λav.⟦z⟧⟨γ, y⟩ (⟦L⟧⟨γ, a, y⟩) (D[⟦L⟧⟨γ, −, y⟩]⟨v, a⟩) = λav.⟦z⟧⟨γ, y⟩ (⟦L[a/x]⟧⟨γ, y⟩) (⟦S⟧⟨γ, y, a, v⟩) = λa.⟦(λv.S)* · (z L[a/x])⟧⟨γ, y, a⟩ = ⟦λa.(λv.S)* · (z L[a/x])⟧⟨γ, y⟩,

and therefore ⟦(Ω λy.(Ω λx.L) · z) · ω V⟧γ = ⟦(Ω λy.λa.(λv.S)* · (z L[a/x])) · ω V⟧γ.

(18) If (Ω λy.L) · ω V −→ (λv.S)* · (ω (L[V/y])) and x ∉ FV(V), then (Ω λy.λx.L) · ω V −→ (λv.λx.S)* · (ω (λx.L[V/y])). Recall the (D-curry) rule: D[cur(f)] = cur(D[f] ∘ ⟨π1 × 0, π2 × Id⟩). By IH, ⟦(Ω λy.L) · ω V⟧ = ⟦(λv.S)* · (ω (L[V/y]))⟧, which means that, for any 1-form ϕ, any γ and any x, v,

  ϕ (⟦L⟧⟨γ, ⟦V⟧γ, x⟩) (D[⟦L⟧⟨γ, −, x⟩]⟨v, ⟦V⟧γ⟩) = ϕ (⟦L⟧⟨γ, ⟦V⟧γ, x⟩) (⟦S⟧⟨γ, x, v⟩).

By the Hahn–Banach Theorem, D[⟦L⟧⟨γ, −, x⟩]⟨v, ⟦V⟧γ⟩ = ⟦S⟧⟨γ, x, v⟩. Now, writing f := uncur(cur(⟦L⟧)⟨γ, −⟩),

  D[cur(⟦λx.L⟧)γ]⟨v, ⟦V⟧γ⟩ = D[cur(f)]⟨v, ⟦V⟧γ⟩ = cur(D[f] ∘ ⟨π1 × 0, π2 × Id⟩)⟨v, ⟦V⟧γ⟩ = λx.D[f⟨−, x⟩]⟨v, ⟦V⟧γ⟩ = λx.D[⟦L⟧⟨γ, −, x⟩]⟨v, ⟦V⟧γ⟩ = λx.⟦S⟧⟨γ, x, v⟩.

Hence, we have

  ⟦((Ω λy.λx.L) · ω) V⟧γ = (λxv.⟦ω⟧γ (⟦λx.L⟧⟨γ, x⟩) (D[cur(⟦λx.L⟧)γ]⟨v, x⟩))(⟦V⟧γ) = λv.⟦ω⟧γ (⟦λx.L[V/y]⟧γ) (λx.⟦S⟧⟨γ, x, v⟩) = λv.⟦ω (λx.L[V/y])⟧γ (⟦λx.S⟧⟨γ, v⟩) = ⟦(λv.λx.S)* · (ω (λx.L[V/y]))⟧γ.

(19) We prove the complicated case (19c); (19a) and (19b) follow. First note that, by (D-eval) in [21],

  D[ev ∘ ⟨πi, πj⟩]⟨v, x⟩ = πi(v)(πj(x)) + D[πi(x)]⟨πj(v), πj(x)⟩.

By IH, with Vπi ≡ λz.P', we have ⟦(Ω λz.P') · ω Vπj⟧ = ⟦(λv'.S')* · (ω (P'[Vπj/z]))⟧, and hence, by the Hahn–Banach Theorem,

  D[⟦Vπi⟧γ]⟨vπj, ⟦Vπj⟧γ⟩ = D[cur(⟦P'⟧)γ]⟨v, ⟦Vπj⟧γ⟩ = ⟦S'⟧⟨γ, v⟩.

Hence

  ⟦((Ω λy.yπi yπj) · ω) V⟧γ = λv.⟦ω⟧γ (⟦yπi yπj⟧⟨γ, ⟦V⟧γ⟩) (D[ev ∘ ⟨πi, πj⟩]⟨v, ⟦V⟧γ⟩) = λv.⟦ω⟧γ (⟦Vπi Vπj⟧γ) (vπi (⟦Vπj⟧γ) + D[⟦Vπi⟧γ]⟨vπj, ⟦Vπj⟧γ⟩) = λv.⟦ω⟧γ (⟦Vπi Vπj⟧γ) (vπi (⟦Vπj⟧γ) + ⟦S'⟧⟨γ, v⟩) = ⟦(λv.vπi Vπj + S'[Vπj/v])* · (ω (Vπi Vπj))⟧γ.

(20a) Say y is a free variable in E, and (Ω (λy.⟨y, E⟩)) · ω V −→ (λv.⟨v, S⟩)* · (ω ⟨V, E[V/y]⟩), if (Ω λy.E) · ω V −→ (λv.S)* · (ω (E[V/y])). By IH, we have ⟦E⟧⟨γ, ⟦V⟧γ⟩ = ⟦E[V/y]⟧γ and D[⟦E⟧⟨γ, −⟩]⟨v, ⟦V⟧γ⟩ = ⟦S⟧⟨γ, v⟩. Now

  ⟦(Ω (λy.⟨y, E⟩)) · ω V⟧γ = λv.⟦ω⟧γ ⟨⟦V⟧γ, ⟦E⟧⟨γ, ⟦V⟧γ⟩⟩ ⟨D[Id]⟨v, ⟦V⟧γ⟩, D[⟦E⟧⟨γ, −⟩]⟨v, ⟦V⟧γ⟩⟩ = λv.⟦ω⟧γ ⟨⟦V⟧γ, ⟦E[V/y]⟧γ⟩ ⟨v, ⟦S⟧⟨γ, v⟩⟩ = λv.⟦ω ⟨V, E[V/y]⟩⟧γ (⟦λv.⟨v, S⟩⟧γ v) = ⟦(λv.⟨v, S⟩)* · (ω ⟨V, E[V/y]⟩)⟧γ.

(20b) If y ∉ FV(E), we have (Ω (λy.⟨y, E⟩)) · ω V −→ (λv.⟨v, 0⟩)* · (ω ⟨V, E⟩), and

  ⟦(Ω (λy.⟨y, E⟩)) · ω V⟧γ = λv.⟦ω⟧γ ⟨⟦V⟧γ, ⟦E⟧⟨γ, ⟦V⟧γ⟩⟩ ⟨D[Id]⟨v, ⟦V⟧γ⟩, D[⟦E⟧⟨γ, −⟩]⟨v, ⟦V⟧γ⟩⟩ = λv.⟦ω⟧γ ⟨⟦V⟧γ, ⟦E⟧γ⟩ ⟨v, 0⟩ = λv.⟦ω ⟨V, E⟩⟧γ (⟦λv.⟨v, 0⟩⟧γ v) = ⟦(λv.⟨v, 0⟩)* · (ω ⟨V, E⟩)⟧γ. □

Lemma 5.1. Let P be a term.
1. If P −→ P', then there exists a reduct s of (P')t such that Pt −→* s in the differential λ-calculus.
2. ⟦P⟧ = ⟦Pt⟧ in C.

Proof. 1. Easy induction on −→.
2. We prove by induction on P; most cases are trivial. Let γ ∈ ⟦Γ⟧.
(dual)

  ⟦(λx.S1)* · S2⟧γ = λv.⟦S2⟧γ (cur(⟦S1⟧)γ v) = λv.⟦(S2)t⟧⟨γ, v⟩ (⟦λx.(S1)t⟧⟨γ, v⟩ (⟦v⟧⟨γ, v⟩)) = λv.⟦(S2)t ((λx.(S1)t) v)⟧⟨γ, v⟩ = ⟦λv.(S2)t ((λx.(S1)t) v)⟧γ = ⟦((λx.S1)* · S2)t⟧γ.

(pb)

  ⟦(Ω (λy.P)) · S⟧γ = λxv.(⟦S⟧γ) (⟦P⟧⟨γ, x⟩) (D[cur(⟦P⟧)γ]⟨v, x⟩) = λxv.(⟦S⟧γ) (⟦P⟧⟨γ, x⟩) (D[⟦P⟧]⟨⟨0, v⟩, ⟨γ, x⟩⟩) = λxv.(⟦St⟧⟨γ, x, v⟩) (cur(⟦Pt⟧)γ (⟦x⟧⟨γ, x, v⟩)) (⟦(D(λy.Pt) · v) x⟧⟨γ, x, v⟩) = ⟦λxv.St ((λy.Pt) x) ((D(λy.Pt) · v) x)⟧γ = ⟦((Ω λy.P) · S)t⟧γ. □

Corollary 5.2 (Strong Normalization). Any reduction sequence from any term is finite, and ends in a value.

Proof. If P did not terminate, then, by Lemma 5.1(1) and the confluence of the differential λ-calculus, proved in [15], we could form a non-terminating reduction sequence in the differential λ-calculus. This contradicts the strong normalization property of the differential λ-calculus. □
