
The Differentiable Curry

Martin Abadi, Richard Wei, Gordon Plotkin, Dimitrios Vytiniotis, Dan Belov
Google Brain and DeepMind

1 Introduction

Differentiable programming allows programmers to calculate program gradients and unlocks experimentation with new optimizers and neural network architectures. This is why modern deep learning frameworks [1, 23, 20, 7] introduce differentiation APIs (e.g. tf.gradients in TensorFlow). Programmers ask for the gradient of an objective with respect to its parameters; the gradient is then used to optimize these parameters, e.g. through stochastic gradient descent.

Recent projects, such as Swift for TensorFlow (S4TF) (www.tensorflow.org/swift) and Julia Zygote [15], in the spirit of the seminal "Lambda the Ultimate Backpropagator" (LTUB) [21], advocate AD as a first-class construct in a general-purpose programming language, and aim to take advantage of traditional compiler optimizations for efficient code generation. The idea is to produce statically, for every differentiable function of type a → b, another function returning the result of the original function and a linear back-propagator map, of type a → (b, Tan_b ⊸ Tan_a). We call these compiler-generated functions the representation functions (rep-functions for short) of differentiable functions, and will use a ⇝ b as a type abbreviation for a → (b, Tan_b ⊸ Tan_a). To achieve this, the AD pass merely composes rep-functions out of primitive rep-functions like those for (+) and (−), by systematically lifting these primitives through the constructs of the language.

An important challenge in this setting is the differentiation of functions that accept or return other functions, perhaps capturing (differentiable or non-differentiable) variables. Partial applications must not "forget" to back-propagate to captured variables, and more generally we need AD that provably preserves equational reasoning – needed to justify inlining, common sub-expression elimination etc. As we will see (Section 2), higher-order functions are ubiquitous in modern statically-typed languages, even inside the implementation of end-to-end first-order programs. They have to be tackled head-on to avoid additional complications in a compiler, such as extra inlining and loop unrolling or early defunctionalization, and to allow for separate compilation, to name a few. This is the challenge we address. We focus on (i) statically-typed, (ii) compile-time, (iii) reverse-mode AD, a scenario exemplified by Swift AD (http://bit.ly/swift-autodiff). Our contributions are:

• Following recent work [11] we introduce combinators for rep-functions that also return pullback linear maps (back-propagators), and show how they can be used for AD. Generalizing to higher-order functions boils down to introducing differentiable curry and eval combinators.

• Higher-order functions (like curry) accept or return closures. For a function from tensors to tensors, its pullback at some input is also a linear map from tensors to tensors. We use the term tangent space, denoted as Tan_t, for the space of perturbations of values of a type t. Hence the tangent space of a tensor is just a tensor. But what should be the tangent space of a function type? Perhaps surprisingly, a function type itself is not the right answer. We provide two possible implementations for function tangents and differentiable currying, and explain the tradeoffs.

• The first implementation – novel to our knowledge – is typeable in a simply-typed discipline and bears similarities to tracing, but may involve re-computation of the function we differentiate during the execution of its backwards derivative.

• The second is a dependently typed version of an idea behind "Lambda the Ultimate Backpropagator" [21], itself inspired by the view of functions as closures whose tangents are the tangents of their environments. We put that idea in the combinatory setting of Elliott. Our work implies that an "erased" version that relies on reinterpret casts in a weaker type system is in fact safe.
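To make the rep-function idea concrete, here is a minimal, runnable Haskell sketch (our own rendering, with tangents specialized to Float; the paper's Tan is type-indexed):

  -- a rep-function pairs the primal result with a back-propagator
  type D = Float -> (Float, Float -> Float)

  square :: D
  square x = (x * x, \g -> 2 * x * g)

  -- gradient at a point: run the back-propagator on tangent 1
  gradient :: D -> Float -> Float
  gradient f x = snd (f x) 1    -- gradient square 3.0 == 6.0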

2 Higher-order functions and AD

Higher-order functions may not seem essential for differentiation, but in a general-purpose programming language (e.g. Swift) they are actually ubiquitous, even inside the implementation of a first-order program. Consider, for instance, the case of a recurrent neural network (RNN) model. An RNN, in its simplest form, folds a state transformer (called the RNN cell – e.g. a Long Short-Term Memory [14]) through a sequence of inputs and produces a final state. The state is often called the "hidden state" of the RNN:

  rnnCell :: (Params, Tensor, Tensor) → Tensor
  rnnCell (params, hidden_state, input) = ... // return new hidden_state

  runRNN :: Params ⇝ Tensor ⇝ [Tensor] ⇝ Tensor
  runRNN params init_state xs =
    let g :: Tensor ⇝ Tensor ⇝ Tensor
        g hid input = rnnCell (params, hid, input)
    in fold g init_state xs

Here function g is the partial application of rnnCell on params, passed to the recursive function fold. This example shows many features required for a general-purpose language: (i) partial applications, (ii) recursive functions (such as fold), (iii) recursive datatypes (such as [Tensor] above). In particular function g, the partial application of rnnCell, captures the parameters and – unlike our simple example above – needs to back-propagate to these parameters. Moreover, we cannot eliminate higher-order functions by inlining, unless we unroll fold. And even if we could unroll fold, rnnCell might be imported from a different module whose source is not available for inlining.

In this extended abstract we will only focus on higher-order functions. We will show how we can express differentiation through higher-order programs following Elliott's recipe by providing implementations of a small set of combinators, given below:

  id      :: τ ⇝ τ
  curry   :: ((τa, τb) ⇝ τc) → (τa ⇝ (τb ⇝ τc))
  eval    :: (τ ⇝ σ, τ) ⇝ σ
  prod    :: (τy ⇝ τa) → (τy ⇝ τb) → (τy ⇝ (τa, τb))
  (◦)     :: (τa ⇝ τb) → (τb ⇝ τc) → (τa ⇝ τc)
  const_τ :: σ → (τ ⇝ σ)
  proj_i  :: (τ1, ..., τn) ⇝ τi

Unfolding types, we see that curry/eval require a definition for function tangents, Tan_(τ ⇝ σ). What these should be (and why) is answered in this work.

3 Combinatory-style AD

We first illustrate how AD can be implemented using the aforementioned set of combinators.

3.1 Step 1: Translate to combinators

We first use the combinator language in the previous section as the target of conversion from a (conventional) call-by-value higher-order λ-calculus of differentiable functions, λ∂. The translation, in Figure 1, has (yet) nothing to do with AD; rather it is reminiscent of the well-understood translation of λ-calculus into cartesian closed categories (CCCs) [8].

  The translation judgement has the form J[[∆ ⊢ e : τ]] = b.

  b = J[[∆, (x:τ) ⊢ e : σ]]
  ─────────────────────────────────────────── (BDLam)
  J[[∆ ⊢ diffλx:τ. e : τ ⇝ σ]] = curry b

  b1 = J[[∆ ⊢ e1 : τ ⇝ σ]]    b2 = J[[∆ ⊢ e2 : τ]]
  ─────────────────────────────────────────── (BDApp)
  J[[∆ ⊢ e1 e2 : σ]] = prod b1 b2 ◦ eval

  x is the i-th variable in ∆
  ─────────────────────────────────────────── (BDVar)
  J[[∆ ⊢ x : τ]] = proj_i

  b1 = J[[∆ ⊢ e1 : τ1]]    b2 = J[[∆ ⊢ e2 : τ2]]
  ─────────────────────────────────────────── (BDProd)
  J[[∆ ⊢ (e1, e2) : (τ1, τ2)]] = prod b1 b2

  Figure 1: Translating to combinators

J[[∆ ⊢ e : τ]] defines such a type-directed translation. The rules ensure that if ∆ ⊢ e : τ then ⊢ J[[∆ ⊢ e : τ]] : ∆ ⇝ τ, where we abuse notation and refer to ∆ as the tuple of all types of environment variables. The conversion also assumes (not shown, for lack of space) that all differentiable primitives, such as (∗) : (Float, Float) ⇝ Float, come with rep-function implementations.
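As a concrete instance of the rules of Figure 1 (our worked example, not taken from the paper's figures), take ∆ = x : Float and the body x ∗ x. By BDVar and BDProd the pair (x, x) translates to prod proj_1 proj_1, and the application of the primitive (∗) composes this with the primitive's rep-function:

  J[[x : Float ⊢ x ∗ x : Float]] = prod proj_1 proj_1 ◦ mult

Forwards, this computes x·x; backwards, mult's pullback maps an output tangent g to (x·g, x·g), and prod sums the two projections' contributions back into the single tangent 2·x·g for x.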

3.2 Step 2: Implement combinators

Next, we implement our combinators, accepting and returning a ⇝ b values, i.e. functions of type a → (b, Tan_b → Tan_a). We also need to define a tangent space Tan_t for every type t. Figure 2 presents such a small library. Figure 6 in the Appendix gives the definition of T[τ] for all the types of λ∂. The cases for floats, products, and tensors of floats are standard; also the rules for "discrete" types all return Unit. Some combinators (e.g. prod) rely on having 0 and (+) defined for every tangent type T[τ], also found in Figure 6. In the next sections we discuss the highlighted parts of Figure 6, to do with function tangents, curry, and eval.

  vjp :: (τ ⇝ σ) → τ → T[σ] → T[τ]
  vjp f x gb = snd (f x) gb

  mult :: (Float, Float) ⇝ Float
  mult (x1, x2) = (x1 ∗ x2, λg. (x2 ∗ g, x1 ∗ g))

  proj_left :: (τ, σ) ⇝ τ
  proj_left (a, b) = (a, λg. (g, 0))

  prod :: (τ ⇝ σ1) → (τ ⇝ σ2) → (τ ⇝ (σ1, σ2))
  prod f g y = let (a, pbf) = f y
                   (b, pbg) = g y
               in ((a, b), λ(ga, gb). pbf ga + pbg gb)

  (◦) :: (τa ⇝ τb) → (τb ⇝ τc) → (τa ⇝ τc)
  (◦) f g a = let (b, pbf) = f a
                  (c, pbg) = g b
              in (c, λgc. pbf (pbg gc))

  Figure 2: Library of combinators (excerpt)

  curry :: ((τa, τb) ⇝ τc) → (τa ⇝ (τb ⇝ τc))
  curry f = new_f
    where
      new_f :: τa → (τb ⇝ τc, T[τb ⇝ τc] → T[τa])
      new_f t =
        let new_g :: τb → (τc, T[τc] → T[τb])
            new_g s = let (r, pb) = f (t, s)
                      in (r, λgr. snd (pb gr))
            new_pb :: T[τb ⇝ τc] → T[τa]
            new_pb grs =
              let aux (s, gr) = fst (snd (f (t, s)) gr)
              in sum (map aux grs)
        in (new_g, new_pb)

  eval :: (τ ⇝ σ, τ) ⇝ σ
  eval (f, x) = let (y, pb) = f x
                in (y, λg. ([(x, g)], pb g))

  Figure 3: Simply-typed differentiable curry/eval
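The simply-typed curry of Figure 3 is directly runnable. Here is a self-contained Haskell transcription, specialized to Float (a sketch with our own names; the list of (input, output tangent) pairs is the function-tangent representation from Section 4):

  -- tangent of a Float-to-Float rep-function: a log of calls
  type TanFun = [(Float, Float)]

  mult :: (Float, Float) -> (Float, Float -> (Float, Float))
  mult (x1, x2) = (x1 * x2, \g -> (x2 * g, x1 * g))

  -- Figure 3's curry, at type (Float, Float) to Float
  curryD :: ((Float, Float) -> (Float, Float -> (Float, Float)))
         -> Float
         -> (Float -> (Float, Float -> Float), TanFun -> Float)
  curryD f t = (new_g, new_pb)
    where
      new_g s = let (r, pb) = f (t, s) in (r, \gr -> snd (pb gr))
      new_pb grs = sum [ fst (snd (f (t, s)) gr) | (s, gr) <- grs ]

  demo :: ((Float, Float), Float)
  demo = let (g, pbT) = curryD mult 3.0          -- partially apply to t = 3
             (r, pbS) = g 4.0                    -- r = 12
         in ((r, pbS 1.0), pbT [(4.0, 1.0)])     -- ((12.0, 3.0), 4.0)

Note how new_pb re-runs f on every logged call; this is exactly the re-computation that Section 5 sets out to remove.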
4 Simply-typed curry

We define curry and eval in Figure 3. These definitions, together with the need for an addition and a zero operator, effectively force the equation T[τ ⇝ σ] = [(τ, T[σ])]: a list of pairs of argument values and result tangents. Addition and zero are given by list concatenation and the empty list, as we see in the highlighted parts of Figure 6. These lists intuitively track all calls of a function, and hence we have to sum up all the resulting tangents from running our function forwards and then backwards for every element, arriving at the implementation of curry in Figure 3. The eval combinator merely records the primal value (x) and the output tangent (g) in a singleton list.

4.1 Properties and metatheory

Is our construction correct? We answer by showing that it respects equational reasoning principles. For example, when given f : (τ1, τ2) ⇝ τ3, which we can curry and repeatedly evaluate with an argument of type τ1 and another of type τ2, we will get a function that is not only forward-equivalent, but also has an equivalent back-propagator. An example are the forward-equivalent functions foo1 and foo2 in the "Partial Application" column of Figure 4.

Consider foo1 and foo2 in the "Forgetting Results" column of Figure 4. The two functions should be equivalent in forward and reverse mode, but for foo1 we will back-propagate a tangent value of [(x, 0)] for the use of g. In foo2, since g is not used at all, we will back-propagate []. We therefore want to treat [(x, 0)] and [] as equivalent, even if they are different lists.

Finally, foo1 and foo2 in the "Summing Results" column are forward-equivalent, hence we expect equivalent back-propagators. In the first case, we call f multiple times with the same argument and sum the results; in the second case we call it once. The tangents that are back-propagated to f in the first case will be [(x, g), (x, g)], where g is the tangent corresponding to the result of foo1. In the second case we get [(x, g + g)]. We need these two tangents to be treated as equivalent, even if they are different lists.

We have thus formalized a notion of equivalence that goes beyond β-equivalence for back-propagators, and showed that various CCC laws hold of our combinators wrt. that equivalence. These laws guarantee equivalences for the examples presented in this section.
  Partial Application:
    f :: (Float, Float) ⇝ Float
    foo1, foo2 :: (Float, Float) ⇝ Float
    foo1 (a, b) = (λxb → f (a, xb)) b
    foo2 (a, b) = f (a, b)

  Forgetting Results:
    foo1, foo2 :: (Float ⇝ Float, Float ⇝ Float) ⇝ Float ⇝ Float
    foo1 (f, g) x = fst (f x, g x)
    foo2 (f, g) x = f x

  Summing Results:
    foo1, foo2 :: (Float ⇝ Float) ⇝ Float ⇝ Float
    foo1 f x = f x + f x
    foo2 f x = let y = f x in (y + y)

  Figure 4: Example equivalences (foo1 and foo2 in each column)
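To see the "Summing Results" equivalence concretely, here is a hand-written Haskell sketch (ours) of the rep-functions the combinator pipeline would produce for that column's foo1 and foo2, showing the two call logs:

  type D      = Float -> (Float, Float -> Float)   -- Float-to-Float rep-function
  type TanFun = [(Float, Float)]                   -- its tangent: a call log

  foo1, foo2 :: D -> Float -> (Float, Float -> (TanFun, Float))
  foo1 f x = let (y1, pb1) = f x
                 (y2, pb2) = f x
             in (y1 + y2, \g -> ([(x, g), (x, g)], pb1 g + pb2 g))

  foo2 f x = let (y, pb) = f x
             in (y + y, \g -> ([(x, g + g)], pb (g + g)))

The logs [(x, g), (x, g)] and [(x, g + g)] differ as lists, yet must be identified by the equivalence.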
5 Dependently-typed curry

Unfortunately the differentiable curry of Figure 3 has a back-propagator, new_pb, that involves a full forward computation of the original function f at each of the recorded inputs it was applied to – hence it suffers from redundant computation. We need something better.

A key insight from the LTUB work [21], leading to an efficient solution, has been this: a back-propagator for a function τ ⇝ σ should take as argument a value of type T[σ] but return not only the tangent T[τ] but also the tangent of the environment ∆ over which the function closed when it was constructed, T[∆]. To understand the intuition, it is helpful to think of a function value as just (i) a static top-level code pointer that does not vary with any input, plus (ii) an environment of captured values that could vary as these captured values vary. In other words, for a closure of type τ ⇝ σ capturing environment ∆, its tangent space is T[τ ⇝ σ] = T[∆].

LTUB originally presented an AD system for a dynamically typed language. But in a static type system we immediately hit a problem: it is no longer the case that T[·] can be a type operator, because different values of type τ ⇝ σ capture different environments. We therefore revise T[τ] to become a value-dependent type operator that takes a value of type τ as an argument. We write T[v : τ] to denote this dependent tangent type, and when the type is obvious from the context we will simply write T[v].

Definition 5.1 (Dependently typed rep-function). We define τ ⇝ σ to be the following type:

  ∃τ∆♭. Π(x:τ). Σ(y:σ). T[y] ⊸ (τ∆♭, T[x])

where τ∆♭ denotes a first-order type corresponding to the tangents of the closure environment.

Definition 5.2 (Dependently typed tangents). We define T[v : τ], where v is a closed term of type τ, similarly to the previous non-dependent definition, but modify the case for functions as follows:

  T[v : τ ⇝ σ] = match v with exT τ∆♭ f ⇒ τ∆♭

These definitions – reminiscent of typed closure conversion [19] – deserve some discussion. Definition 5.1 is just a small augmentation of the rep-function type that we have been familiarized with so far. It existentially quantifies over an environment tangent type τ∆♭, and returns a function that, when given an argument (x:τ), will return a dependent sum (y:σ) and an additional back-propagator function. The back-propagator also returns a τ∆♭. Intuitively, this corresponds to the tangent of the environment captured in the function. For this reason Definition 5.2 opens up one of these existential types and returns the witness type.
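The closure view can be made concrete in Haskell with an existential environment (our sketch, assuming GADTs; it captures the intuition only, not the full dependent typing):

  {-# LANGUAGE GADTs #-}
  -- a function value = static code pointer + captured environment
  data Closure a b where
    Closure :: env -> ((env, a) -> b) -> Closure a b

  apply :: Closure a b -> a -> b
  apply (Closure env code) x = code (env, x)

Only env varies, so the tangent of a Closure value is the tangent of its packed env; and since env is existentially hidden, that tangent type depends on the value, exactly as in Definition 5.2.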
In Figure 5 we give such a curry and eval, in a non-opinionated dependent-type notation (we have also produced versions in Agda and Coq). The reader is urged to examine the code, but ignore the type signatures, resting assured that it all type checks. Another remark is that composition (also in Figure 5) simply collects together environment tangents. These need not be maps from variables to tangents, rather just tuples.

  curry :: ((τt, τs) ⇝ τr) → (τt ⇝ (τs ⇝ τr))
  curry (exT τ♭ f) = exT τ♭ new_f
    where
      new_f :: Π(t:τt). Σ(g : τs ⇝ τr). T[g] → (τ♭, T[t])
      new_f t =
        let gf :: Π(s:τs). Σ(r:τr). T[r] → ((τ♭, T[t]), T[s])
            gf s = let (r, pullback) = f (t, s)
                   in (r, λgr → let (cte, (ctt, cts)) = pullback gr
                                in ((cte, ctt), cts))
            g = exT (τ♭, T[t]) gf
            new_pb :: T[g] → (τ♭, T[t])
            new_pb env = env
        in (g, new_pb)

  eval :: (τ ⇝ σ, τ) ⇝ σ
  eval = exT () new_f
    where new_f (exT τ♭ f, x) = let (y, pb) = f x
                                in (y, λg → ((), pb g))

  comp :: (a ⇝ b) → (b ⇝ c) → (a ⇝ c)
  comp (exT τf♭ f) (exT τg♭ g) = exT (τf♭, τg♭) h
    where h a = let (b, pb_f) = f a
                    (c, pb_g) = g b
                in (c, λgc → let (envg, gb) = pb_g gc
                                 (envf, ga) = pb_f gb
                             in ((envf, envg), ga))

  Figure 5: Dependently-typed differentiable curry/eval

We have shown that dependently typed currying satisfies laws similar to those of the simply-typed version. As a final remark, most production languages do not support dependent types. There our solution can be safely implemented using reinterpret casts. This is, in fact, a tentative proposal for the Swift AD project.¹

¹ Using the erased AnyDerivative struct, see https://www.tensorflow.org/swift/api_docs/Structs/AnyDerivative.
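In a language without dependent types, the environment tangent can be erased behind a dynamic type, with downcasts justified by the typed development above. A Haskell sketch of this erased variant (our names; Swift would use AnyDerivative where we use Dynamic):

  import Data.Dynamic (Dynamic, toDyn, fromDynamic)
  import Data.Typeable (Typeable)

  type EnvTan = Dynamic   -- erased T[env]

  packEnv :: Typeable t => t -> EnvTan
  packEnv = toDyn

  unpackEnv :: Typeable t => EnvTan -> t
  unpackEnv d = case fromDynamic d of
    Just t  -> t
    Nothing -> error "unreachable: ruled out by the typed construction"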
6 Discussion and extensions

6.1 Internalizing differentiation

We have not yet described how to calculate vector-Jacobian products. Indeed, rep-functions are ordinary functions returning linear maps, and our λ∂ of Section 3 does not include ordinary (→), λ, or applications. As a result, it cannot type vjp (Figure 2)! We believe there is a principled way to integrate ordinary and differentiable arrows, while still preserving equational reasoning, but it is out of scope for this abstract.

6.2 Further features

By taking the same approach of (i) compiling to combinators, and (ii) implementing these combinators in terms of a core language, we can show how to introduce control flow and recursion into a differentiable programming language. Supporting richer algebraic types may also be possible, through user-specified definitions for the tangent spaces of these types.

7 Related work

We presented the marriage of ideas behind Elliott's categorical presentation of AD [11] and the seminal LTUB work [21]. We extend Elliott's presentation to differentiable currying and evaluation, while putting the ideas of Pearlmutter and Siskind in a typed setting.

The idea of using closures as back-propagators is receiving recent attention. For example, Julia Zygote [15] and Swift AD adopt this design. Other recent work follows similar ideas [27, 26] but uses meta-programming as an implementation technique. There exists work on differentiation semantics [24] and differentiable categories [5], usually interpreting types as vector spaces. Related ideas have appeared for higher-order lambda calculi [18, 9], but the underlying foundations only tackle forward mode.

AD has a long history in array programming and scientific computing [13]. Forward-mode AD has been presented before for (first-order) functional programs [16, 10], as libraries in general-purpose languages [3], DSLs [6], and more. We urge the reader to consult the comprehensive survey [2]. Recent systems revisit efficient differentiable array programming [22, 25]. There also exist widely used AD libraries for Python [17], and new systems that target deep learning applications [12] – the latter supporting variable capture through early defunctionalization.

References

[1] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek G. Murray, Benoit Steiner, Paul Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: A system for large-scale machine learning. In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation, OSDI'16, pages 265–283, Berkeley, CA, USA, 2016. USENIX Association.

[2] Atılım Güneş Baydin, Barak A. Pearlmutter, Alexey Andreyevich Radul, and Jeffrey Mark Siskind. Automatic differentiation in machine learning: A survey. J. Mach. Learn. Res., 18(1):5595–5637, January 2017.

[3] Atılım Güneş Baydin, Barak A. Pearlmutter, and Jeffrey Mark Siskind. DiffSharp: An AD library for .NET languages. CoRR, abs/1611.03423, 2016.

[4] Samy Bengio, Hanna M. Wallach, Hugo Larochelle, Kristen Grauman, Nicolò Cesa-Bianchi, and Roman Garnett, editors. Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, 3-8 December 2018, Montréal, Canada, 2018.

[5] Richard Blute, Thomas Ehrhard, and Christine Tasson. A convenient differential category. CoRR, abs/1006.3140, 2010. To appear in Cahiers de Top. et Géom. Diff. Cat.

[6] Charisee Chiw, Gordon Kindlmann, John Reppy, Lamont Samuels, and Nick Seltzer. Diderot: A parallel DSL for image analysis and visualization. In Proceedings of the 33rd ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI '12, pages 111–120, New York, NY, USA, 2012. ACM.

[7] Ronan Collobert, Samy Bengio, and Johnny Mariéthoz. Torch: A modular machine learning software library. Technical report, IDIAP Research Institute, 2002.

[8] P-L. Curien. Categorical combinators. Inf. Control, 69(1-3):188–254, April 1986.

[9] Thomas Ehrhard and Laurent Regnier. The differential lambda-calculus. Theoretical Computer Science, 309(1):1–41, 2003.

[10] Conal Elliott. Beautiful differentiation. In International Conference on Functional Programming (ICFP), 2009.

[11] Conal Elliott. The simple essence of automatic differentiation. Proc. ACM Program. Lang., 2(ICFP):70:1–70:29, July 2018.

[12] Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, and Skye Wanderman-Milne. JAX: composable transformations of Python+NumPy programs, 2018.

[13] L. Hascoët and V. Pascual. The Tapenade automatic differentiation tool: principles, model, and specification. ACM Transactions On Mathematical Software, 39(3), 2013.

[14] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Comput., 9(8):1735–1780, November 1997.

[15] Michael Innes. Don't unroll adjoint: Differentiating SSA-form programs. CoRR, abs/1810.07951, 2018.

[16] Jerzy Karczmarczuk. Functional differentiation of computer programs. In Proceedings of the Third ACM SIGPLAN International Conference on Functional Programming, ICFP '98, pages 195–203, New York, NY, USA, 1998. ACM.

[17] Dougal Maclaurin, David Duvenaud, and Ryan P. Adams. Autograd: Effortless gradients in Numpy. In ICML Workshop on Automatic Machine Learning, 2015.

[18] Oleksandr Manzyuk. A simply typed λ-calculus of forward automatic differentiation. Electr. Notes Theor. Comput. Sci., 286:257–272, 2012.

[19] Yasuhiko Minamide, Greg Morrisett, and Robert Harper. Typed closure conversion. In Proceedings of the 23rd ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL '96, pages 271–283, New York, NY, USA, 1996. ACM.

[20] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. In NIPS Autodiff Workshop, 2017.

[21] Barak A. Pearlmutter and Jeffrey Mark Siskind. Reverse-mode AD in a functional framework: Lambda the ultimate backpropagator. ACM Trans. Program. Lang. Syst., 30(2):7:1–7:36, March 2008.

[22] Amir Shaikhha, Andrew Fitzgibbon, Dimitrios Vytiniotis, Simon Peyton Jones, and Christoph Koch. Efficient differentiable programming in a functional array-processing language. CoRR, abs/1806.02136, 2018. http://arxiv.org/abs/1806.02136.

[23] Theano Development Team. Theano: A Python framework for fast computation of mathematical expressions. arXiv e-prints, abs/1605.02688, May 2016.

[24] M. Vákár, O. Kammar, and S. Staton. Diffeological spaces and semantics for differential programming, 2018. Presented at Domains XIII Workshop, slides available at https://andrejbauer.github.io/domains-floc-2018/slides/Matthijs-Kammar-Staton.pdf.

[25] Bart van Merriënboer, Dan Moldovan, and Alexander B. Wiltschko. Tangent: Automatic differentiation using source-code transformation for dynamically typed array programming. In Bengio et al. [4], pages 6259–6268.

[26] Fei Wang, James M. Decker, Xilun Wu, Grégory M. Essertel, and Tiark Rompf. Backpropagation with callbacks: Foundations for efficient and expressive differentiable programming. In Bengio et al. [4], pages 10201–10212.

[27] Fei Wang, Xilun Wu, Grégory M. Essertel, James M. Decker, and Tiark Rompf. Demystifying differentiable programming: Shift/reset the penultimate backpropagator. CoRR, abs/1803.10228, 2018. http://arxiv.org/abs/1803.10228.

A Appendix

  T[Float]         = Float
  T[Tensor(Float)] = Tensor(Float)
  T[(τ, σ)]        = (T[τ], T[σ])
  T[τ ⇝ σ]         = [(τ, T[σ])]
  T[Unit]          = Unit
  T[Tensor(Int)]   = Unit
  T[Int]           = Unit
  T[Bool]          = Unit

  0_T[Float]         = 0
  0_T[Tensor(Float)] = 0
  0_T[(τ, σ)]        = (0_T[τ], 0_T[σ])
  0_T[τ ⇝ σ]         = []
  0_T[τ]             = ()           (remaining "discrete" types)

  x1 +_T[Float] x2         = x1 + x2
  x1 +_T[Tensor(Float)] x2 = x1 + x2
  (x11, x12) +_T[(τ, σ)] (x21, x22) = (x11 +_T[τ] x21, x12 +_T[σ] x22)
  x1 +_T[τ ⇝ σ] x2         = x1 ++ x2

  Figure 6: Tangent spaces for λ∂ (simply-typed)
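For reference, Figure 6's tangent spaces transcribe almost verbatim into a Haskell closed type family (our sketch; Tensor is left out, and D is the rep-function type from earlier sketches):

  {-# LANGUAGE TypeFamilies, FlexibleInstances #-}

  newtype D a b = D (a -> (b, Tan b -> Tan a))

  type family Tan t where
    Tan Float   = Float
    Tan (a, b)  = (Tan a, Tan b)
    Tan (D a b) = [(a, Tan b)]   -- function tangents: call logs
    Tan ()      = ()
    Tan Int     = ()
    Tan Bool    = ()

  -- 0 and (+) at every tangent type (lower half of Figure 6)
  class Tangent t where
    zeroT :: t
    plusT :: t -> t -> t

  instance Tangent Float where
    zeroT = 0
    plusT = (+)

  instance Tangent () where
    zeroT = ()
    plusT _ _ = ()

  instance (Tangent a, Tangent b) => Tangent (a, b) where
    zeroT = (zeroT, zeroT)
    plusT (a1, b1) (a2, b2) = (plusT a1 a2, plusT b1 b2)

  instance Tangent [(a, t)] where
    zeroT = []
    plusT = (++)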