
arXiv:2002.08241v1 [cs.PL] 19 Feb 2020

A Differential-form Pullback Programming Language for Higher-order Reverse-mode Automatic Differentiation

Carol Mak and Luke Ong

Abstract. Building on the observation that reverse-mode automatic differentiation (AD), a generalisation of backpropagation, can naturally be expressed as pullbacks of differential 1-forms, we design a simple higher-order programming language with a first-class differential operator, and present a reduction strategy which exactly simulates reverse-mode AD. We justify our reduction strategy by interpreting our language in any differential λ-category that satisfies the Hahn-Banach Separation Theorem, and show that the reduction strategy precisely captures reverse-mode AD in a truly higher-order setting.

1 Introduction

Automatic differentiation (AD) [34] is widely considered the most efficient and accurate algorithm for computing derivatives, thanks largely to the chain rule. There are two modes of AD:

• Forward-mode AD evaluates the chain rule from inputs to outputs; it has time complexity that scales with the number of inputs, and constant space complexity.
• Reverse-mode AD, a generalisation of backpropagation, evaluates the chain rule (in dual form) from outputs to inputs; it has time complexity that scales with the number of outputs, and space complexity that scales with the number of intermediate variables.

In machine learning applications such as neural networks, the number of input parameters is usually considerably larger than the number of outputs. For this reason, reverse-mode AD has been the preferred method of differentiation, especially in deep learning applications. (See Baydin et al. [5] for an excellent survey of AD.)

The only downside of reverse-mode AD is its rather involved definition, which has led to a variety of complicated implementations in neural networks. On the one hand, TensorFlow [1] and Theano [3] employ the define-and-run approach, where the model is constructed as a computational graph before execution. On the other hand, PyTorch [25] and Autograd [20] employ the define-by-run approach, where the computational graph is constructed dynamically during the execution.

Can we replace the traditional graphical representation by a simple yet expressive framework? Indeed, there have been calls from the neural network community [14, 19, 24] for the development of differentiable programming, based on a higher-order functional language with a built-in differential operator that returns the derivative of a given program of the language via reverse-mode AD. Such a language would free the programmer from the implementational details of differentiation. Programmers would be able to concentrate on the construction of their machine learning models, and train them by calling the built-in differential operator on the cost models.

The goal of this work is to present a simple higher-order programming language with an explicit differential operator, such that its reduction semantics is exactly reverse-mode AD, in a truly higher-order manner.

The syntax of our language is inspired by Ehrhard and Regnier [15]'s differential λ-calculus, which is an extension of the simply-typed λ-calculus with a differential operator that mimics standard symbolic differentiation (but not reverse-mode AD). Their definition of differentiation via a linear substitution provides a good foundation for our language.

Our starting point (Section 2.2) is the observation that the computation of reverse-mode AD can naturally be expressed as pullbacks of differential 1-forms. We argue that this viewpoint is essential for understanding reverse-mode AD in a functional setting. Standard reverse-mode AD (as presented in [4, 5]) is only defined in Euclidean spaces.

The reduction strategy of our language uses differential λ-category [11] (the model of differential λ-calculus) as a guide. Differential λ-category is a Cartesian closed differential category [9], and hence enjoys the fundamental properties of derivatives, and behaves well with exponentials (curry).

Contributions. We present (in Section 3) a simple higher-order programming language, extending the simply-typed λ-calculus [12] with an explicit differential operator called the pullback, (Ω λx.P) · S, which serves as a reverse-mode AD simulator.

Using differential λ-category [11] as a guide, we design a reduction strategy for our language so that the reduction of the application ((Ω λx.P) · (λx.ep*)) S mimics reverse-mode AD in computing the p-th row of the Jacobian matrix (derivative) of the function λx.P at the point S, where ep is the column vector with 1 at the p-th position and 0 everywhere else. Moreover, we show how our reduction semantics can be adapted to a continuation passing style evaluation (Section 3.5).

Owing to the higher-order nature of our language, standard differential calculus is not enough to model our language and hence cannot justify our reductions. Our final contribution (in Section 4) is to show that any differential λ-category [11] that satisfies the Hahn-Banach Separation Theorem is a model of our language (Theorem 4.6). Our reduction semantics is faithful to reverse-mode AD, in that it is exactly reverse-mode AD when restricted to first-order; moreover we can perform reverse-mode AD on any higher-order abstraction, which may contain higher-order terms, duals, pullbacks, and free variables as subterms (Corollary 4.8).

Finally, we discuss related work in Section 5, and conclusion and future directions in Section 6.

Throughout this paper, we will point to the attached Appendix for additional content. All proofs are in Appendix E, unless stated otherwise.

2 Reverse-mode Automatic Differentiation

We introduce forward- and reverse-mode automatic differentiation (AD), highlighting their respective benefits in practice. Then we explain how reverse-mode AD can naturally be expressed as the pullback of differential 1-forms. (The examples used to illustrate these methods are collated in Figure 4.)

2.1 Forward- and Reverse-mode AD

Recall that the Jacobian matrix of a smooth real-valued function f : Rⁿ → Rᵐ at x0 ∈ Rⁿ is the m-by-n matrix

    J(f)(x0) := ( ∂f1/∂z1|x0  ∂f1/∂z2|x0  ···  ∂f1/∂zn|x0
                  ∂f2/∂z1|x0  ∂f2/∂z2|x0  ···  ∂f2/∂zn|x0
                      ⋮           ⋮        ⋱       ⋮
                  ∂fm/∂z1|x0  ∂fm/∂z2|x0  ···  ∂fm/∂zn|x0 )

where fj := πj ∘ f : Rⁿ → R. We call the function J : C∞(Rⁿ, Rᵐ) → C∞(Rⁿ, L(Rⁿ, Rᵐ)) the Jacobian;¹ J(f) the Jacobian of f; J(f)(x) the Jacobian of f at x; J(f)(x)(v) the Jacobian of f at x along v ∈ Rⁿ; and λx.J(f)(x)(v) the Jacobian of f along v.

¹C∞(A, B) is the set of all smooth functions from A to B, and L(A, B) is the set of all linear functions from A to B, for Euclidean spaces A and B.

Symbolic Differentiation. Numerical derivatives are standardly computed using symbolic differentiation: first compute ∂fj/∂zi for all i, j using rules (e.g. product and chain rules), then substitute x0 for z to obtain J(f)(x0).

For example, to compute the Jacobian of f : ⟨x, y⟩ ↦ ((x+1)(2x+y²))² at ⟨1, 3⟩ by symbolic differentiation, first compute ∂f/∂x = 2(x+1)(2x+y²)(2x+y²+2(x+1)) and ∂f/∂y = 2(x+1)(2x+y²)(2y(x+1)). Then substitute 1 for x and 3 for y to obtain J(f)(⟨1, 3⟩) = [660 528].

Symbolic differentiation is accurate but inefficient. Notice that the term (x+1) appears twice in ∂f/∂x, and (1+1) is evaluated twice in ∂f/∂x (because for h : ⟨x,y⟩ ↦ (x+1)(2x+y²), both h(⟨x,y⟩) and ∂h/∂x contain the term (x+1), and the product rule tells us to calculate them separately). This duplication is a cause of the so-called expression swell problem, resulting in exponential time-complexity.

Automatic Differentiation. Automatic differentiation (AD) avoids this problem by a simple divide-and-conquer approach: first arrange f as a composite of elementary² functions g1, ..., gk (i.e. f = gk ∘ ··· ∘ g1), then compute the Jacobian of each of these elementary functions, and finally combine them via the chain rule to yield the desired Jacobian of f.

²in the sense of being easily differentiable

Forward-mode AD. Recall the chain rule:

    J(f)(x0) = J(gk)(x_{k−1}) × ··· × J(g2)(x1) × J(g1)(x0)

for f = gk ∘ ··· ∘ g1, where xi := gi(x_{i−1}). Forward-mode AD computes the Jacobian matrix J(f)(x0) by calculating αi := J(gi)(x_{i−1}) × α_{i−1} and xi := gi(x_{i−1}), with α0 := I (identity matrix) and x0. Then αk = J(f)(x0) is the Jacobian of f at x0. This computation can neatly be presented as an iteration of the ⟨·|·⟩-reduction, ⟨x | α⟩ →g ⟨g(x) | J(g)(x) × α⟩, for g = g1, ..., gk, starting from the pair ⟨x0 | I⟩. Besides being easy to implement, forward-mode AD computes the new pair from the current pair ⟨x | α⟩, requiring no additional memory.
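To make the ⟨x | α⟩ iteration concrete, here is a minimal sketch in Python/NumPy (our own illustration, not the paper's language), run on the example below using its decomposition f = (−)² ∘ (∗) ∘ g with g(⟨x,y⟩) = ⟨x+1, 2x+y²⟩:

    import numpy as np

    # Each stage is a pair (g, Jg) of an elementary function and its Jacobian.
    steps = [
        (lambda v: np.array([v[0] + 1, 2 * v[0] + v[1] ** 2]),   # g
         lambda v: np.array([[1.0, 0.0], [2.0, 2 * v[1]]])),     # J(g)
        (lambda v: np.array([v[0] * v[1]]),                      # mult
         lambda v: np.array([[v[1], v[0]]])),                    # J(mult)
        (lambda v: np.array([v[0] ** 2]),                        # pow2
         lambda v: np.array([[2 * v[0]]])),                      # J(pow2)
    ]

    x, alpha = np.array([1.0, 3.0]), np.eye(2)   # start from <x0 | I>
    for g, Jg in steps:
        # <x | alpha> -> <g(x) | J(g)(x) x alpha>; both g and Jg see the old x
        x, alpha = g(x), Jg(x) @ alpha

    print(x, alpha)   # [484.]  [[660. 528.]], i.e. J(f)(<1,3>)

Starting from α0 = I, the loop yields α = [660 528], matching the Jacobian computed symbolically above.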

To compute the Jacobian of f : ⟨x,y⟩ ↦ ((x+1)(2x+y²))² at ⟨1,3⟩ by forward-mode AD, first decompose f into elementary functions as R² →g R² →∗ R →(−)² R, where g(⟨x,y⟩) := ⟨x+1, 2x+y²⟩. Then, starting from ⟨⟨1,3⟩ | I⟩, iterate the ⟨·|·⟩-reduction

    ⟨⟨1,3⟩ | (1 0; 0 1)⟩ →g ⟨⟨2,11⟩ | (1 0; 2 6)⟩ →∗ ⟨22 | (15 12)⟩ →(−)² ⟨484 | (660 528)⟩

yielding [660 528] as the Jacobian of f at ⟨1,3⟩. Notice that (1+1) is only evaluated once, even though its result is used in various calculations.

In practice, because storing the intermediate matrices αi can be expensive, the matrix J(f)(x0) is computed column-by-column, by simply changing the starting pair from ⟨x0 | I⟩ to ⟨x0 | ep⟩, where ep ∈ Rⁿ is the column vector with 1 at the p-th position and 0 everywhere else. Then the computation becomes a reduction of a vector-vector pair, and αk = J(f)(x0) × ep is the p-th column of the Jacobian matrix J(f)(x0). Since J(f)(x0) is an m-by-n matrix, n runs are required to compute the whole Jacobian matrix.

For example, if we start from ⟨⟨1,3⟩ | (1, 0)ᵀ⟩, the reduction

    ⟨⟨1,3⟩ | (1, 0)ᵀ⟩ →g ⟨⟨2,11⟩ | (1, 2)ᵀ⟩ →∗ ⟨22 | 15⟩ →(−)² ⟨484 | 660⟩

gives us the first column of the Jacobian matrix J(f)(⟨1,3⟩).

Reverse-mode AD. By contrast, reverse-mode AD computes the dual of the Jacobian matrix, (J(f)(x0))*, using the chain rule in dual (transpose) form

    (J(f)(x0))* = (J(g1)(x0))* × ··· × (J(gk)(x_{k−1}))*

as follows: first compute xi := gi(x_{i−1}) for i = 1, ..., k−1 (Forward Phase); then compute βi := (J(gi)(x_{i−1}))* × β_{i+1} for i = k, ..., 1, with β_{k+1} := I (Reverse Phase).

For example, the reverse-mode AD computation on f is as follows.

    Forward Phase: ⟨1,3⟩ →g ⟨2,11⟩ →∗ 22 →(−)² 484
    Reverse Phase: (660, 528)ᵀ ←g (484, 88)ᵀ ←∗ 44 ←(−)² I

In practice, like forward-mode AD, the matrix (J(f)(x0))* is computed column-by-column, by simply setting β_{k+1} := πp, where πp ∈ L(Rᵐ, R) is the p-th projection. Thus a run (comprising Forward and Reverse Phase) computes (J(f)(x0))*(πp), the p-th row of the Jacobian of f at x0. It follows that m runs are required to compute the m-by-n Jacobian matrix.

In many machine learning (e.g. deep learning) problems, the functions f : Rⁿ → Rᵐ we need to differentiate have many more inputs than outputs, in the sense that n ≫ m. Whenever this is the case, reverse-mode AD is more efficient than forward-mode.

Remark 2.1. Unlike forward-mode AD, we cannot interleave the iteration of xi and the computation of βi. In fact, according to Hoffmann [18], nobody knows how to do reverse-mode AD using pairs ⟨·|·⟩, as employed by forward-mode AD to great effect. In other words, reverse-mode AD does not seem presentable as an in-place algorithm.

2.2 Geometric Perspective of Reverse-mode AD

Reverse-mode AD can naturally be expressed using pullbacks and differential 1-forms, as alluded to by Betancourt [7] and discussed in [26].

Let E := Rⁿ and F := Rᵐ. A differential 1-form of E is a smooth map ω ∈ C∞(E, L(E, R)). Denote the set of all differential 1-forms of E as ΩE, e.g. λx.πp ∈ Ω Rᵐ. (Henceforth, by 1-form, we mean differential 1-form.) The pullback of a 1-form ω ∈ ΩF along a smooth map f : E → F is the 1-form Ω(f)(ω) ∈ ΩE where

    Ω(f)(ω) : E → L(E, R),  x ↦ (J(f)(x))*(ω(f x)).

Notice the result of an iteration of reverse-mode AD, (J(f)(x0))*(πp), can be expressed as Ω(f)(λx.πp)(x0), which can be expanded to (Ω(g1) ∘ ··· ∘ Ω(gk))(λx.πp)(x0). Hence reverse-mode AD can be expressed as: first iterate the reduction of 1-forms, ω →g Ω(g)(ω), for g = gk, ..., g1, starting from the 1-form λx.πp; then compute ω0(x0), which yields the p-th row of J(f)(x0).

Returning to our example:

    Ω(f)(λx.1*)(⟨1,3⟩)
    = (Ω(g) ∘ Ω(∗) ∘ Ω((−)²))(λx.1*)(⟨1,3⟩)
    = (J(g)(⟨1,3⟩))* ((Ω(∗) ∘ Ω((−)²))(λx.1*)(⟨2,11⟩))
    = (J(g)(⟨1,3⟩))* ((J(∗)(⟨2,11⟩))* (Ω((−)²)(λx.1*)(22)))
    = (J(g)(⟨1,3⟩))* ((J(∗)(⟨2,11⟩))* ((J((−)²)(22))* ((λx.1*)(484))))
    = (J(g)(⟨1,3⟩))* ((J(∗)(⟨2,11⟩))* (44))
    = (J(g)(⟨1,3⟩))* ((484, 88))
    = (660, 528)

which is the Jacobian J(f)(⟨1,3⟩).

The pullback-of-1-forms perspective gives us a way to perform reverse-mode AD beyond Euclidean spaces (for example on the function sum : List(R) → R, which returns the sum of the elements of a list); and it shapes our language and reduction presented in Section 3. (Example 3.2 shows how sum can be defined in our language and Appendix A.2 shows how reverse-mode AD can be performed on sum.)
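The iterated-pullback reading can likewise be sketched in Python (ours, not the paper's language). Note that this naive functional reading recomputes the forward values at every layer, whereas actual reverse-mode AD caches the Forward Phase (cf. Remark 2.1):

    import numpy as np

    # A 1-form maps a point to a row covector; the pullback along g is
    # Omega(g)(omega) = lambda x: omega(g(x)) @ J(g)(x).
    steps = [  # (g, J(g)) for f = pow2 . mult . g, as above
        (lambda v: np.array([v[0] + 1, 2 * v[0] + v[1] ** 2]),
         lambda v: np.array([[1.0, 0.0], [2.0, 2 * v[1]]])),
        (lambda v: np.array([v[0] * v[1]]),
         lambda v: np.array([[v[1], v[0]]])),
        (lambda v: np.array([v[0] ** 2]),
         lambda v: np.array([[2 * v[0]]])),
    ]

    def pullback(g, Jg, omega):
        return lambda x: omega(g(x)) @ Jg(x)

    omega = lambda x: np.array([[1.0]])    # the constant 1-form x |-> pi_1
    for g, Jg in reversed(steps):          # builds Omega(g1)(... Omega(gk)(omega))
        omega = pullback(g, Jg, omega)

    print(omega(np.array([1.0, 3.0])))     # [[660. 528.]], i.e. (J(f)(<1,3>))*(pi_1)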


Remark 2.2. Pullbacks can be generalised to arbitrary p-forms, using essentially the same approach. However the pullbacks of general p-forms no longer resemble reverse-mode AD as it is commonly understood.

3 A Differential-form Pullback Programming Language

3.1 Syntax

Figure 1 presents the grammar of simple terms S and pullback terms P, and Figure 2 presents the type system. While the definition of simple terms is relatively standard (except for the new constructs, which will be discussed later), the definition of pullback terms P as sums of simple terms is not.

    Simple terms   S ::= x | λx.S | S P | πi(S) | ⟨S, S⟩ | r | f(P)
                       | J f · S | (λx.S)* · S | (Ω (λx.P)) · S | r*
    Pullback terms P ::= 0 | S | S + P

Figure 1. Grammar of simple terms S and pullback terms P. Assume a collection V of variables (typically x, y, z, ω), and a collection F (typically f, g, h) of easily-differentiable real-valued functions, in the sense that the Jacobian of f, J(f), can be called by the language; r and r range over R and Rⁿ respectively.

3.1.1 Sum and Linearity

The idea of sum is important since it specifies the "linear positions" in a simple term, just as it specifies the algebraic notion of linearity in mathematics. For example, x(y + z) is a S term but (x + y)z is not. This is because (x + y)z is the same as xz + yz, but x(y + z) cannot be so decomposed. Hence in S P, S is in a linear position but P is not. Similarly, in mathematics (f1 + f2)(x1) = f1(x1) + f2(x1) but in general f1(x1 + x2) ≠ f1(x1) + f1(x2), for smooth functions f1, f2 and points x1, x2. Hence the function f in an application f(x) is in a linear position while the argument x is not.

Formally we define the set lin(S) of linear variables in a simple term S by: y ∈ lin(S) if, and only if, y is in a linear position in S.

    lin(x) := {x}
    lin(λx.S) := lin(S) \ {x}
    lin(S P) := lin(S) \ FV(P)
    lin(πi(S)) := lin(S)
    lin(⟨S1, S2⟩) := lin(S1) ∩ lin(S2)
    lin(J f · S) := lin(S)
    lin((λx.S1)* · S2) := (lin(S1) \ FV(S2)) ∪ (lin(S2) \ FV(S1))
    lin(S) := ∅ otherwise.

For example, lin(x z (y z)) = {x}.

3.1.2 Dual Type, Jacobian, Dual Map and Pullback

Any term of the dual type σ* is considered a linear functional of σ. For example, ep* has the dual type Rⁿ*; the term ep* mimics the linear functional πp ∈ L(Rⁿ, R).

The Jacobian J f · S is considered as the Jacobian of f along S, which is a smooth function. For example, let f : Rᵐ → Rⁿ be "easily differentiable"; then J f · v mimics the Jacobian along v, i.e. the function λx.J(f)(x)(v).

The dual map (λx.S1)* · S2 is considered the dual of the linear functional S2 along the function λx.S1, where x ∈ lin(S1). For example, let r ∈ Rᵐ. The dual map (λv.(J f · v) r)* · ep* mimics (J(f)(r))*(πp) ∈ L(Rᵐ, R), which is the dual of πp along the Jacobian J(f)(r).

The pullback (Ω λx.P) · S is considered the pullback of the 1-form S along the function λx.P. For example, (Ω λx.f(x)) · (λx.ep*) mimics Ω(f)(λx.πp) ∈ Ω(Rᵐ), which is the pullback of the 1-form λx.πp along f.

Hence, to perform reverse-mode AD on a term λx.P at P′ with respect to ω, we consider the term ((Ω λx.P) · ω) P′.

3.1.3 Notations

We use syntactic sugar to ease writing. For n ≥ 1 and z a fresh variable:

    Rⁿ⁺¹ ≡ Rⁿ × R                     Ωσ ≡ σ ⇒ σ*
    (r1, ..., rn)ᵀ ≡ ⟨r1, ..., rn⟩     ⟨P1, P2, P3⟩ ≡ ⟨⟨P1, P2⟩, P3⟩
    Sπi ≡ πi(S)                        let x = t in s ≡ (λx.s) t
    Ω r ≡ λx.r*                        λ⟨x,y⟩.S ≡ λz.S[zπ1/x][zπ2/y]

Capture-free substitution is applied recursively, e.g. ((λx.S1)* · S2)[P′/z] ≡ (λx.S1[P′/z])* · (S2[P′/z]) and ((Ω λx.P) · S)[P′/z] ≡ (Ω (λx.P[P′/z])) · (S[P′/z]). We treat 0 as the unit of our sum terms, i.e. 0 ≡ 0 + 0, S ≡ 0 + S and S ≡ S + 0; and consider + as an associative and commutative operator. We also define S[S1 + S2/y] ≡ S[S1/y] + S[S2/y] if and only if y ∈ lin(S). For example, (S1 + S2) P ≡ S1 P + S2 P.

We finish this subsection with some examples that can be expressed in this language.

Example 3.1. Consider the running example of computing the Jacobian of f : ⟨x,y⟩ ↦ ((x+1)(2x+y²))² at ⟨1,3⟩. Assume g(⟨x,y⟩) := ⟨x+1, 2x+y²⟩, and that mult and pow2 are in the set of easily differentiable functions, i.e. g, mult, pow2 ∈ F. The function f can be presented by the term {⟨x,y⟩ : R²} ⊢ pow2(mult(g(⟨x,y⟩))) : R. More interestingly, the Jacobian of f at ⟨1,3⟩, i.e. J(f)(⟨1,3⟩), can be presented by the term

    ⊢ ((Ω λ⟨x,y⟩.pow2(mult(g(⟨x,y⟩)))) · (Ω 1)) ⟨1,3⟩ : R²*.

This is the application of the pullback Ω(f)(λx.1*) to the point ⟨1,3⟩, which we saw in Subsection 2.2 is the Jacobian of f at ⟨1,3⟩.
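As an aside, the lin(·) computation above is entirely mechanical; here is a small Python sketch (ours, not the paper's implementation) over a fragment of Figure 1's grammar, encoded with hypothetical dataclass constructors:

    from dataclasses import dataclass

    @dataclass
    class Var:  name: str                # x
    @dataclass
    class Lam:  x: str; body: object     # λx.S
    @dataclass
    class App:  fn: object; arg: object  # S P
    @dataclass
    class Pair: fst: object; snd: object # ⟨S, S⟩

    def fv(s):
        if isinstance(s, Var): return {s.name}
        if isinstance(s, Lam): return fv(s.body) - {s.x}
        if isinstance(s, App): return fv(s.fn) | fv(s.arg)
        if isinstance(s, Pair): return fv(s.fst) | fv(s.snd)

    def lin(s):
        if isinstance(s, Var): return {s.name}
        if isinstance(s, Lam): return lin(s.body) - {s.x}
        if isinstance(s, App): return lin(s.fn) - fv(s.arg)   # arguments are not linear
        if isinstance(s, Pair): return lin(s.fst) & lin(s.snd)

    # lin(x z (y z)) = {'x'}, as in the example above:
    print(lin(App(App(Var("x"), Var("z")), App(Var("y"), Var("z")))))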

Types:  σ, τ ::= R | σ1 × σ2 | σ1 ⇒ σ2 | σ*

Typing rules (writing Γ, x : σ for Γ ∪ {x : σ}):

    Γ ⊢ 0 : σ
    Γ ⊢ S : σ and Γ ⊢ P : σ            ⟹  Γ ⊢ S + P : σ
    Γ, x : σ ⊢ x : σ
    Γ, x : σ ⊢ S : τ                    ⟹  Γ ⊢ λx.S : σ ⇒ τ
    Γ ⊢ S : σ ⇒ τ and Γ ⊢ P : σ        ⟹  Γ ⊢ S P : τ
    Γ ⊢ S : σ1 × σ2                     ⟹  Γ ⊢ πi(S) : σi
    Γ ⊢ S1 : σ1 and Γ ⊢ S2 : σ2        ⟹  Γ ⊢ ⟨S1, S2⟩ : σ1 × σ2
    r ∈ R                               ⟹  Γ ⊢ r : R
    Γ ⊢ P : Rⁿ                          ⟹  Γ ⊢ f(P) : Rᵐ
    Γ ⊢ S : Rⁿ                          ⟹  Γ ⊢ J f · S : Rⁿ ⇒ Rᵐ
    r ∈ Rⁿ                              ⟹  Γ ⊢ r* : Rⁿ*
    Γ, x : σ ⊢ S1 : τ, Γ ⊢ S2 : τ*, x ∈ lin(S1)  ⟹  Γ ⊢ (λx.S1)* · S2 : σ*
    Γ, x : σ ⊢ P : τ and Γ ⊢ S : Ωτ    ⟹  Γ ⊢ (Ω λx.P) · S : Ωσ

Figure 2. The types and typing rules for DPPL, where Ωσ ≡ σ ⇒ σ* and f : Rⁿ → Rᵐ is easily differentiable, i.e. f ∈ F.

Example 3.2. Consider the function that takes a list of real numbers and returns the sum of the elements of the list. Using the standard Church encoding of List, i.e. List(X) ≡ (X → D → D) → (D → D), and [x1, x2, ..., xn] ≡ λf d.f xn (... (f x2 (f x1 d))) for some dummy type D, sum : List(R) → R is defined to be λl.l (λxy.x + y) 0. Hence the Jacobian of sum at the list [7, −1] can be expressed as

    {ω : Ω(List(R))} ⊢ ((Ω (sum)) · ω) [7, −1] : R*.

Now the question is how we could perform reverse-mode AD on this term. Recall that the result of a reverse-mode AD run on a function f : Rⁿ → Rᵐ at x ∈ Rⁿ, i.e. the p-th row of the Jacobian matrix of f at x, can be expressed as Ω(f)(λx.πp)(x), which is (J(f)(x))*((λx.πp)(f x)) = (J(f)(x))* × πp.

In the rest of this Section, we consider how the term ((Ω λy.P′) · ω) P, which mimics Ω(f)(ω)(x), can be reduced. To avoid expression swell, we first perform A-reduction, P′ →A* L, which decomposes a term into a series of "smaller" terms, as explained in Subsection 3.2. Then, we reduce ((Ω λy.L) · ω) P by induction on L, as explained in Subsection 3.3. Lastly, we complete our reduction strategy in Subsection 3.4.

We use the term in Example 3.1 as a running example in our reduction strategy to illustrate that this reduction is faithful to reverse-mode AD (in that it is exactly reverse-mode AD when restricted to first-order). The reduction of the term in Example 3.2 is given in Appendix A.2; it illustrates how reverse-mode AD can be performed on a higher-order function.

3.2 Divide: Administrative Reduction

We use the administrative reduction (A-reduction) of Sabry and Felleisen [28] to decompose a pullback term P into a let series L of elementary terms, i.e.

    P →A* let x1 = E1; ... ; xn = En in xn,

where elementary terms E and let series L are defined as

    E ::= 0 | z1 + z2 | z | λx.L | z1 z2 | zπi | ⟨z1, z2⟩ | r | f(z)
        | J f · z | (λx.L)* · z | (Ω λx.L) · z | r*
    L ::= let z = E in L | let z = E in z

Note that elementary terms E should be "fine enough" to avoid expression swell. The complete set of A-reductions on P can be found in Appendix B. We write →A* for the reflexive and transitive closure of →A.

Example 3.3. We decompose the term considered in Example 3.1, pow2(mult(g(⟨x,y⟩))), via administrative reduction:

    pow2(mult(g(⟨x,y⟩)))  →A*  let z1 = ⟨x,y⟩;
                                   z2 = g(z1);
                                   z3 = mult(z2);
                                   z4 = pow2(z3) in z4.

This is reminiscent of the decomposition of f into R² →g R² →∗ R →(−)² R before performing AD.
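For intuition, A-reduction here plays the role of the A-normal-form (ANF) pass familiar from compilers, flattening nested applications into a Wengert-style list of elementary bindings. A minimal Python sketch (ours; the tuple-based AST is hypothetical, not the paper's syntax):

    from itertools import count

    fresh = (f"z{i}" for i in count(1))

    def anf(expr, bindings):
        """expr is a nested tuple, e.g. ("pow2", ("mult", ...)); returns the
        variable naming expr, extending bindings in evaluation order."""
        if isinstance(expr, str):
            return expr                   # already a variable
        op, *args = expr
        arg_vars = [anf(a, bindings) for a in args]
        z = next(fresh)
        bindings.append((z, op, arg_vars))
        return z

    lets = []
    result = anf(("pow2", ("mult", ("g", ("pair", "x", "y")))), lets)
    # lets == [('z1', 'pair', ['x', 'y']), ('z2', 'g', ['z1']),
    #          ('z3', 'mult', ['z2']), ('z4', 'pow2', ['z3'])]; result == 'z4'

The resulting binding list is exactly the let series of Example 3.3.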

Let Series:
(7)  (Ω (λy.let x = E in x)) · ω  →  (Ω λy.E) · ω
(8)  (Ω (λy.let x = E in L)) · ω  →  (Ω λy.⟨y, E⟩) · ((Ω λ⟨y,x⟩.L) · ω)

Constant Functions:
(9)  ((Ω λy.E) · ω) V  →  0,   for y ∉ FV(E)

Linear Functions:
(10a) ((Ω λy.z + yπj) · ω) V  →  (λv.vπj)* · (ω (z + Vπj))
(10b) ((Ω λy.yπi + yπj) · ω) V  →  (λv.vπi + vπj)* · (ω (Vπi + Vπj))
(11)  ((Ω λy.y) · ω) V  →  (λv.v)* · (ω V)
(12)  ((Ω λy.yπi) · ω) V  →  (λv.vπi)* · (ω Vπi)
(13a) ((Ω λy.⟨yπi, z⟩) · ω) V  →  (λv.⟨vπi, 0⟩)* · (ω ⟨Vπi, z⟩)
(13b) ((Ω λy.⟨z, yπj⟩) · ω) V  →  (λv.⟨0, vπj⟩)* · (ω ⟨z, Vπj⟩)
(13c) ((Ω λy.⟨yπi, yπj⟩) · ω) V  →  (λv.⟨vπi, vπj⟩)* · (ω ⟨Vπi, Vπj⟩)
(14)  ((Ω λy.J f · yπi) · ω) V  →  (λv.J f · vπi)* · (ω (J f · Vπi))

Function Symbols:
(15)  ((Ω λy.f(yπi)) · ω) V  →  (λv.(J f · vπi) Vπi)* · (ω (f(Vπi)))

Dual Maps:
(16a) ((Ω λy.(λx.L)* · yπi) · ω) V  →  (λv.(λx.L)* · vπi)* · (ω ((λx.L)* · Vπi)),   if y ∉ FV(λx.L)
(16b) if ((Ω λy.L) · ω′) V →* (λv.S)* · (ω′ V′) and y ∉ FV(z), then
      ((Ω λy.(λx.L)* · z) · ω) V  →  (λv.(λx.S)* · z)* · (ω ((λx.L[V/y])* · z))
(16c) if ((Ω λy.L) · ω′) V →* (λv.S)* · (ω′ V′), then
      ((Ω λy.(λx.L)* · yπi) · ω) V  →  (λv.(λx.L[V/y])* · vπi + (λx.S)* · Vπi)* · (ω ((λx.L[V/y])* · Vπi))

Pullback Terms:
(17)  if ((Ω λx.L) · z) a →* (λv.S)* · (z L[a/x]), then
      ((Ω λy.(Ω λx.L) · z) · ω) V  →  ((Ω λy.λa.(λv.S)* · (z L[a/x])) · ω) V

Abstraction:
(18)  if ((Ω λy.L) · ω′) V →* (λv.S)* · (ω′ L[V/y]) and x ∉ FV(V), then
      ((Ω λy.λx.L) · ω) V  →  (λv.λx.S)* · (ω (λx.L[V/y]))

Application:
(19a) ((Ω λy.yπi z) · ω) V  →  (λv.vπi z)* · (ω (Vπi z))
(19b) if ((Ω λz.P′) · ω′) Vπj →* (λv′.S′)* · (ω′ (P′[Vπj/z])) and Vπi ≡ λz.P′, then
      ((Ω λy.yπi yπj) · ω) V  →  (λv.vπi Vπj + S′[vπj/v′])* · (ω (Vπi Vπj))
(19c) if ((Ω λz.V′) · ω′) Vπj →* 0 and Vπi ≡ λz.V′, then
      ((Ω λy.yπi yπj) · ω) V  →  (λv.vπi Vπj)* · (ω (Vπi Vπj))

Pair:
(20a) if ((Ω λy.E) · ω′) V →* (λv.S)* · (ω′ (E[V/y])) and y ∈ FV(E), then
      ((Ω λy.⟨y, E⟩) · ω) V  →  (λv.⟨v, S⟩)* · (ω ⟨V, E[V/y]⟩)
(20b) ((Ω λy.⟨y, E⟩) · ω) V  →  (λv.⟨v, 0⟩)* · (ω ⟨V, E⟩),   for y ∉ FV(E)

Figure 3. Pullback Reductions

3.3 Conquer: Pullback Reduction

3.3.1 Let Series

After decomposing P′ to a let series L of elementary terms via A-reductions in (Ω λy.P′) · ω, we reduce (Ω λy.L) · ω by induction on L as shown in Figure 3 (Let Series). Reduction 7 is the base case and Reduction 8 expresses the contra-variant property of pullbacks.

Example 3.4. Take (Ω λ⟨x,y⟩.pow2(mult(g(⟨x,y⟩)))) · ω, discussed in Example 3.1, which when applied to the point ⟨1,3⟩ is the Jacobian J(f)(⟨1,3⟩) where f(⟨x,y⟩) := ((x+1)(2x+y²))². In Example 3.3, we showed that pow2(mult(g(⟨x,y⟩))) is A-reduced to a let series L. Now via Reductions 7 and 8, (Ω λ⟨x,y⟩.L) · ω is reduced to a series of pullbacks along elementary terms:

    (Ω λ⟨x,y⟩. let z1 = ⟨x,y⟩; z2 = g(z1); z3 = mult(z2); z4 = pow2(z3) in z4) · ω
    →*  (Ω λ⟨x,y⟩.⟨⟨x,y⟩, ⟨x,y⟩⟩) ·
        ((Ω λ⟨⟨x,y⟩, z1⟩.⟨⟨x,y⟩, z1, g(z1)⟩) ·
        ((Ω λ⟨⟨x,y⟩, z1, z2⟩.⟨⟨x,y⟩, z1, z2, mult(z2)⟩) ·
        ((Ω λ⟨⟨x,y⟩, z1, z2, z3⟩.pow2(z3)) · ω)))

Via A-reductions and Reductions 7 and 8, (Ω λy.P′) · ω is reduced to a series of pullbacks along elementary terms, (Ω λy.E1) · (... ((Ω λy.En) · ω)). Now we define the reduction of a pullback along an elementary term when applied to a value³, i.e. ((Ω λy.E) · ω) V.

Recall that the pullback of a 1-form ω ∈ Ω(F) along a smooth map f : E → F is defined to be Ω(f)(ω) : x ↦ (J(f)(x))*(ω(f x)). Hence we have the following pullback reduction

    ((Ω λy.E) · ω) V  →  (λv.S)* · (ω (E[V/y]))

of the application ((Ω λy.E) · ω) V, which mimics the pullback of a variable ω along an abstraction λy.E at a term V. But how should one define the simple term S in (λv.S)* · (ω (E[V/y])) so that λv.S mimics the Jacobian of f at x, i.e. J(f)(x)? We do so by induction on the elementary terms E, as shown in Figure 3, Reductions 9-20.

³A value is a normal form of the reduction strategy. Its definition will be made precise in the next subsection.

Remark 3.5. For readers familiar with differential λ-calculus [15], S is the result of substituting a linear occurrence of y by v, and then substituting all free occurrences of y by V in the term E. Our approach is different from differential λ-calculus in that we define a reduction strategy instead of a substitution. A comprehensive comparison between our language and differential λ-calculus is given in Section 5.

3.3.2 Constant Functions

If y is not a free variable in E, then λy.E mimics a constant function. The Jacobian of a constant function is 0, hence we reduce ((Ω λy.E) · ω) V to (λv.0)* · (ω (E[V/y])), which is the sugar for 0, as shown in Figure 3 (Constant Functions), Reduction 9. The redexes ((Ω λy.0) · ω) V, ((Ω λy.r) · ω) V and ((Ω λy.r*) · ω) V all reduce to 0. Henceforth, we assume y ∈ FV(E).

3.3.3 Linear Functions

We consider the redexes where y ∈ lin(E). Then λy.E mimics a linear function, whose Jacobian is itself. Hence ((Ω λy.E) · ω) V is reduced to (λv.S)* · (ω (E[V/y])), where S is the result of substituting y by v in E. Figure 3 (Linear Functions), Reductions 10-14, shows how they are reduced.

3.3.4 Smooth Functions

Now consider the redexes where y might not be a linear variable in E. All reductions are shown in Figure 3.

Function Symbols. Let f be "easily differentiable". Then λy.f(yπi) mimics f ∘ πi, whose Jacobian at x is J(f)(πi(x)) ∘ πi. Hence the Jacobian of λy.f(yπi) at V is λv.(J f · vπi) Vπi, and ((Ω λy.f(yπi)) · ω) V is reduced to (λv.(J f · vπi) Vπi)* · (ω (f(Vπi))), as shown in Reduction 15.

Dual Maps. Consider the Jacobian of λy.(λx.L)* · z at V. It is easy to see that the result varies depending on where the variable y is located in the dual map (λx.L)* · z. We consider three cases.

First, if y ∉ FV(λx.L), we must have z ≡ yπi. Then y is a linear variable in (λx.L)* · yπi and so the Jacobian of λy.(λx.L)* · yπi at V is λv.(λx.L)* · vπi. Hence we have Reduction 16a.

Second, say y ∉ FV(z). Since dual and abstraction are both linear operations, and y is only free in L, the Jacobian of λy.(λx.L)* · z at V should be λv.(λx.S′)* · z, where λv.S′ is the Jacobian of λy.L at V. To find the Jacobian of λy.L at V, we reduce ((Ω λy.L) · ω) V to (λv.S′)* · (ω L[V/y]); then λv.S′ is the Jacobian of λy.L at V. The reduction is given in Reduction 16b. Note that this reduction avoids expression swell, as we are reducing the let series L in λy.(λx.L)* · z using our pullback reductions, which do not suffer from expression swell.

Finally, for y ∈ FV(λx.L) ∩ FV(z), the Jacobian of λy.(λx.L)* · z at V is the "sum" of the results we have for the two cases above, i.e. λv.(λx.L)* · vπi + (λx.S)* · yπi, where the remaining free occurrences of y are substituted by V, since the Jacobian of a bilinear function l : X1 × X2 → Y is J(l)(⟨x1, x2⟩)(⟨v1, v2⟩) = l⟨x1, v2⟩ + l⟨v1, x2⟩. Hence we have Reduction 16c.

Pullback Terms. Consider ((Ω λy.(Ω λx.L) · z) · ω) V. Instead of reducing it to some (λv.S)* · (ω (((Ω λx.L) · z)[V/y])) like the others, here we simply reduce ((Ω λx.L) · z) a to (λv.S)* · (z L[a/x]), where a is a fresh variable and z ≠ x, and replace (Ω λx.L) · z by λa.(λv.S)* · (z L[a/x]) in ((Ω λy.(Ω λx.L) · z) · ω) V, as shown in Reduction 17.

Abstraction. Consider the Jacobian of λy.λx.L at V.

We follow the treatment of exponentials in differential λ-category [11], where the (D-curry) rule states that for all f : Y × X → A, D[cur(f)] = cur(D[f] ∘ ⟨π1 × 0X, π2 × IdX⟩), which means J(cur(f))(y) is equal to

    λv.J(cur(f))(y)(v) = λv.λx.J(f⟨−, x⟩)(y)(v).

According to this (D-curry) rule, the Jacobian of λy.λx.L at V should be λv.λx.S, where λv.S is the Jacobian of λy.L at V. Hence, similarly to the dual map case, we first reduce ((Ω λy.L) · ω) V to (λv.S)* · (ω L[V/y]) and obtain the Jacobian of λy.L at V, i.e. λv.S, and then reduce ((Ω λy.λx.L) · ω) V to (λv.λx.S)* · (ω (λx.L[V/y])), as shown in Reduction 18.

Application. Consider the Jacobian of λy.z1 z2 at V. Note that z1 and z2 may or may not contain y as a free variable. Hence, there are two cases.

First, we consider λy.yπi z where z is fresh. Since y ∈ lin(yπi z), λy.yπi z mimics a linear function, and hence its Jacobian at V is λv.vπi z. So ((Ω λy.yπi z) · ω) V is reduced to (λv.vπi z)* · (ω (Vπi z)), as shown in Reduction 19a.

Second, we consider the Jacobian of λy.yπi yπj at V. Now y is not a linear variable in yπi yπj, since it occurs in the argument yπj. As proved in Lemma 4.4 of [21], every differential λ-category satisfies the (D-eval) rule, D[ev ∘ ⟨h, g⟩] = ev ∘ ⟨D[h], g ∘ π2⟩ + D[uncur(h)] ∘ ⟨⟨0, D[g]⟩, ⟨π2, g ∘ π2⟩⟩, which means J(ev ∘ ⟨h, g⟩)(x)(v) is equal to

    J(h)(x)(v)(g(x)) + J(h(x))(g(x))(J(g)(x)(v))

for all h : C → (A ⇒ B) and g : C → A. Hence, the Jacobian of ev ∘ ⟨πi, πj⟩ at x along v, i.e. J(ev ∘ ⟨πi, πj⟩)(x)(v), is πi(v)(πj(x)) + J(πi(x))(πj(x))(πj(v)). So the Jacobian of λy.yπi yπj at V is λv.vπi Vπj + S′[vπj/v′], where λv′.S′ is the Jacobian of Vπi at Vπj. Hence, assuming Vπi ≡ λz.P′, we first reduce ((Ω λz.P′) · ω) Vπj to (λv′.S′)* · (ω (P′[Vπj/z])) and obtain λv′.S′ as the Jacobian of λz.P′ at Vπj. Then we reduce ((Ω λy.yπi yπj) · ω) V to (λv.vπi Vπj + S′[vπj/v′])* · (ω (Vπi Vπj)), as shown in Reduction 19b.

If ((Ω λz.V′) · ω) Vπj reduces to 0, which means λz.V′ ≡ Vπi is a constant function, the Jacobian of λy.yπi yπj at V is just λv.vπi Vπj, and we have Reduction 19c.

Remark 3.6. Doing induction on the elementary terms defined in Subsection 3.2, we can see that there are a few elementary terms E for which ((Ω λy.E) · ω) V is not a redex, namely:

    value 1: ((Ω λy.z yπi) · ω) V, where z is a free variable;
    value 2: ((Ω λy.yπi yπj) · ω) V, where Vπi ≢ λz.P′.

Having these terms as values makes sense intuitively, since they have "inappropriate" values in function positions. Value 1 has a free variable z in a function position. Value 2 substitutes yπi by Vπi, which is a non-abstraction, in a function position.

Pair. Last but not least, we consider the Jacobian of λy.⟨y, E⟩ at V. It is easy to see that the Jacobian is λv.⟨v, S⟩, where λv.S is the Jacobian of λy.E at V, as shown in Reductions 20a and 20b.

Example 3.7. Take our running example. In Examples 3.3 and 3.4 we showed that, via A-reductions and Reductions 7 and 8, (Ω λ⟨x,y⟩.pow2(mult(g(⟨x,y⟩)))) · ω is reduced to

    (Ω λ⟨x,y⟩.⟨⟨x,y⟩, ⟨x,y⟩⟩) ·
    ((Ω λ⟨⟨x,y⟩, z1⟩.⟨⟨x,y⟩, z1, g(z1)⟩) ·
    ((Ω λ⟨⟨x,y⟩, z1, z2⟩.⟨⟨x,y⟩, z1, z2, mult(z2)⟩) ·
    ((Ω λ⟨⟨x,y⟩, z1, z2, z3⟩.pow2(z3)) · ω)))

We show how it can be reduced when applied to ⟨1,3⟩:

    (the term above) ⟨1,3⟩
    →(20a)  (λ⟨v1,v2⟩.⟨⟨v1,v2⟩, ⟨v1,v2⟩⟩)* ·
            (((Ω λ⟨⟨x,y⟩, z1⟩.⟨⟨x,y⟩, z1, g(z1)⟩) · (...)) ⟨⟨1,3⟩, ⟨1,3⟩⟩)
    →(20a, 15, 3)  (λ⟨v1,v2⟩.⟨⟨v1,v2⟩, ⟨v1,v2⟩⟩)* ·
            ((λ⟨⟨v1,v2⟩, v3⟩.⟨⟨v1,v2⟩, v3, (Jg · v3)⟨1,3⟩⟩)* ·
            (((Ω λ⟨⟨x,y⟩, z1, z2⟩.⟨⟨x,y⟩, z1, z2, mult(z2)⟩) · (...)) ⟨⟨1,3⟩, ⟨1,3⟩, ⟨2,11⟩⟩))    (⋆)
    →(20a, 15, 3)  ... ·
            ((λ⟨⟨v1,v2⟩, v3, v4⟩.⟨⟨v1,v2⟩, v3, v4, (Jmult · v4)⟨2,11⟩⟩)* ·
            (((Ω λ⟨⟨x,y⟩, z1, z2, z3⟩.pow2(z3)) · ω) ⟨⟨1,3⟩, ⟨1,3⟩, ⟨2,11⟩, 22⟩))
    →(15, 3)  (λ⟨v1,v2⟩.⟨⟨v1,v2⟩, ⟨v1,v2⟩⟩)* ·
            ((λ⟨⟨v1,v2⟩, v3⟩.⟨⟨v1,v2⟩, v3, (Jg · v3)⟨1,3⟩⟩)* ·
            ((λ⟨⟨v1,v2⟩, v3, v4⟩.⟨⟨v1,v2⟩, v3, v4, (Jmult · v4)⟨2,11⟩⟩)* ·
            ((λ⟨⟨v1,v2⟩, v3, v4, v5⟩.(Jpow2 · v5) 22)* · (ω 484))))

Notice how this is reminiscent of the forward phase of reverse-mode AD performed on f : ⟨x,y⟩ ↦ ((x+1)(2x+y²))² at ⟨1,3⟩, considered in Subsection 2.1.

Moreover, we used the reduction f(r) →(3) f(r) a couple of times in the argument position of an application. This is to avoid expression swell. Note that 1+1 is only evaluated once in (⋆), even though the result is used in various computations. Hence, we must have a call-by-value reduction strategy, as presented below.

3.4 Combine

The reductions in Subsections 3.2 and 3.3 are the most interesting development of the paper. However, they alone are not enough to complete a reduction strategy. In this subsection, we define contexts and redexes so that any non-value term can be reduced.

The definition of context C is the standard call-by-value context, extended with duals and pullbacks. Notice that the context (Ω λy.CA) · S contains an A-context CA, defined in Subsection 3.2. This follows from the idea of reverse-mode AD to decompose a term into elementary terms before differentiating them.

    C ::= [] | C + P | V + C | C P | V C | πi(C) | ⟨C, S⟩ | ⟨V, C⟩
        | f(C) | J f · C | (λx.S)* · C | (λx.C)* · V
        | (Ω λy.CA) · S | (Ω λy.E) · C | (Ω λy.⟨y, E⟩) · C

Our redexes r extend the standard call-by-value redexes with four sets of terms:

    r ::= (λx.S) V | πi(⟨V1, V2⟩) | f(r) | (J f · r) r′
        | (λv.(J f · v) r)* · r′* | (λv1.V1)* · ((λv2.V2)* · V3)
        | (Ω λy.L) · S | ((Ω λy.E) · V1) V2 | ((Ω λy.⟨y, E⟩) · V1) V2

where either V2 ≢ (J f · v2) r or V3 ≢ r′*. A value V is a pullback term that cannot be reduced further, i.e. a term in normal form.

The following standard lemma, which is proved by induction on P, tells us that there is at most one redex to reduce.

Lemma 3.8. Every term P can be expressed either as C[r] for some unique context C and redex r, or as a value V.

Let's look at the reductions of redexes. (1-4) are the standard call-by-value reductions; (5) reduces the dual along a linear map; and (6) is the contra-variant property of dual maps.

    (1) (λx.S) V → S[V/x]
    (2) πi(⟨V1, V2⟩) → Vi
    (3) f(r) → f(r)
    (4) (J f · r) r′ → J(f)(r′)(r)
    (5) (λv.(J f · v) r)* · r′* → ((J(f)(r))*(r′))*
    (6) (λv1.V1)* · ((λv2.V2)* · V3) → (λv1.V2[V1/v2])* · V3

where either V2 ≢ (J f · v2) r or V3 ≢ r′*.

We say C[r] → C[V] if r → V, for all reductions except those with a proof tree, i.e. Reductions 16b, 16c, 17, 18, 19b, 19c and 20a. For those, given a rule with premise r →* V and conclusion r′ → V′, we have C[r′[V1/ω][V2/V]] → C[V′[V1/ω][V2/V]] whenever the corresponding premise instance r[V1/ω][V2/V] →* V[V1/ω][V2/V] holds.

Example 3.9. Consider our running example P ≡ ((Ω λ⟨x,y⟩.pow2(mult(g(⟨x,y⟩)))) · (Ω 1)) ⟨1,3⟩, which represents the Jacobian of f : ⟨x,y⟩ ↦ ((x+1)(2x+y²))² at ⟨1,3⟩, as shown in Example 3.1. Replacing ω by Ω 1 ≡ λx.1* in Examples 3.3, 3.4 and 3.7, P is reduced to

    (λ⟨v1,v2⟩.⟨⟨v1,v2⟩, ⟨v1,v2⟩⟩)* ·
    ((λ⟨⟨v1,v2⟩, v3⟩.⟨⟨v1,v2⟩, v3, (Jg · v3)⟨1,3⟩⟩)* ·
    ((λ⟨⟨v1,v2⟩, v3, v4⟩.⟨⟨v1,v2⟩, v3, v4, (Jmult · v4)⟨2,11⟩⟩)* ·
    ((λ⟨⟨v1,v2⟩, v3, v4, v5⟩.(Jpow2 · v5) 22)* · ((λx.1*) 484))))

Via Reduction 5 and β-reduction (writing 0 for the zero of the appropriate type), P is further reduced step by step:

    →  ... · ((λ⟨⟨v1,v2⟩, v3, v4⟩.⟨⟨v1,v2⟩, v3, v4, (Jmult · v4)⟨2,11⟩⟩)* · ⟨0, 0, 0, 0, 44⟩*)
    →  ... · ((λ⟨⟨v1,v2⟩, v3⟩.⟨⟨v1,v2⟩, v3, (Jg · v3)⟨1,3⟩⟩)* · ⟨0, 0, 0, ⟨484, 88⟩⟩*)
    →  (λ⟨v1,v2⟩.⟨⟨v1,v2⟩, ⟨v1,v2⟩⟩)* · ⟨0, 0, ⟨660, 528⟩⟩*
    →  ⟨660, 528⟩*

Notice how this mimics the reverse phase of reverse-mode AD on f : ⟨x,y⟩ ↦ ((x+1)(2x+y²))² at ⟨1,3⟩, considered in Subsection 2.1.

Examples 3.3, 3.4 and 3.7 demonstrate that our reduction strategy is faithful to reverse-mode AD (in that it is exactly reverse-mode AD when restricted to first-order).

3.5 Continuation-Passing Style

Differential 1-forms ΩE := C∞(E, L(E, R)) are similar to continuations of E with "answer" type R. We can indeed write our reduction in a continuation-passing style (CPS) manner. Let ⟨P | S⟩y ≡ (Ω λy.P) · S; then we can treat ⟨P | S⟩y as a configuration of an element Γ ∪ {y : σ} ⊢ P : τ and a "continuation" Γ ⊢ S : Ωτ. The rules for the redexes ⟨L | S⟩y, (⟨E | V1⟩y) V2 and (⟨⟨y, E⟩ | V1⟩y) V2 can be directly converted from Reductions 7-20. For example, Reduction 8 can be written as ⟨let x = E in L | ω⟩y → ⟨⟨y, E⟩ | ⟨L | ω⟩⟨y,x⟩⟩y.

We prefer to write our language without the explicit mention of CPS, since this paper focuses on the syntactic notion of reverse-mode AD using pullbacks and 1-forms. Also, a 1-form of type σ is more precisely described as an element of the function type Ωσ ≡ σ ⇒ σ*, than of the continuation of σ, i.e. σ ⇒ (σ ⇒ R).

4 Model

We show that any differential λ-category satisfying the Hahn-Banach Separation Theorem can soundly model our language.

4.1 Differential Lambda-Category

Cartesian differential categories [9] aim to axiomatise fundamental properties of the derivative. Indeed, any model of synthetic differential geometry has an associated Cartesian differential category [13].

Cartesian differential category. A category C is a Cartesian differential category if
• every homset C(A, B) is enriched with a commutative monoid (C(A, B), +AB, 0AB) and the additive structure is preserved by composition on the left, i.e. (g + h) ∘ f = g ∘ f + h ∘ f and 0 ∘ f = 0;
• it has products, and projections and pairings of additive maps are additive (a morphism f is additive if it preserves the additive structure of the homset on the right, i.e. f ∘ (g + h) = f ∘ g + f ∘ h and f ∘ 0 = 0);

and it has an operator D[−] : C(A, B) → C(A × A, B) that satisfies the following axioms:

[CD1] D is linear: D[f + g] = D[f] + D[g] and D[0] = 0
[CD2] D is additive in its first coordinate: D[f] ∘ ⟨h + k, v⟩ = D[f] ∘ ⟨h, v⟩ + D[f] ∘ ⟨k, v⟩ and D[f] ∘ ⟨0, v⟩ = 0
[CD3] D behaves with projections: D[Id] = π1, D[π1] = π1 ∘ π1 and D[π2] = π2 ∘ π1
[CD4] D behaves with pairings: D[⟨f, g⟩] = ⟨D[f], D[g]⟩
[CD5] Chain rule: D[g ∘ f] = D[g] ∘ ⟨D[f], f ∘ π2⟩
[CD6] D[f] is linear in its first component: D[D[f]] ∘ ⟨⟨g, 0⟩, ⟨h, k⟩⟩ = D[f] ∘ ⟨g, k⟩
[CD7] Independence of order of partial differentiation: D[D[f]] ∘ ⟨⟨0, h⟩, ⟨g, k⟩⟩ = D[D[f]] ∘ ⟨⟨0, g⟩, ⟨h, k⟩⟩

We call D the Cartesian differential operator of C.

Example 4.1. The category FVect of finite dimensional vector spaces and differentiable functions is a Cartesian differential category, with the Cartesian differential operator D[f]⟨v, x⟩ = J(f)(x)(v).

A Cartesian differential operator does not necessarily behave well with exponentials. Hence, Bucciarelli et al. [11] added the (D-curry) rule and introduced differential λ-categories.

Differential λ-category. A Cartesian differential category is a differential λ-category if
• it is Cartesian closed,
• λ(−) preserves the additive structure, i.e. λ(f + g) = λ(f) + λ(g) and λ(0) = 0,
• D[−] satisfies the (D-curry) rule: for any f : A1 × A2 → B, D[λ(f)] = λ(D[f] ∘ ⟨π1 × 0A2, π2 × IdA2⟩).

Linearity. A morphism f in a differential λ-category is linear if D[f] = f ∘ π1.

Example 4.2. The category Con∞ of convenient vector spaces and smooth maps, considered by [8], is a differential λ-category with the Cartesian differential operator D[f]⟨v, x⟩ := lim_{t→0} (f(x + tv) − f(x))/t, as shown in Lemma E.2.

4.2 Hahn-Banach Separation Theorem

We say a differential λ-category C satisfies the Hahn-Banach Separation Theorem if R is an object in C and, for any object A in C and distinct elements x, y in A, there exists a linear morphism l : A → R that separates x and y, i.e. l(x) ≠ l(y).

Example 4.3. The category Con∞ of convenient vector spaces and smooth maps satisfies the Hahn-Banach Separation Theorem, as shown in Proposition E.3.

4.3 Interpretation

Let C be a differential λ-category that satisfies the Hahn-Banach Separation Theorem. Since C is Cartesian closed, the interpretations of the λ-calculus terms are standard, and hence omitted. The full set of interpretations can be found in Appendix C.

    ⟦R⟧ := R            ⟦σ1 × σ2⟧ := ⟦σ1⟧ × ⟦σ2⟧
    ⟦σ*⟧ := L(⟦σ⟧, R)   ⟦σ1 ⇒ σ2⟧ := C(⟦σ1⟧, ⟦σ2⟧)

where L(⟦σ⟧, R) := {f ∈ C(⟦σ⟧, R) | D[f] = f ∘ π1} is the set of all linear morphisms from ⟦σ⟧ to R.

    ⟦0⟧γ := 0
    ⟦⟨r1, ..., rn⟩*⟧γ := λ⟨v1, ..., vn⟩. Σ_{i=1}^{n} ri vi
    ⟦S + P⟧γ := ⟦S⟧γ + ⟦P⟧γ
    ⟦(λx.S1)* · S2⟧γ := λv.⟦S2⟧γ (⟦S1⟧⟨γ, v⟩)
    ⟦(Ω λx.P) · S⟧γ := λx.λv.⟦S⟧γ (⟦P⟧⟨γ, x⟩) (D[cur(⟦P⟧)γ]⟨v, x⟩)
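To see that this interpretation is precisely the pullback of 1-forms from Subsection 2.2, the following short calculation (ours, under the Con∞ reading of Example 4.2, where D[f]⟨v, x⟩ = J(f)(x)(v)) unfolds the definition, writing f := cur(⟦P⟧)γ and ω := ⟦S⟧γ:

    ⟦(Ω λx.P) · S⟧γ = λx.λv. ω(f x)(D[f]⟨v, x⟩)
                    = λx.λv. ω(f x)(J(f)(x)(v))
                    = λx. (J(f)(x))*(ω(f x))
                    = Ω(f)(ω).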

4.4 Correctness

We verify our definitions of linearity and substitution in Lemma 4.4 and Lemma 4.5 respectively.

Lemma 4.4 (Linearity). Let Γ1 ∪ {x : σ1} ⊢ P1 : τ and Γ2 ⊢ P2 : σ*. Let γ1 ∈ ⟦Γ1⟧ and γ2 ∈ ⟦Γ2⟧. Then,
1. if x ∈ lin(P1), then cur(⟦P1⟧)γ1 is linear, i.e. D[cur(⟦P1⟧)γ1] = (cur(⟦P1⟧)γ1) ∘ π1;
2. ⟦P2⟧γ2 is linear, i.e. D[⟦P2⟧γ2] = (⟦P2⟧γ2) ∘ π1.

Lemma 4.5 (Substitution). ⟦Γ ⊢ S[P/x] : τ⟧ = ⟦Γ ∪ {x : σ} ⊢ S : τ⟧ ∘ ⟨Id⟦Γ⟧, ⟦Γ ⊢ P : σ⟧⟩.

Any differential λ-category satisfying the Hahn-Banach Separation Theorem is a sound model of our language. Note that the Hahn-Banach Separation Theorem is crucial in the proof.

Theorem 4.6 (Correctness of Reductions). Let Γ ⊢ P : σ.
1. P →A P′ implies ⟦P⟧ = ⟦P′⟧.
2. P → P′ implies ⟦P⟧ = ⟦P′⟧.

Proof. The full proof can be found in Appendix E. For 2, we proceed by case analysis on the reductions of pullback terms. Consider Reduction 19b. Let γ ∈ ⟦Γ⟧. By the induction hypothesis, and Vπi ≡ λz.P′, we have ⟦((Ω λz.P′) · ω) Vπj⟧ = ⟦(λv′.S′)* · (ω (P′[Vπj/z]))⟧, which means that for any 1-form φ and any v,

    φ(⟦P′⟧⟨γ, ⟦Vπj⟧γ⟩)(D[cur(⟦P′⟧)γ]⟨v, ⟦Vπj⟧γ⟩) = φ(⟦P′⟧⟨γ, ⟦Vπj⟧γ⟩)(⟦S′⟧⟨γ, v⟩).

Let l be a linear morphism to R; then λx.l is a 1-form, and hence we have l(D[cur(⟦P′⟧)γ]⟨v, ⟦Vπj⟧γ⟩) = l(⟦S′⟧⟨γ, v⟩). By the contrapositive of the Hahn-Banach Separation Theorem, this implies D[cur(⟦P′⟧)γ]⟨v, ⟦Vπj⟧γ⟩ = ⟦S′⟧⟨γ, v⟩.

Note that by (D-eval) in [21], D[ev ∘ ⟨πi, πj⟩]⟨v, x⟩ = πi(v)(πj(x)) + D[πi(x)]⟨πj(v), πj(x)⟩. Hence we have

    ⟦((Ω λy.yπi yπj) · ω) V⟧γ
    = λv.⟦ω⟧γ (⟦yπi yπj⟧⟨γ, ⟦V⟧γ⟩) (D[ev ∘ ⟨πi, πj⟩]⟨v, ⟦V⟧γ⟩)
    = λv.⟦ω⟧γ (⟦Vπi Vπj⟧γ) (vπi(⟦Vπj⟧γ) + D[⟦Vπi⟧γ]⟨vπj, ⟦Vπj⟧γ⟩)
    = λv.⟦ω⟧γ (⟦Vπi Vπj⟧γ) (vπi(⟦Vπj⟧γ) + D[cur(⟦P′⟧)γ]⟨vπj, ⟦Vπj⟧γ⟩)
    = λv.⟦ω⟧γ (⟦Vπi Vπj⟧γ) (vπi(⟦Vπj⟧γ) + ⟦S′⟧⟨γ, vπj⟩)
    = ⟦(λv.vπi Vπj + S′[vπj/v′])* · (ω (Vπi Vπj))⟧γ.  ∎

A simple corollary of Theorem 4.6 is that types are invariant under reductions.

Corollary 4.7 (Subject Reduction). For any pullback terms P and P′ where P → P′: if Γ ⊢ P : σ, then Γ ⊢ P′ : σ.

4.5 Reverse-mode AD

Recall that performing reverse-mode AD on a real-valued function f : Rⁿ → Rᵐ at a point x0 ∈ Rⁿ computes a row of the Jacobian matrix J(f)(x0), i.e. (J(f)(x0))*(πp).

The following corollary tells us that our reduction is faithful to reverse-mode AD (in that it is exactly reverse-mode AD when restricted to first-order) and that we can perform reverse-mode AD on any abstraction, which might contain higher-order terms, duals, pullbacks and free variables.

Corollary 4.8. Let Γ ∪ {y : σ} ⊢ P1 : τ, Γ ⊢ P2 : σ, and γ ∈ ⟦Γ⟧.
1. Let σ ≡ Rⁿ and τ ≡ Rᵐ. If ((Ω λy.P1) · ep*) P2 →* V, then the p-th row of the Jacobian matrix of ⟦P1⟧⟨γ, −⟩ at ⟦P2⟧γ is (⟦V⟧γ)*.
2. Let l be a linear morphism from ⟦τ⟧ to R. If ((Ω λy.P1) · ω) P2 →* (λv.P1′)* · (ω P2′) for some fresh variable ω, then the derivative of l ∘ (⟦P1⟧⟨γ, −⟩) at ⟦P2⟧γ along some v ∈ ⟦σ⟧ is l(⟦P1′⟧⟨γ, λx.l, v⟩), i.e. D[l ∘ (⟦P1⟧⟨γ, −⟩)]⟨v, ⟦P2⟧γ⟩ = l(⟦P1′⟧⟨γ, λx.l, v⟩).

Example 4.9. In Example 3.9, we showed that ((Ω λ⟨x,y⟩.pow2(mult(g(⟨x,y⟩)))) · (Ω 1)) ⟨1,3⟩ →* ⟨660, 528⟩*. Note that (660 528) is exactly the Jacobian matrix of f : ⟨x,y⟩ ↦ ((x+1)(2x+y²))² at ⟨1,3⟩.

5 Related Work

We discuss recent works on calculi and languages that provide differentiation capabilities.

5.1 Differential Lambda-Calculus

The standard bearer is none other than the differential λ-calculus [15], which has inspired the design of our language. The implementation induced by differential λ-calculus is a form of symbolic differentiation, which suffers from expression swell. For this reason, Manzyuk [22] introduced the perturbative λ-calculus, a λ-calculus with a forward-mode AD operator. Our language is complementary to these calculi, in that it implements higher-order reverse-mode AD; moreover, it is call-by-value, which is crucial for reverse-mode AD to avoid expression swell, as illustrated in Example 3.7.

What is the relationship between our language and differential λ-calculus? We can give a precise answer via a compositional translation (−)t to a differential λ-calculus extended with real numbers, function symbols, pairs and projections, defined as follows:

    s, t ::= x | λx.s | s T | Ds · t | πi(s) | ⟨s, t⟩ | r | f(T) | Df · t
    S, T ::= 0 | s | s + T        where r ∈ R, f ∈ F

The major cases of the definition of (−)t are:

    (σ*)t := σ ⇒ R
    (⟨r1, ..., rn⟩*)t := λv. Σ_{i=1}^{n} fi(πi(v))
    (J f · S)t := Df · St

    ((λy.S1)* · S2)t := λv.(S2)t ((λy.(S1)t) v)
    ((Ω λy.P) · S)t := λx.λv. St ((λy.Pt) x) ((D(λy.Pt) · v) x)

for fi := ri × −. (The definitions are provided in full in Appendix D.)

Because differential λ-calculus does not have a linear function type, (S1)t is no longer in a linear position in ((λx.S1)* · S2)t. Though the translation does not preserve linearity, it does preserve reductions and interpretations (Lemma 5.1).

Lemma 5.1. Let P be a term.
1. If P → P′, then there exists a reduct s of P′t such that Pt →* s in LD.
2. ⟦P⟧ = ⟦Pt⟧ in C.

A corollary of Lemma 5.1 (1) is that our reduction strategy is strongly normalizing.

Corollary 5.2 (Strong Normalization). Any reduction sequence from any term is finite, and ends in a value.

5.2 Differentiable Programming Languages

Encouraged by calls [14, 19, 24] from the machine learning community, the development of reverse-mode AD programming languages has been an active research problem. Following Pearlmutter and Siskind [27], these languages usually treat reverse-mode AD as a meta-operator on programs.

First-order. Elliott [16] gives a categorical presentation of reverse-mode AD. Using a functor over Cartesian categories, he presents a neat implementation of reverse-mode AD. As is well-known, conditionals do not behave well with smoothness [6]; nor do loops and recursion. Abadi and Plotkin [2] address this problem via a first-order language with conditionals, recursively defined functions, and a construct for reverse-mode AD. Using real analysis, they prove the coincidence of operational and denotational semantics. To our knowledge, these treatments of reverse-mode AD are restricted to first-order functions.

Towards higher-order. The first work that extends reverse-mode AD to higher orders is by Pearlmutter and Siskind [27]; they use a non-compositional program transformation to implement reverse-mode AD.

Inspired by Wang et al. [32, 33], Brunel et al. [10] study a simply-typed λ-calculus augmented with a notion of linear negation type. Though our dual type may resemble their linear negation, they are actually quite different. In fact, our work can be viewed as providing a positive answer to the last paragraph of [10, Sec. 7], where the authors address the relation between their work and differential lambda-calculus. They describe a "naïve" approach of expressing reverse-mode AD in differential lambda-calculus, in the sense that it suffers from "expression swell", which our approach does not (see Example 3.7). Moreover, Brunel et al. use a program transformation to perform reverse-mode AD, whereas we use a first-class differential operator. Brunel et al. [10] prove correctness for performing reverse-mode AD on real-valued functions (Theorem 5.6, Corollary 5.7 in [10]), whereas we allow any (higher-order) abstraction to be the argument of the pullback term and prove that the result of the reduction of such a pullback term is exactly the derivative of the abstraction (Corollary 4.8).

Building on Elliott [16]'s categorical presentation of reverse-mode AD, and Pearlmutter and Siskind [27]'s idea of differentiating higher-order functions, Vytiniotis et al. [31] developed an implementation of a simply-typed differentiable programming language.

However, all these treatments are not purely higher-order, in the sense that their differential operator can only compute the derivative of an "end to end" first-order program (which may be constructed using higher-order functions), but not the derivative of a higher-order function.

As far as we know, our work gives the first implementation of reverse-mode AD in a higher-order programming language that directly computes the derivative of higher-order functions using reverse-mode AD (Corollary 4.8 (2)).

6 Conclusion and Future Directions

After outlining the mathematical foundation of reverse-mode AD as the pullback of differential 1-forms (Section 2.2), we presented a simple higher-order programming language with an explicit differential operator, (Ω (λx.P)) · S (Subsection 3.1), and a call-by-value reduction strategy to divide (A-reductions in Subsection 3.2), conquer (pullback reductions in Subsection 3.3) and combine (Subsection 3.4) the term ((Ω (λx.P)) · ω) S, such that its reduction exactly mimics reverse-mode AD. Examples are given to illustrate that our reduction is faithful to reverse-mode AD. Moreover, we showed how our reduction can be adapted to a CPS evaluation (Subsection 3.5).

We showed (in Section 4) that any differential λ-category that satisfies the Hahn-Banach Separation Theorem is a sound model of our language (Theorem 4.6), and how our reduction precisely captures the notion of reverse-mode AD, in both first-order and higher-order settings (Corollary 4.8).

Future Directions. An interesting direction is to extend our language with probability, so that it can serve as a compiler intermediate representation for "deep" probabilistic frameworks such as Edward [29] and Pyro [30]. Inference algorithms that require the computation of gradients, such as Hamiltonian Monte Carlo and variational inference, on which Edward and Pyro rely, can be expressed in such a language, allowing us to prove correctness.

References
[1] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek Gordon Murray, Benoit Steiner, Paul A. Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2016. TensorFlow: A System for Large-Scale Machine Learning. In 12th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2016, Savannah, GA, USA, November 2-4, 2016. 265-283. https://www.usenix.org/conference/osdi16/technical-sessions/presentation/abadi
[2] Martín Abadi and Gordon D. Plotkin. 2020. A simple differentiable programming language. PACMPL 4 (2020), 38:1-38:28. https://doi.org/10.1145/3371106
[3] Rami Al-Rfou, Guillaume Alain, Amjad Almahairi, et al. 2016. Theano: A Python framework for fast computation of mathematical expressions. CoRR abs/1605.02688 (2016). arXiv:1605.02688 http://arxiv.org/abs/1605.02688
[4] F. Bauer. 1974. Computational Graphs and Rounding Error. SIAM J. Numer. Anal. 11, 1 (1974), 87-96. https://doi.org/10.1137/0711010
[5] Atilim Gunes Baydin, Barak A. Pearlmutter, Alexey Andreyevich Radul, and Jeffrey Mark Siskind. 2017. Automatic Differentiation in Machine Learning: a Survey. J. Mach. Learn. Res. 18 (2017), 153:1-153:43. http://jmlr.org/papers/v18/17-468.html
[6] Thomas Beck and Herbert Fischer. 1994. The if-problem in automatic differentiation. J. Comput. Appl. Math. 50, 1 (1994), 119-131. https://doi.org/10.1016/0377-0427(94)90294-1
[7] Michael Betancourt. 2018. A geometric theory of higher-order automatic differentiation. arXiv preprint arXiv:1812.11592 (2018).
[8] Richard Blute, Thomas Ehrhard, and Christine Tasson. 2010. A convenient differential category. CoRR abs/1006.3140 (2010). arXiv:1006.3140 http://arxiv.org/abs/1006.3140
[9] Richard F. Blute, J. Robin B. Cockett, and Robert A. G. Seely. 2009. Cartesian differential categories. Theory and Applications of Categories 22, 23 (2009), 622-672.
[10] Alois Brunel, Damiano Mazza, and Michele Pagani. 2019. Backpropagation in the Simply Typed Lambda-calculus with Linear Negation. CoRR abs/1909.13768 (2019). arXiv:1909.13768 http://arxiv.org/abs/1909.13768
[11] Antonio Bucciarelli, Thomas Ehrhard, and Giulio Manzonetto. 2010. Categorical Models for Simply Typed Resource Calculi. Electr. Notes Theor. Comput. Sci. 265 (2010), 213-230. https://doi.org/10.1016/j.entcs.2010.08.013
[12] Alonzo Church. 1965. The Calculi of Lambda-Conversion. New York: Kraus Reprint Corporation.
[13] J. Robin B. Cockett and Geoff S. H. Cruttwell. 2014. Differential Structure, Tangent Structure, and SDG. Applied Categorical Structures 22, 2 (2014), 331-417. https://doi.org/10.1007/s10485-013-9312-0
[14] David Dalrymple. 2016. 2016: What do you consider the most interesting recent [scientific] news? What makes it important? https://www.edge.org/response-detail/26794. (2016). Accessed: 2020-01-07.
[15] Thomas Ehrhard and Laurent Regnier. 2003. The differential lambda-calculus. Theor. Comput. Sci. 309, 1-3 (2003), 1-41. https://doi.org/10.1016/S0304-3975(03)00392-X
[16] Conal Elliott. 2018. The simple essence of automatic differentiation. PACMPL 2, ICFP (2018), 70:1-70:29. https://doi.org/10.1145/3236765
[17] Alfred Frölicher and Andreas Kriegl. 1988. Linear spaces and differentiation theory. Chichester: Wiley.
[18] Philipp H. W. Hoffmann. 2016. A Hitchhiker's Guide to Automatic Differentiation. Numerical Algorithms 72, 3 (01 Jul 2016), 775-811. https://doi.org/10.1007/s11075-015-0067-6
[19] Yann LeCun. 2018. Deep Learning est mort. Vive Differentiable Programming! https://www.facebook.com/yann.lecun/posts/10155003011462143. (2018). Accessed: 2020-01-07.
[20] Dougal Maclaurin, David Duvenaud, and Ryan P. Adams. 2015. Autograd: Effortless Gradients in Numpy. Presented in AutoML Workshop, ICML.
[21] Giulio Manzonetto. 2012. What is a categorical model of the differential and the resource λ-calculi? Mathematical Structures in Computer Science 22, 3 (2012), 451-520. https://doi.org/10.1017/S0960129511000594
[22] Oleksandr Manzyuk. 2012. A Simply Typed λ-Calculus of Forward Automatic Differentiation. Electr. Notes Theor. Comput. Sci. 286 (2012), 257-272. https://doi.org/10.1016/j.entcs.2012.08.017


[23] Peter W. Michor and Andreas Kriegl. 1997. The convenient setting of global analysis. Providence, R.I.: American Mathematical Society.
[24] Christopher Olah. 2015. Neural Networks, Types, and Functional Programming. http://colah.github.io/posts/2015-09-NN-Types-FP/. (2015). Accessed: 2020-01-07.
[25] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Yang, Zach DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An Imperative Style, High-Performance Deep Learning Library. CoRR abs/1912.01703 (2019). arXiv:1912.01703 http://arxiv.org/abs/1912.01703
[26] Barak A. Pearlmutter. 2019. A Nuts-and-Bolts Differential Geometric Perspective on Automatic Differentiation. Presented in Languages for Inference Workshop, Cascais, Portugal.
[27] Barak A. Pearlmutter and Jeffrey Mark Siskind. 2008. Reverse-mode AD in a functional framework: Lambda the ultimate backpropagator. ACM Trans. Program. Lang. Syst. 30, 2 (2008), 7:1–7:36. https://doi.org/10.1145/1330017.1330018
[28] Amr Sabry and Matthias Felleisen. 1992. Reasoning About Programs in Continuation-Passing Style. In Proceedings of the Conference on Lisp and Functional Programming, LFP 1992, San Francisco, California, USA, 22-24 June 1992. ACM, 288–298. https://doi.org/10.1145/141471.141563
[29] Dustin Tran, Matthew D. Hoffman, Rif A. Saurous, Eugene Brevdo, Kevin Murphy, and David M. Blei. 2017. Deep Probabilistic Programming. CoRR abs/1701.03757 (2017). arXiv:1701.03757 http://arxiv.org/abs/1701.03757
[30] Uber. 2017. Pyro. (2017). Retrieved Nov 2018 from http://pyro.ai/
[31] Dimitrios Vytiniotis, Dan Belov, Richard Wei, Gordon Plotkin, and Martin Abadi. 2019. The Differentiable Curry. Presented in Program Transformations for Machine Learning Workshop, NeurIPS, Vancouver, Canada.
[32] Fei Wang, James M. Decker, Xilun Wu, Grégory M. Essertel, and Tiark Rompf. 2018. Backpropagation with Callbacks: Foundations for Efficient and Expressive Differentiable Programming. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, 3-8 December 2018, Montréal, Canada, Samy Bengio, Hanna M. Wallach, Hugo Larochelle, Kristen Grauman, Nicolò Cesa-Bianchi, and Roman Garnett (Eds.). 10201–10212. http://papers.nips.cc/book/advances-in-neural-information-processing-systems-31-2018
[33] Fei Wang, Daniel Zheng, James M. Decker, Xilun Wu, Grégory M. Essertel, and Tiark Rompf. 2019. Demystifying differentiable programming: shift/reset the penultimate backpropagator. PACMPL 3, ICFP (2019), 96:1–96:31. https://doi.org/10.1145/3341700
[34] R. E. Wengert. 1964. A simple automatic derivative evaluation program. Commun. ACM 7, 8 (1964), 463–464. https://doi.org/10.1145/355586.364791


Appendix

A Examples

A.1 Simple Example
We focus on how to compute the derivative of f : ⟨x,y⟩ ↦ ((x+1)(2x+y²))² at ⟨1,3⟩ by different modes of AD. First f is decomposed into elementary functions as

  R² --g--> R² --*--> R --(−)²--> R,   where g(⟨x,y⟩) := ⟨x+1, 2x+y²⟩.

Figure 4 summarizes the iterations of the different modes of AD. Now we show how Section 3 tells us how to perform reverse-mode AD on f.

Term  Assuming g, mult, pow2 ∈ F, we can define the following term in the language:

  ⊢ (Ω λ⟨x,y⟩.pow2(mult(g(⟨x,y⟩)))) · (Ω 1*) ⟨1,3⟩ : R*.

This term is the application of the pullback Ω(f)(λx.1*) to the point ⟨1,3⟩, which is exactly the Jacobian of f at ⟨1,3⟩.

Administrative Reduction  We decompose the term pow2(mult(g(⟨x,y⟩))), via administrative reduction, into a let series of elementary terms:

  pow2(mult(g(⟨x,y⟩))) −→A L ≡ let z1 = ⟨x,y⟩; z2 = g(z1); z3 = mult(z2); z4 = pow2(z3) in z4.

This is reminiscent of the decomposition of f into R² --g--> R² --*--> R --(−)²--> R before performing AD.
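To make the decomposition concrete, here is a minimal Python sketch (our own illustration, not part of the paper's calculus) of the elementary functions and the forward evaluation of f at ⟨1,3⟩:

```python
# A sketch (ours) of the decomposition f = pow2 . mult . g at <1,3>.
def g(x, y):
    return (x + 1, 2 * x + y * y)   # g<x,y> = <x+1, 2x+y^2>

def mult(a, b):
    return a * b                    # mult<a,b> = a*b

def pow2(z):
    return z * z                    # pow2(z) = z^2

a, b = g(1.0, 3.0)                  # <2, 11>
z = mult(a, b)                      # 22
print(pow2(z))                      # 484.0 = f<1,3>
```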

Splitting the Omega  Now, via Reductions 7 and 8, (Ω λ⟨x,y⟩.L) · ω is reduced to a series of pullbacks along elementary terms:

  (Ω λ⟨x,y⟩.⟨⟨x,y⟩, ⟨x,y⟩⟩) ·
    ((Ω λ⟨⟨x,y⟩, z1⟩.⟨⟨x,y⟩, z1, g(z1)⟩) ·
      ((Ω λ⟨⟨x,y⟩, z1, z2⟩.⟨⟨x,y⟩, z1, z2, mult(z2)⟩) ·
        ((Ω λ⟨⟨x,y⟩, z1, z2, z3⟩.pow2(z3)) · ω)))

Pullback Reduction  We showed that, via A-reductions and Reductions 7 and 8, (Ω λ⟨x,y⟩.pow2(mult(g(⟨x,y⟩)))) · ω is reduced to the series of pullbacks above. We now show how this series is reduced when applied to ⟨1,3⟩. Each Ω-pullback in turn is applied to the forward value computed so far, and reduces to a dual map recording the Jacobian of its elementary function at that value:

  ((Ω λ⟨x,y⟩.⟨⟨x,y⟩, ⟨x,y⟩⟩) · ⋯ · ((Ω λ⟨⟨x,y⟩, z1, z2, z3⟩.pow2(z3)) · ω)) ⟨1,3⟩
  −→* (λ⟨v1,v2⟩.⟨⟨v1,v2⟩, ⟨v1,v2⟩⟩)* ·
        (((Ω λ⟨⟨x,y⟩, z1⟩.⟨⟨x,y⟩, z1, g(z1)⟩) · ⋯ · ω) ⟨⟨1,3⟩, ⟨1,3⟩⟩)
  −→* (λ⟨v1,v2⟩.⟨⟨v1,v2⟩, ⟨v1,v2⟩⟩)* ·
        (λ⟨⟨v1,v2⟩, v3⟩.⟨⟨v1,v2⟩, v3, (J g · v3) ⟨1,3⟩⟩)* ·
          (((Ω λ⟨⟨x,y⟩, z1, z2⟩.⟨⟨x,y⟩, z1, z2, mult(z2)⟩) · ⋯ · ω) ⟨⟨1,3⟩, ⟨1,3⟩, ⟨2,11⟩⟩)
  −→* (λ⟨v1,v2⟩.⟨⟨v1,v2⟩, ⟨v1,v2⟩⟩)* ·
        (λ⟨⟨v1,v2⟩, v3⟩.⟨⟨v1,v2⟩, v3, (J g · v3) ⟨1,3⟩⟩)* ·
          (λ⟨⟨v1,v2⟩, v3, v4⟩.⟨⟨v1,v2⟩, v3, v4, (J mult · v4) ⟨2,11⟩⟩)* ·
            (λ⟨⟨v1,v2⟩, v3, v4, v5⟩.(J pow2 · v5) 22)* · (ω 484)    (⋆)

Notice how this is reminiscent of the forward phase of reverse-mode AD performed on f : ⟨x,y⟩ ↦ ((x+1)(2x+y²))² at ⟨1,3⟩ considered in Figure 4: the intermediate values ⟨2,11⟩, 22 and 484 are computed left to right, while the Jacobians of g, mult and pow2 at those values are recorded for the reverse phase. Moreover, we used the reduction f(r) −→ f(r), which evaluates a function symbol applied to numerals, a couple of times in the argument position of an application. This is to avoid expression swell: note that 1 + 1 is only evaluated once in (⋆), even when the result is used in various computations.
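The evaluate-once behaviour of a let-binding has a direct Python analogue (our illustration only); a side-effect counter makes the difference from naive re-evaluation visible:

```python
# Our illustration of why let-bindings avoid expression swell: the bound
# expression is evaluated once, however many times its value is used.
count = 0
def plus_one(x):
    global count
    count += 1
    return x + 1

plus_one(1.0) * plus_one(1.0)   # without sharing
print(count)                    # 2 evaluations

count = 0
z1 = plus_one(1.0)              # let z1 = 1 + 1 in z1 * z1
z1 * z1                         # with sharing
print(count)                    # 1 evaluation
```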

Naïve Forward Mode:
  ⟨⟨1,3⟩ | [[1,0],[0,1]]⟩ --g--> ⟨⟨2,11⟩ | [[1,0],[2,6]]⟩ --*--> ⟨22 | [15 12]⟩ --(−)²--> ⟨484 | [660 528]⟩

Forward Mode:
  ⟨⟨1,3⟩ | [1; 0]⟩ --g--> ⟨⟨2,11⟩ | [1; 2]⟩ --*--> ⟨22 | [15]⟩ --(−)²--> ⟨484 | [660]⟩

Reverse Mode:
  Forward Phase: ⟨1,3⟩ --g--> ⟨2,11⟩ --*--> 22 --(−)²--> 484
  Reverse Phase: [660; 528] <--g-- [484; 88] <--*-- [44] <--(−)²-- [1]

Pullback:
  (Ω(g) ∘ Ω(∗) ∘ Ω((−)²)) (λx.[1]) (⟨1,3⟩)
  = (J(g)(⟨1,3⟩))* ((Ω(∗) ∘ Ω((−)²)) (λx.[1]) (⟨2,11⟩))
  = (J(g)(⟨1,3⟩))* (J(∗)(⟨2,11⟩))* ((Ω((−)²)) (λx.[1]) (22))
  = (J(g)(⟨1,3⟩))* (J(∗)(⟨2,11⟩))* (J((−)²)(22))* ((λx.[1]) (484))
  = (J(g)(⟨1,3⟩))* (J(∗)(⟨2,11⟩))* [44]
  = (J(g)(⟨1,3⟩))* [484; 88]
  = [660; 528]

Figure 4. Different modes of automatic differentiation performed on the function f : ⟨x,y⟩ ↦ ((x+1)(2x+y²))² at ⟨1,3⟩, after f is decomposed into elementary functions: R² --g--> R² --*--> R --(−)²--> R, where g(⟨x,y⟩) := ⟨x+1, 2x+y²⟩.
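As a numerical cross-check of Figure 4, the naïve forward-mode row can be replayed with explicit Jacobian matrices in numpy (a sketch of ours, outside the calculus):

```python
import numpy as np

# Push full Jacobians through the chain rule, left to right.
x, y = 1.0, 3.0
Jg    = np.array([[1.0, 0.0],
                  [2.0, 2 * y]])      # J g at <1,3>     = [[1,0],[2,6]]
a, b  = x + 1, 2 * x + y * y          # g<1,3> = <2,11>
Jmult = np.array([[b, a]])            # J mult at <2,11> = [11 2]
Jpow2 = np.array([[2 * a * b]])       # J pow2 at 22     = [44]

print(Jpow2 @ Jmult @ Jg)             # [[660. 528.]]
```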

Combine  Replacing ω by Ω 1* ≡ λx.1*, we have shown so far that (Ω λ⟨x,y⟩.pow2(mult(g(⟨x,y⟩)))) · (Ω 1*) ⟨1,3⟩ is reduced to the series of dual maps in (⋆), with (ω 484) reduced to 1*. Now, via Reduction 5 and β-reduction, the dual maps are applied from right to left: 1* is pulled back through (J pow2 · −) 22 to [44], through (J mult · −) ⟨2,11⟩ to [484; 88], and finally through (J g · −) ⟨1,3⟩, so that the whole term reduces to

  −→* [660; 528].

Notice how this mimics the reverse phase of reverse-mode AD on f : ⟨x,y⟩ ↦ ((x+1)(2x+y²))² at ⟨1,3⟩ considered in Figure 4.
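The reverse phase itself is just the transposed Jacobians applied right to left; a numpy sketch (ours, under the same Jacobians as above):

```python
import numpy as np

Jg    = np.array([[1.0, 0.0], [2.0, 6.0]])   # at <1,3>
Jmult = np.array([[11.0, 2.0]])              # at <2,11>
Jpow2 = np.array([[44.0]])                   # at 22

# Pull the 1-form [1] back through the transposed Jacobians.
v = np.array([1.0])
v = Jpow2.T @ v      # [44]
v = Jmult.T @ v      # [484  88]
v = Jg.T @ v         # [660 528]
print(v)
```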

A.2 Sum Example
Consider the function that takes a list of real numbers and returns the sum of the elements of the list. We show how Section 3 tells us how to perform reverse-mode AD on such a higher-order function.

Term  Using the standard Church encoding of lists, i.e.

  List(X) ≡ (X → D → D) → (D → D)
  [x1, x2, ..., xn] ≡ λf d. f xn (... (f x2 (f x1 d)))

for some dummy type D, sum : List(R) → R can be expressed in our language described in Section 3 as λl. l (λxy.x+y) 0. Hence the derivative of sum at the list [7, −1] can be expressed as

  {ω : Ω(List(R))} ⊢ (Ω(sum)) · ω [7, −1] : R*.

Administrative Reduction  We first decompose the body of the sum : List(R) → R term, considered in Example 3.2, i.e. l (λxy.x+y) 0, via the administrative reduction described in Subsection 3.2:

  l (λxy.x+y) 0
  −→A (let z1' = l in z1') (λxy.let z2' = x+y in z2') (let z3' = 0 in z3')
  −→A (let z1 = l; z2 = λxy.(let z2' = x+y in z2'); z3 = z1 z2 in z3) (let z3' = 0 in z3')
  −→A let z1 = l; z2 = λxy.(let z2' = x+y in z2'); z3 = z1 z2; z4 = 0; z5 = z3 z4 in z5

Splitting the Omega  After the A-reductions, where l (λxy.x+y) 0 is A-reduced to a let series, we reduce (Ω (λl. l (λxy.x+y) 0)) · ω via Reductions 7 and 8:

  (Ω (λl. l (λxy.x+y) 0)) · ω
  −→A (Ω λl. let z1 = l; z2 = λxy.(let z2' = x+y in z2'); z3 = z1 z2; z4 = 0; z5 = z3 z4 in z5) · ω
  −→* (Ω λl.⟨l, l⟩) ·
        ((Ω λ⟨l, z1⟩.⟨l, z1, λxy.L⟩) ·
          ((Ω λ⟨l, z1, z2⟩.⟨l, z1, z2, z1 z2⟩) ·
            ((Ω λ⟨l, z1, z2, z3⟩.⟨l, z1, z2, z3, 0⟩) ·
              ((Ω λ⟨l, z1, z2, z3, z4⟩.z3 z4) · ω))))

Pullback Reduction  First, Figure 5 shows that (Ω [7, −1]) · ω' (λxy.L) is reduced to

  (λv. v −1 (+(⟨7,d⟩)) + (J+(⟨−1,−⟩) · (v 7 d)) (+⟨7,d⟩))* · ω'.
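The Church encoding is directly runnable; a small Python rendering (ours) of lists-as-folds and of sum:

```python
# A Church-encoded list is its own fold (our Python rendering):
# [x1,...,xn] = lambda f, d: f(xn, ... f(x2, f(x1, d)) ...)
def church(xs):
    def lst(f, d):
        acc = d
        for x in xs:
            acc = f(x, acc)
        return acc
    return lst

sum_ = lambda l: l(lambda x, y: x + y, 0.0)
print(sum_(church([7.0, -1.0])))   # 6.0
```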

  (Ω [7, −1]) · ω' (λxy.L) ≡ (Ω λf d. f −1 (f 7 d)) · ω' (λxy.L)
  −→A (Ω λf d. let z1 = f; z2 = −1; z3 = z1 z2; z4 = f; z5 = 7; z6 = z4 z5; z7 = d; z8 = z6 z7; z9 = z3 z8 in z9) · ω' (λxy.L)

Splitting this Ω along the let series (Reductions 7 and 8) into nine pullbacks along elementary terms, and reducing each pullback applied to ⟨λxy.L, d⟩ to a dual map, yields

  −→* (λv.⟨v, v⟩)* ·
        (λ⟨v, v1⟩.⟨v, v1, 0⟩)* ·
        (λ⟨v, v1, v2⟩.⟨v, v1, v2, v1 −1 + λy.(J+ · ⟨v2, 0⟩) ⟨−1, y⟩⟩)* ·
        (λ⟨v, v1, v2, v3⟩.⟨v, v1, v2, v3, v⟩)* ·
        (λ⟨v, v1, v2, v3, v4⟩.⟨v, v1, v2, v3, v4, 0⟩)* ·
        (λ⟨v, v1, v2, v3, v4, v5⟩.⟨v, v1, v2, v3, v4, v5, v4 7 + λy.(J+ · ⟨v5, 0⟩) ⟨7, y⟩⟩)* ·
        (λ⟨v, v1, v2, v3, v4, v5, v6⟩.⟨v, v1, v2, v3, v4, v5, v6, 0⟩)* ·
        (λ⟨v, v1, ..., v7⟩.⟨v, v1, ..., v7, v6 d + (J+(⟨7,−⟩) · v7) d⟩)* ·
        (λ⟨v, v1, ..., v8⟩.⟨v, v1, ..., v8, v3 (+(⟨7,d⟩)) + (J+(⟨−1,−⟩) · v8) (+(⟨7,d⟩))⟩)* · (ω' A)
  −→*A (λv. v −1 (+(⟨7,d⟩)) + (J+(⟨−1,−⟩) · (v 7 d)) (+⟨7,d⟩))* · ω'

where A ≡ ⟨λxy.L, λxy.L, −1, λy.+(⟨−1,y⟩), λxy.L, 7, λy.+(⟨7,y⟩), d, +(⟨7,d⟩)⟩.

Figure 5. Reduction of (Ω [7, −1]) · ω' (λxy.L)
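Since sum is linear, its derivative at any list is the all-ones covector; a central finite-difference check in Python (our sketch, reusing church and sum_ from above) confirms this at [7, −1]:

```python
# Finite-difference gradient check (ours) against the Church-encoded sum.
def grad_fd(f, xs, h=1e-6):
    grads = []
    for i in range(len(xs)):
        up, dn = xs[:], xs[:]
        up[i] += h
        dn[i] -= h
        grads.append((f(up) - f(dn)) / (2 * h))
    return grads

f = lambda xs: sum_(church(xs))      # church and sum_ as sketched above
print(grad_fd(f, [7.0, -1.0]))       # approximately [1.0, 1.0]
```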

Then, we reduce (Ω (sum)) · ω [7, −1] as follows.

  ((Ω λl.⟨l, l⟩) ·
    (Ω λ⟨l, z1⟩.⟨l, z1, λxy.L⟩) ·
    (Ω λ⟨l, z1, z2⟩.⟨l, z1, z2, z1 z2⟩) ·
    (Ω λ⟨l, z1, z2, z3⟩.⟨l, z1, z2, z3, 0⟩) ·
    (Ω λ⟨l, z1, z2, z3, z4⟩.z3 z4) · ω) [7, −1]
  −→* (λv.⟨v, v⟩)* ·
        (λ⟨v, v1⟩.⟨v, v1, 0⟩)* ·
        (λ⟨v, v1, v2⟩.⟨v, v1, v2, v1 (λxy.L) + v2 −1 (+(⟨7,d⟩)) + (J+(⟨−1,−⟩) · (v2 7 d)) (+⟨7,d⟩)⟩)* ·
        (λ⟨v, v1, v2, v3⟩.⟨v, v1, v2, v3, 0⟩)* ·
        (λ⟨v, v1, v2, v3, v4⟩.⟨v, v1, v2, v3, v4, v3 0 + (J+(⟨−1, +(⟨7,−⟩)⟩) · v4) 0⟩)* · (ω B)
  −→* (λv. v (λxy.L) 0)* · (ω B)

where B ≡ ⟨[7, −1], [7, −1], λxy.L, λd.+(⟨−1, +(⟨7,d⟩)⟩), 0, 6⟩.

Hence, λv. v (λxy.L) 0 is the derivative of sum ≡ λl. l (λxy.L) 0 at [7, −1]. This sequence of reductions tells us how the derivative of sum at [7, −1] can be computed using reverse-mode AD.

B Administrative Reduction
Elementary terms E, let series L, A-contexts CA and A-redexes rA are defined as follows.

  E ::= 0 | z1 + z2 | z | λx.L | z1 z2 | πi(z) | ⟨z1, z2⟩ | r | f(z) | J f · z | (λx.L)* · z | (Ω λx.L) · z | r*
  L ::= let z = E in L | let z = E in z
  CA ::= [] | CA + P | L + CA | λz.CA | CA P | L CA | πi(CA) | ⟨CA, S⟩ | ⟨L, CA⟩ | f(CA) | J f · CA | (λx.CA)* · S
         | (λx.L)* · CA | (Ω (λx.CA)) · S | (Ω (λx.L)) · CA
  rA ::= 0 | L1 + L2 | x | λz.L | L1 L2 | πi(L) | ⟨L1, L2⟩ | r | f(L) | J f · L | (λx.L1)* · L2 | (Ω λx.L1) · L2 | r*

Lemma B.1. Every pullback term P can be expressed as either CA[rA], for some unique A-context CA and A-redex rA, or a let series of elementary terms L.

An A-redex rA is reduced to a let series L as follows.

  0 −→A let x1 = 0 in x1
  L1 + L2 −→A let x1 = L1; x2 = L2; x3 = x1 + x2 in x3
  x −→A let x1 = x in x1
  λz.L −→A let x1 = λz.L in x1
  L1 L2 −→A let x1 = L1; x2 = L2; x3 = x1 x2 in x3
  πi(L) −→A let x1 = L; x2 = πi(x1) in x2
  ⟨L1, L2⟩ −→A let x1 = L1; x2 = L2; x3 = ⟨x1, x2⟩ in x3
  r −→A let x1 = r in x1
  f(L) −→A let x1 = L; x2 = f(x1) in x2
  J f · L −→A let x1 = L; x2 = J f · x1 in x2
  (λx.L1)* · L2 −→A let x1 = L2; x2 = (λx.L1)* · x1 in x2
  (Ω λx.L1) · L2 −→A let x1 = L2; x2 = (Ω (λx.L1)) · x1 in x2
  r* −→A let x1 = r* in x1

Any pullback term P which can be expressed as CA[rA] can be A-reduced to CA[L], where rA −→A L.
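Administrative reduction is essentially conversion to A-normal form, in the spirit of [28]; a compact Python sketch (ours, for a toy fragment with constants, variables, sums and applications only) shows the shape of the rules:

```python
import itertools
fresh = (f"x{i}" for i in itertools.count(1))

# Toy terms: ("const", c) | ("var", x) | ("add", t1, t2) | ("app", t1, t2).
# anf returns (bindings, atom): a let series ending in a variable.
def anf(t):
    if t[0] in ("const", "var"):
        x = next(fresh)
        return [(x, t)], ("var", x)          # e.g.  0 -->_A let x1 = 0 in x1
    binds1, a1 = anf(t[1])
    binds2, a2 = anf(t[2])
    x = next(fresh)                          # e.g.  L1 + L2 -->_A
    return binds1 + binds2 + [(x, (t[0], a1, a2))], ("var", x)

binds, body = anf(("add", ("const", 0), ("app", ("var", "f"), ("const", 7))))
for x, e in binds:
    print(f"let {x} = {e};")
print("in", body)
```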

C Interpretation

  ⟦Γ ⊢ 0 : σ⟧γ = 0
  ⟦Γ ⊢ S + P : σ⟧γ = ⟦S⟧γ + ⟦P⟧γ
  ⟦Γ ∪ {x : σ} ⊢ x : σ⟧⟨γ, z⟩ = z
  ⟦Γ ⊢ λx.S : σ1 ⇒ σ2⟧γ = cur(⟦S⟧)γ
  ⟦Γ ⊢ S P : σ2⟧γ = (⟦S⟧γ)(⟦P⟧γ)
  ⟦Γ ⊢ πi(S) : σi⟧γ = πi(⟦S⟧γ)
  ⟦Γ ⊢ ⟨S1, S2⟩ : σ1 × σ2⟧γ = ⟨⟦S1⟧γ, ⟦S2⟧γ⟩
  ⟦Γ ⊢ r : R⟧γ = r
  ⟦Γ ⊢ f(P) : Rᵐ⟧γ = f(⟦P⟧γ)
  ⟦Γ ⊢ J f · S : Rᵐ ⇒ Rᵐ⟧γ = cur(D[f])(⟦S⟧γ)
  ⟦Γ ⊢ (λx.S1)* · S2 : σ1*⟧γ = λv.⟦S2⟧γ (⟦S1⟧⟨γ, v⟩)
  ⟦Γ ⊢ (Ω λx.P) · S : Ωσ1⟧γ = λxv.⟦S⟧γ (⟦P⟧⟨γ, x⟩) (D[cur(⟦P⟧)γ]⟨v, x⟩)
  ⟦Γ ⊢ [r1, ..., rn]* : Rⁿ*⟧γ = [v1, ..., vn] ↦ Σⁿᵢ₌₁ ri vi

Translation to Differential Lambda Calculus

  0t := 0        yt := y        rt := r
  (S + P)t := St + Pt
  (λy.S)t := λy.St
  (S P)t := St Pt
  (πi(S))t := πi(St)
  (⟨S1, S2⟩)t := ⟨(S1)t, (S2)t⟩
  (f(P))t := f(Pt)
  (J f · S)t := Df · (S)t
  ([r1, ..., rn]*)t := λv.Σⁿᵢ₌₁ fi(πi(v)),  where fi := ri × (−)
  ((λy.S1)* · S2)t := λv.(S2)t ((λy.(S1)t) v)
  ((Ω λy.P) · S)t := λxv.St ((λy.Pt) x) ((D(λy.Pt) · v) x)

D Extended Differential Lambda-Calculus
Differential substitution for the extended differential λ-terms is defined as follows:

  ∂πi(s)/∂x · T ≡ πi(∂s/∂x · T)
  ∂⟨s1, s2⟩/∂x · T ≡ ⟨∂s1/∂x · T, ∂s2/∂x · T⟩
  ∂r/∂x · T ≡ 0
  ∂f(s)/∂x · T ≡ (Df · (∂s/∂x · T)) s
  ∂(Df · s)/∂x · T ≡ Df · (∂s/∂x · T)

Consider the term f(s). There are no linear occurrences of x in f. Hence, we ignore f and perform differential substitution on s directly, obtaining (Df · (∂s/∂x · T)) s.

We can interpret the extended differential λ-calculus in a differential λ-category, which is the categorical semantics of the differential λ-calculus. Hence, what is left to show is the interpretations of the extended terms:

  ⟦r⟧ = λγ.r
  ⟦f(s)⟧ = f ∘ ⟦s⟧
  ⟦Df · s⟧ = λγx.D[f]⟨⟦s⟧γ, x⟩
  ⟦πi(s)⟧ = πi ∘ ⟦s⟧
  ⟦⟨s1, s2⟩⟧ = ⟨⟦s1⟧, ⟦s2⟧⟩

E Proofs
Proposition E.1. The derivative of any constant morphism f in a differential λ-category is 0, i.e. D[f] = 0.

Proof. A constant morphism f : A → B that maps all of A to b ∈ B can be written as f = (λz.b) ∘ 0, where 0 : A → B and λz.b : B → B. So, by [CD1,2,5], we have D[f] = D[(λz.b) ∘ 0] = D[λz.b] ∘ ⟨D[0], 0 ∘ π2⟩ = D[λz.b] ∘ ⟨0, 0 ∘ π2⟩ = 0. □

Lemma E.2. Con∞ is a differential λ-category with the differential operator D[f]⟨v, x⟩ := J(f)(x)(v) = lim_{t→0} (f(x + tv) − f(x))/t.

Proof. [17, 23] have shown that Con∞ is Cartesian closed, and [8] have shown that Con∞ is a Cartesian differential category. What is left to show is that λ(−) preserves the additive structure and that D[−] satisfies the (D-curry) rule, i.e. D[λ(f)] = λ(D[f] ∘ ⟨π1 × 0, π2 × Id⟩).

We first show that λ(−) is additive, i.e. λ(f + g) = λ(f) + λ(g) and λ(0) = 0. Note that, for f, g, 0 : A × B → C, a ∈ A and b ∈ B, λ(f + g)(a)(b) = (f + g)⟨a, b⟩ = f⟨a, b⟩ + g⟨a, b⟩ = λ(f)(a)(b) + λ(g)(a)(b), and λ(0)(a)(b) = 0⟨a, b⟩ = 0 = 0(a)(b).

Now we show that D[−] satisfies the (D-curry) rule. Let f : A × B → C, v, x ∈ A and b ∈ B. Then

  D[λ(f)]⟨v, x⟩ b = lim_{t→0} (λ(f)(x + vt) b − λ(f)(x) b)/t
  = lim_{t→0} (f⟨x + vt, b⟩ − f⟨x, b⟩)/t
  = lim_{t→0} (f(⟨x, b⟩ + t⟨v, 0⟩) − f⟨x, b⟩)/t
  = D[f]⟨⟨v, 0⟩, ⟨x, b⟩⟩
  = (D[f] ∘ ⟨π1 × 0, π2 × Id⟩)⟨⟨v, x⟩, b⟩
  = λ(D[f] ∘ ⟨π1 × 0, π2 × Id⟩)⟨v, x⟩ b. □
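The limit defining D[f] in Lemma E.2 is easy to probe numerically; a Python sketch (ours) compares the difference quotient with the exact directional derivative of a smooth map:

```python
import numpy as np

# Probing D[f]<v,x> = lim_{t->0} (f(x + t v) - f(x)) / t  (our check only).
f  = lambda p: np.array([p[0] * p[1], np.sin(p[0])])
Jf = lambda p: np.array([[p[1], p[0]],
                         [np.cos(p[0]), 0.0]])   # exact Jacobian of f

x = np.array([1.0, 3.0])
v = np.array([0.5, -2.0])
for t in (1e-1, 1e-3, 1e-5):
    print(t, (f(x + t * v) - f(x)) / t)          # -> J(f)(x)(v) as t -> 0
print("exact:", Jf(x) @ v)
```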

Proposition E.3. Let E be a convenient vector space and x, y ∈ E be distinct elements of E. Then there exists a bornological linear map l : E → R that separates x and y, i.e. l(x) ≠ l(y).

Proof. This follows from the fact that a convenient vector space is separated. x ≠ y implies that x − y ≠ 0. Hence, by separation, there is a bornological linear map l : E → R such that l(x − y) ≠ 0. Since l is linear, we have l(x) − l(y) ≠ 0, which implies l(x) ≠ l(y). □

Lemma 4.4 (Linearity). Let Γ1 ∪ {x : σ1} ⊢ P1 : τ and Γ2 ⊢ P2 : σ*. Let γ1 ∈ ⟦Γ1⟧ and γ2 ∈ ⟦Γ2⟧. Then
1. if x ∈ lin(P1), then cur(⟦P1⟧)γ1 is linear, i.e. D[cur(⟦P1⟧)γ1] = (cur(⟦P1⟧)γ1) ∘ π1;
2. ⟦P2⟧γ2 is linear, i.e. D[⟦P2⟧γ2] = (⟦P2⟧γ2) ∘ π1.

Proof. Induction on the structure of P, using the following two induction hypotheses.
IH.1 If Γ1 ∪ {x : σ1} ⊢ P : τ and x ∈ lin(P), then for any γ1 ∈ ⟦Γ1⟧, cur(⟦P⟧)γ1 is linear, i.e. D[cur(⟦P⟧)γ1] = (cur(⟦P⟧)γ1) ∘ π1.
IH.2 If Γ2 ⊢ P : σ*, then for any γ2 ∈ ⟦Γ2⟧, ⟦P⟧γ2 is linear, i.e. D[⟦P⟧γ2] = (⟦P⟧γ2) ∘ π1.

(var) Say P ≡ x.
(1) If Γ1 ∪ {x : σ1} ⊢ x : σ1 and x ∈ lin(x), then D[cur(⟦x⟧)γ1] = D[Id] = π1 = Id ∘ π1 = (cur(⟦x⟧)γ1) ∘ π1.
(2) If Γ2 ⊢ x : σ*, then Γ2 = Γ3 ∪ {x : σ*}, so for any ⟨γ3, z⟩ ∈ ⟦Γ2⟧, z is linear and D[⟦x⟧⟨γ3, z⟩] = D[z] = z ∘ π1 = (⟦x⟧⟨γ3, z⟩) ∘ π1.

(dual) Say P ≡ (λx.S1)* · S2.
(1) Let Γ1 ∪ {x : σ1} ⊢ (λx.S1)* · S2 : τ and x ∈ lin((λx.S1)* · S2) := (lin(S1) \ FV(S2)) ∪ (lin(S2) \ FV(S1)). Let γ1 ∈ ⟦Γ1⟧ and write g : ⟨x, z⟩ ↦ ⟦S1⟧⟨γ1, x, z⟩. Since ⟦S2⟧⟨γ1, x⟩ is of a dual type, by IH.2,

  D[cur(⟦(λx.S1)* · S2⟧)γ1]⟨v, x⟩ = λz.( D[⟦S2⟧⟨γ1, −⟩]⟨v, x⟩ (g⟨x, z⟩) + ⟦S2⟧⟨γ1, x⟩ (D[g⟨−, z⟩]⟨v, x⟩) ).

Note that x can only be in either lin(S1) \ FV(S2) or lin(S2) \ FV(S1), but not both. Say x ∈ lin(S1) \ FV(S2). Then, by Proposition E.1, the first summand vanishes, and by IH.1,

  D[cur(⟦(λx.S1)* · S2⟧)γ1]⟨v, x⟩ = λz.⟦S2⟧⟨γ1, x⟩ (D[⟦S1⟧⟨γ1, −, z⟩]⟨v, x⟩) = λz.⟦S2⟧⟨γ1, v⟩ (⟦S1⟧⟨γ1, v, z⟩) = ⟦(λx.S1)* · S2⟧⟨γ1, v⟩ = ((cur(⟦(λx.S1)* · S2⟧)γ1) ∘ π1)⟨v, x⟩.

(2) Let Γ2 ⊢ (λx.S1)* · S2 : σ* and γ2 ∈ ⟦Γ2⟧. Then, by IH.1 and IH.2,

  D[⟦(λx.S1)* · S2⟧γ2] = D[(⟦S2⟧γ2) ∘ cur(⟦S1⟧)γ2] = D[⟦S2⟧γ2] ∘ ⟨D[cur(⟦S1⟧)γ2], (cur(⟦S1⟧)γ2) ∘ π2⟩ = (⟦S2⟧γ2) ∘ (cur(⟦S1⟧)γ2) ∘ π1 = (⟦(λx.S1)* · S2⟧γ2) ∘ π1.

All other cases are straightforward inductive proofs. □

Lemma 4.5 (Substitution). ⟦Γ ⊢ S[P/x] : τ⟧ = ⟦Γ ∪ {x : σ} ⊢ S : τ⟧ ∘ ⟨Id⟦Γ⟧, ⟦Γ ⊢ P : σ⟧⟩.

Proof. The only interesting cases are dual and pullback maps.
(dual) ((λx.S1)* · S2)[P'/y] ≡ (λx.S1[P'/y])* · S2[P'/y], and

  ⟦((λx.S1)* · S2)[P'/y]⟧γ = ⟦(λx.S1[P'/y])* · S2[P'/y]⟧γ = ⟦S2[P'/y]⟧γ ∘ cur(⟦S1[P'/y]⟧)γ = λx.⟦S2⟧⟨γ, ⟦P'⟧γ⟩ (⟦S1⟧⟨γ, ⟦P'⟧γ, x⟩)   (by IH)
  = ⟦(λx.S1)* · S2⟧⟨γ, ⟦P'⟧γ⟩.

(pb) ((Ω λx.P) · S)[P'/y] ≡ (Ω λx.P[P'/y]) · S[P'/y], and

  ⟦((Ω λx.P) · S)[P'/y]⟧γ = ⟦(Ω λx.P[P'/y]) · S[P'/y]⟧γ = λxv.⟦S⟧⟨γ, ⟦P'⟧γ⟩ (⟦P⟧⟨γ, x, ⟦P'⟧γ⟩) (D[⟦P⟧⟨γ, −, ⟦P'⟧γ⟩]⟨v, x⟩)   (by IH)
  = ⟦(Ω λx.P) · S⟧⟨γ, ⟦P'⟧γ⟩. □

Theorem 4.6 (Correctness of Reductions). Let Γ ⊢ P : σ.
1. P −→A P' implies ⟦P⟧ = ⟦P'⟧.
2. P −→ P' implies ⟦P⟧ = ⟦P'⟧.

Proof. 1. Easy induction on −→A.
2. Case analysis on the reductions of pullback terms. Let γ ∈ ⟦Γ⟧.
(1–4) ⟦(λx.S) V⟧ = ⟦S[V/x]⟧, ⟦πi(⟨V1, V2⟩)⟧ = ⟦Vi⟧, and the reductions for f(r) and J f applied to numerals are easily verified using the Substitution Lemma 4.5.

(5) Let J(f)(r) = [aij], for i = 1, ..., m and j = 1, ..., n, and r' = [r'1, ..., r'm]. Then

  ⟦(λv.J(f)(r)(v))* · r'*⟧γ = (J(f)(r))* (λv.Σᵐᵢ₌₁ r'i vi) = λv.Σᵐᵢ₌₁ Σⁿⱼ₌₁ r'i aij vj = λv.Σⁿⱼ₌₁ (Σᵐᵢ₌₁ r'i aij) vj = λv.Σⁿⱼ₌₁ ((J(f)(r))ᵀ × r')j vj = ⟦((J(f)(r))ᵀ × r')*⟧γ.

(6) Say Γ ∪ {v2 : σ2} ⊢ V2 : τ, and let Γ ∪ {v1 : σ1, v2 : σ2} ⊢ V2' : τ, where v1 is not a free variable in V2. Then

  ⟦(λv1.V1)* · ((λv2.V2)* · V3)⟧γ = ((cur(⟦V2⟧)γ) ∘ (cur(⟦V1⟧)γ))* (⟦V3⟧γ) = (v1 ↦ ⟦V2⟧⟨γ, ⟦V1⟧⟨γ, v1⟩⟩)* (⟦V3⟧γ) = (v1 ↦ ⟦V2'⟧⟨⟨γ, v1⟩, ⟦V1⟧⟨γ, v1⟩⟩)* (⟦V3⟧γ) = (v1 ↦ (⟦V2'⟧ ∘ ⟨Id, ⟦V1⟧⟩)⟨γ, v1⟩)* (⟦V3⟧γ) = (v1 ↦ ⟦V2'[V1/v2]⟧⟨γ, v1⟩)* (⟦V3⟧γ) = ⟦(λv1.V2'[V1/v2])* · V3⟧γ.

(7) Using the Substitution Lemma 4.5, ⟦(Ω (λy.let x = E in x)) · ω⟧ = ⟦(Ω (λy.E)) · ω⟧ follows immediately from ⟦λy.let x = E in x⟧ = cur(⟦let x = E in x⟧) = cur(⟦E⟧) = ⟦λy.E⟧.

(8) Consider (Ω (λy.let x = E in L)) · ω −→ (Ω (λy.⟨y, E⟩)) · ((Ω (λz.L̂)) · ω), where Γ ∪ {z : σ1 × σ2} ⊢ L̂ ≡ L[π1(z)/y][π2(z)/x] : τ. Then

  ⟦(Ω (λy.let x = E in L)) · ω⟧γ = Ω(cur(⟦let x = E in L⟧)γ)(⟦ω⟧γ) = Ω(cur(⟦L⟧ ∘ ⟨Id, ⟦E⟧⟩)γ)(⟦ω⟧γ) = Ω(s1 ↦ ⟦L⟧⟨⟨γ, s1⟩, ⟦E⟧⟨γ, s1⟩⟩)(⟦ω⟧γ) = Ω(s1 ↦ ⟦L̂⟧⟨γ, ⟨s1, ⟦E⟧⟨γ, s1⟩⟩⟩)(⟦ω⟧γ) = Ω(cur(⟦L̂⟧)γ ∘ ⟨Id⟦σ1⟧, cur(⟦E⟧)γ⟩)(⟦ω⟧γ) = Ω(⟨Id⟦σ1⟧, cur(⟦E⟧)γ⟩)(Ω(cur(⟦L̂⟧)γ)(⟦ω⟧γ)) = Ω(cur(⟦⟨y, E⟩⟧)γ)(⟦(Ω λz.L̂) · ω⟧γ) = ⟦(Ω λy.⟨y, E⟩) · ((Ω λz.L̂) · ω)⟧γ.

(9) Say y is not free in E and (Ω (λy.E)) · ω V −→ 0. Then

  ⟦((Ω λy.E) · ω) V⟧γ = (λxv.⟦ω⟧γ (⟦E⟧⟨γ, x⟩) (D[cur(⟦E⟧)γ]⟨v, x⟩))(⟦V⟧γ) = (λxv.⟦ω⟧γ (⟦E⟧⟨γ, x⟩) 0)(⟦V⟧γ) = (λxv.0)(⟦V⟧γ) = λv.0 = ⟦0⟧γ,

since cur(⟦E⟧)γ is a constant function and the derivative of any constant function is 0 by Proposition E.1.

(10) We present the proof of (10b), which leads to (10a): (Ω λy.yπi + yπj) · ω V −→ (λv.vπi + vπj)* · (ω (Vπi + Vπj)), and

  ⟦((Ω λy.yπi + yπj) · ω) V⟧γ = (λxv.⟦ω⟧γ (xπi + xπj) (vπi + vπj))(⟦V⟧γ) = λv.⟦ω⟧γ ((⟦V⟧γ)πi + (⟦V⟧γ)πj) (vπi + vπj) = λv.⟦ω (Vπi + Vπj)⟧γ (vπi + vπj) = ⟦(λv.vπi + vπj)* · (ω (Vπi + Vπj))⟧γ.

(11) (Ω λy.y) · ω −→ (λv.v)* · ω:

  ⟦((Ω λy.y) · ω) V⟧γ = (λxv.⟦ω⟧γ (⟦y⟧⟨γ, x⟩) (D[Id]⟨v, x⟩))(⟦V⟧γ) = λv.⟦ω⟧γ (⟦V⟧γ) v = λv.⟦ω V⟧γ v = ⟦(λv.v)* · (ω V)⟧γ.

(12) ((Ω λy.yπi) · ω) V −→ (λv.vπi)* · (ω Vπi):

  ⟦((Ω λy.yπi) · ω) V⟧γ = (λxv.⟦ω⟧γ (⟦yπi⟧⟨γ, x⟩) (D[πi]⟨v, x⟩))(⟦V⟧γ) = λv.⟦ω⟧γ (⟦Vπi⟧γ) vπi = ⟦(λv.vπi)* · (ω Vπi)⟧γ.

(13) We prove (13c), which leads to (13a) and (13b): ((Ω λy.⟨yπi, yπj⟩) · ω) V −→ (λv.⟨vπi, vπj⟩)* · (ω ⟨Vπi, Vπj⟩), and

  ⟦((Ω λy.⟨yπi, yπj⟩) · ω) V⟧γ = (λxv.⟦ω⟧γ ⟨xπi, xπj⟩ ⟨vπi, vπj⟩)(⟦V⟧γ) = λv.⟦ω⟧γ ⟨(⟦V⟧γ)πi, (⟦V⟧γ)πj⟩ ⟨vπi, vπj⟩ = λv.⟦ω ⟨Vπi, Vπj⟩⟧γ ⟨vπi, vπj⟩ = ⟦(λv.⟨vπi, vπj⟩)* · (ω ⟨Vπi, Vπj⟩)⟧γ.

(14) (Ω λy.J f · yπi) · ω V −→ (λv.J f · vπi)* · (ω (J f · Vπi)). By [CD3,4,5,6],

  D[λyz.D[f]⟨yπi, z⟩] = D[cur(D[f]) ∘ πi] = D[cur(D[f])] ∘ (πi × πi) = cur(D[D[f]] ∘ ⟨π1 × 0, π2 × Id⟩) ∘ (πi × πi) = cur(D[f] ∘ (π1 × Id)) ∘ (πi × πi).

Hence

  ⟦((Ω λy.J f · yπi) · ω) V⟧γ = λv.⟦ω⟧γ (λx.D[f]⟨⟦Vπi⟧γ, x⟩) ((cur(D[f] ∘ (π1 × Id)) ∘ (πi × πi))⟨v, ⟦V⟧γ⟩) = λv.⟦ω⟧γ (⟦J f · Vπi⟧γ) (λx.D[f]⟨vπi, x⟩) = ⟦(λv.J f · vπi)* · (ω (J f · Vπi))⟧γ.
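The identity in case (5), that pulling a covector back along J(f)(r) is multiplication by the transposed Jacobian, can be checked numerically; a numpy sketch (ours, with arbitrary test data):

```python
import numpy as np

# Case (5): r' composed with v |-> J v  equals  v |-> (J^T r') . v
rng = np.random.default_rng(0)
J  = rng.standard_normal((3, 4))   # stands in for J(f)(r), an m x n Jacobian
rp = rng.standard_normal(3)        # stands in for the covector r'
v  = rng.standard_normal(4)

lhs = rp @ (J @ v)                 # (lambda v. J(f)(r)(v))* . r'*  at v
rhs = (J.T @ rp) @ v               # ((J(f)(r))^T x r')*            at v
print(np.isclose(lhs, rhs))       # True
```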

(15) ((Ω λy.f(yπi)) · ω) V −→ (λv.(J f · vπi) Vπi)* · (ω (f(Vπi))):

  ⟦((Ω λy.f(yπi)) · ω) V⟧γ = (λxv.⟦ω⟧γ (f(xπi)) (D[f]⟨vπi, xπi⟩))(⟦V⟧γ) = λv.⟦ω⟧γ (f(⟦Vπi⟧γ)) (D[f]⟨vπi, ⟦Vπi⟧γ⟩) = λv.⟦ω⟧γ (f(⟦Vπi⟧γ)) (⟦(J f · vπi) Vπi⟧⟨γ, v⟩) = ⟦(λv.(J f · vπi) Vπi)* · (ω (f(Vπi)))⟧γ.

(16) We prove the most complicated case, (16c), which leads to (16a) and (16b). By IH, ⟦(Ω λy.L) · ω V⟧ = ⟦(λv.S)* · (ω V')⟧ implies that, for any 1-form ϕ, any γ and any x, v,

  ϕ (⟦L⟧⟨γ, ⟦V⟧γ, x⟩) (D[⟦L⟧⟨γ, −, x⟩]⟨v, ⟦V⟧γ⟩) = ϕ (⟦V'⟧γ) (⟦S⟧⟨γ, x, v⟩),

so by the Hahn–Banach Theorem, D[⟦L⟧⟨γ, −, x⟩]⟨v, ⟦V⟧γ⟩ = ⟦S⟧⟨γ, x, v⟩. Since Vπi is of a dual type, by Lemma 4.4(2), D[⟦Vπi⟧γ] = (⟦Vπi⟧γ) ∘ π1. Writing g : ⟨y, z⟩ ↦ ⟦L⟧⟨γ, y, z⟩, a computation with the [CD] axioms gives

  D[cur(⟦(λz.L)* · yπi⟧)γ]⟨v, ⟦V⟧γ⟩ = D[λy.λz.yπi (⟦L⟧⟨γ, y, z⟩)]⟨v, ⟦V⟧γ⟩ = λz.( vπi (g⟨⟦V⟧γ, z⟩) + (⟦Vπi⟧γ) (D[⟦L⟧⟨γ, −, z⟩]⟨v, ⟦V⟧γ⟩) ) = λz.( vπi (⟦L⟧⟨γ, ⟦V⟧γ, z⟩) + (⟦Vπi⟧γ) (⟦S⟧⟨γ, z, v⟩) ) = ⟦(λz.L[V/y])* · vπi + (λz.S)* · Vπi⟧⟨γ, v⟩.

Hence

  ⟦(Ω λy.(λz.L)* · yπi) · ω V⟧γ = λv.⟦ω⟧γ (⟦(λz.L)* · yπi⟧⟨γ, ⟦V⟧γ⟩) (D[cur(⟦(λz.L)* · yπi⟧)γ]⟨v, ⟦V⟧γ⟩) = λv.⟦ω⟧γ (⟦(λz.L[V/y])* · Vπi⟧γ) (⟦(λz.L[V/y])* · vπi + (λz.S)* · Vπi⟧⟨γ, v⟩) = ⟦(λv.(λz.S)* · Vπi + (λz.L[V/y])* · vπi)* · (ω ((λz.L[V/y])* · Vπi))⟧γ.

(17) (Ω λy.(Ω λx.L) · z) · ω V −→ (Ω λy.λa.(λv.S)* · (z L[a/x])) · ω V, if ((Ω λx.L) · z) a −→* (λv.S)* · (z L[a/x]) for a fresh variable a. By IH, ⟦((Ω λx.L) · z) a⟧ = ⟦(λv.S)* · (z L[a/x])⟧ implies that, for any ϕ, γ, y, a, v,

  ϕ (⟦L⟧⟨γ, a, y⟩) (D[⟦L⟧⟨γ, −, y⟩]⟨v, a⟩) = ϕ (⟦L⟧⟨γ, a, y⟩) (⟦S⟧⟨γ, y, a, v⟩),

so by the Hahn–Banach Theorem, D[⟦L⟧⟨γ, −, y⟩]⟨v, a⟩ = ⟦S⟧⟨γ, y, a, v⟩. Hence

  ⟦(Ω λx.L) · z⟧⟨γ, y⟩ = λav.⟦z⟧⟨γ, y⟩ (⟦L⟧⟨γ, a, y⟩) (D[⟦L⟧⟨γ, −, y⟩]⟨v, a⟩) = λav.⟦z⟧⟨γ, y⟩ (⟦L[a/x]⟧⟨γ, y⟩) (⟦S⟧⟨γ, y, a, v⟩) = λa.⟦(λv.S)* · (z L[a/x])⟧⟨γ, y, a⟩ = ⟦λa.(λv.S)* · (z L[a/x])⟧⟨γ, y⟩,

and therefore ⟦(Ω λy.(Ω λx.L) · z) · ω V⟧γ = ⟦(Ω λy.λa.(λv.S)* · (z L[a/x])) · ω V⟧γ.

(18) If (Ω λy.L) · ω V −→ (λv.S)* · (ω (L[V/y])) and x ∉ FV(V), then (Ω λy.λx.L) · ω V −→ (λv.λx.S)* · (ω (λx.L[V/y])). Recall the (D-curry) rule: D[cur(f)] = cur(D[f] ∘ ⟨π1 × 0, π2 × Id⟩). By IH, ⟦(Ω λy.L) · ω V⟧ = ⟦(λv.S)* · (ω (L[V/y]))⟧, which means that, for any 1-form ϕ, any γ and any x, v,

  ϕ (⟦L⟧⟨γ, ⟦V⟧γ, x⟩) (D[⟦L⟧⟨γ, −, x⟩]⟨v, ⟦V⟧γ⟩) = ϕ (⟦L⟧⟨γ, ⟦V⟧γ, x⟩) (⟦S⟧⟨γ, x, v⟩).

By the Hahn–Banach Theorem, D[⟦L⟧⟨γ, −, x⟩]⟨v, ⟦V⟧γ⟩ = ⟦S⟧⟨γ, x, v⟩. Now, writing f := uncur(cur(⟦L⟧)⟨γ, −⟩),

  D[cur(⟦λx.L⟧)γ]⟨v, ⟦V⟧γ⟩ = D[cur(f)]⟨v, ⟦V⟧γ⟩ = cur(D[f] ∘ ⟨π1 × 0, π2 × Id⟩)⟨v, ⟦V⟧γ⟩ = λx.D[f⟨−, x⟩]⟨v, ⟦V⟧γ⟩ = λx.D[⟦L⟧⟨γ, −, x⟩]⟨v, ⟦V⟧γ⟩ = λx.⟦S⟧⟨γ, x, v⟩.

Hence, we have

  ⟦((Ω λy.λx.L) · ω) V⟧γ = (λxv.⟦ω⟧γ (⟦λx.L⟧⟨γ, x⟩) (D[cur(⟦λx.L⟧)γ]⟨v, x⟩))(⟦V⟧γ) = λv.⟦ω⟧γ (⟦λx.L[V/y]⟧γ) (λx.⟦S⟧⟨γ, x, v⟩) = λv.⟦ω (λx.L[V/y])⟧γ (⟦λx.S⟧⟨γ, v⟩) = ⟦(λv.λx.S)* · (ω (λx.L[V/y]))⟧γ.

(19) We prove the complicated case (19c); (19a) and (19b) follow. First note that, by (D-eval) in [21],

  D[ev ∘ ⟨πi, πj⟩]⟨v, x⟩ = πi(v)(πj(x)) + D[πi(x)]⟨πj(v), πj(x)⟩.

By IH, with Vπi ≡ λz.P', we have ⟦(Ω λz.P') · ω Vπj⟧ = ⟦(λv'.S')* · (ω (P'[Vπj/z]))⟧, and hence, by the Hahn–Banach Theorem,

  D[⟦Vπi⟧γ]⟨vπj, ⟦Vπj⟧γ⟩ = D[cur(⟦P'⟧)γ]⟨v, ⟦Vπj⟧γ⟩ = ⟦S'⟧⟨γ, v⟩.

Hence

  ⟦((Ω λy.yπi yπj) · ω) V⟧γ = λv.⟦ω⟧γ (⟦yπi yπj⟧⟨γ, ⟦V⟧γ⟩) (D[ev ∘ ⟨πi, πj⟩]⟨v, ⟦V⟧γ⟩) = λv.⟦ω⟧γ (⟦Vπi Vπj⟧γ) (vπi (⟦Vπj⟧γ) + D[⟦Vπi⟧γ]⟨vπj, ⟦Vπj⟧γ⟩) = λv.⟦ω⟧γ (⟦Vπi Vπj⟧γ) (vπi (⟦Vπj⟧γ) + ⟦S'⟧⟨γ, v⟩) = ⟦(λv.vπi Vπj + S'[Vπj/v])* · (ω (Vπi Vπj))⟧γ.

(20a) Say y is a free variable in E, and (Ω (λy.⟨y, E⟩)) · ω V −→ (λv.⟨v, S⟩)* · (ω ⟨V, E[V/y]⟩), if (Ω λy.E) · ω V −→ (λv.S)* · (ω (E[V/y])). By IH, we have ⟦E⟧⟨γ, ⟦V⟧γ⟩ = ⟦E[V/y]⟧γ and D[⟦E⟧⟨γ, −⟩]⟨v, ⟦V⟧γ⟩ = ⟦S⟧⟨γ, v⟩. Now

  ⟦(Ω (λy.⟨y, E⟩)) · ω V⟧γ = λv.⟦ω⟧γ ⟨⟦V⟧γ, ⟦E⟧⟨γ, ⟦V⟧γ⟩⟩ ⟨D[Id]⟨v, ⟦V⟧γ⟩, D[⟦E⟧⟨γ, −⟩]⟨v, ⟦V⟧γ⟩⟩ = λv.⟦ω⟧γ ⟨⟦V⟧γ, ⟦E[V/y]⟧γ⟩ ⟨v, ⟦S⟧⟨γ, v⟩⟩ = λv.⟦ω ⟨V, E[V/y]⟩⟧γ (⟦λv.⟨v, S⟩⟧γ v) = ⟦(λv.⟨v, S⟩)* · (ω ⟨V, E[V/y]⟩)⟧γ.

(20b) If y ∉ FV(E), we have (Ω (λy.⟨y, E⟩)) · ω V −→ (λv.⟨v, 0⟩)* · (ω ⟨V, E⟩), and

  ⟦(Ω (λy.⟨y, E⟩)) · ω V⟧γ = λv.⟦ω⟧γ ⟨⟦V⟧γ, ⟦E⟧⟨γ, ⟦V⟧γ⟩⟩ ⟨D[Id]⟨v, ⟦V⟧γ⟩, D[⟦E⟧⟨γ, −⟩]⟨v, ⟦V⟧γ⟩⟩ = λv.⟦ω⟧γ ⟨⟦V⟧γ, ⟦E⟧γ⟩ ⟨v, 0⟩ = λv.⟦ω ⟨V, E⟩⟧γ (⟦λv.⟨v, 0⟩⟧γ v) = ⟦(λv.⟨v, 0⟩)* · (ω ⟨V, E⟩)⟧γ. □

Lemma 5.1. Let P be a term.
1. If P −→ P', then there exists a reduct s of (P')t such that Pt −→* s in the differential λ-calculus.
2. ⟦P⟧ = ⟦Pt⟧ in C.

Proof. 1. Easy induction on −→.
2. We prove by induction on P; most cases are trivial. Let γ ∈ ⟦Γ⟧.
(dual)

  ⟦(λx.S1)* · S2⟧γ = λv.⟦S2⟧γ (cur(⟦S1⟧)γ v) = λv.⟦(S2)t⟧⟨γ, v⟩ (⟦λx.(S1)t⟧⟨γ, v⟩ (⟦v⟧⟨γ, v⟩)) = λv.⟦(S2)t ((λx.(S1)t) v)⟧⟨γ, v⟩ = ⟦λv.(S2)t ((λx.(S1)t) v)⟧γ = ⟦((λx.S1)* · S2)t⟧γ.

(pb)

  ⟦(Ω (λy.P)) · S⟧γ = λxv.(⟦S⟧γ) (⟦P⟧⟨γ, x⟩) (D[cur(⟦P⟧)γ]⟨v, x⟩) = λxv.(⟦S⟧γ) (⟦P⟧⟨γ, x⟩) (D[⟦P⟧]⟨⟨0, v⟩, ⟨γ, x⟩⟩) = λxv.(⟦St⟧⟨γ, x, v⟩) (cur(⟦Pt⟧)γ (⟦x⟧⟨γ, x, v⟩)) (⟦(D(λy.Pt) · v) x⟧⟨γ, x, v⟩) = ⟦λxv.St ((λy.Pt) x) ((D(λy.Pt) · v) x)⟧γ = ⟦((Ω λy.P) · S)t⟧γ. □

Corollary 5.2 (Strong Normalization). Any reduction sequence from any term is finite, and ends in a value.

Proof. If P did not terminate, then, by Lemma 5.1(1) and the confluence of the differential λ-calculus, proved in [15], we could form a non-terminating reduction sequence in the differential λ-calculus. This contradicts the strong normalization property of the differential λ-calculus. □
