arXiv:2011.01833v1 [math.CA] 3 Nov 2020

Higher order differential analysis with vectorized derivatives

José E. Chacón∗ and Tarn Duong†

4 November 2020

Abstract

Higher order derivatives of functions are structured high dimensional objects which lend themselves to many alternative representations, with the most popular being multi-index, matrix and tensor representations. The desired choice between them depends on the context, since each presents its own advantages and disadvantages. In this paper, we highlight a vectorized representation, in which higher order derivatives are expressed as vectors. This allows us to construct an elegant and rigorous algebra of differentials for vector-valued functions of vector variables, which would be unwieldy, if not impossible, to do so for the other representations. The fundamental results that we establish are the concise existence and uniqueness theorems for the identification between differentials and derivatives of an arbitrary order. From these fundamental identifications, we develop further analytic tools, including a Leibniz rule for the product of functions, and a chain rule (Faà di Bruno's formula) for the composition of functions. To complete our exposition, we illustrate how existing tools (such as Taylor's theorem) can be incorporated into and generalized within our framework of higher order differential analysis.

Keywords: Faà di Bruno's chain rule, identification, Leibniz product rule, matrix calculus, Taylor approximation, vector-valued

∗ Departamento de Matemáticas, Universidad de Extremadura, E-06006 Badajoz, Spain. E-mail: [email protected]
† No current institutional affiliation, Paris, France. Email: [email protected]

1 Introduction

Matrix differential calculus facilitates, and is sometimes even essential for, the analysis of statistical methodologies. Classical references on the topic include Searle (1982), Basilevsky (1983) or Graybill (1983). It is arguably due to the seminal monographs of Schott (2017) and Magnus and Neudecker (2019), by illustrating applications regarding linear models, maximum likelihood and principal component analysis, that matrix analysis is firmly established in the suite of analytic tools for advanced statistical methodologies beyond the archetypal cases of, say, ordinary least squares or simple linear regression. Other important contributions have been made by Wand (2002), who illustrates the usefulness of vector differential calculus for generalized linear models, kriging, and several types of regression models, and by Chacón and Duong (2018), who make extensive use of it for the analysis of nonparametric kernel smoothers.

In this paper we focus on establishing some fundamental results which underpin these statistical applications, such as the existence and uniqueness of higher order differentials/derivatives and their induced algebra, in a single rigorous framework. These results fill in many of the gaps in the currently sparse suite of analytical tools for higher order derivatives of vector-valued functions of vector variables.

We begin our motivation by observing that the derivatives of a scalar-valued function of a vector variable f : R^d → R are intimately linked to the task of approximating such a function in a neighborhood of a given point c ∈ R^d (Magnus, 2010). The almost universally adopted definition of differentiability is as follows. The function f is said to be differentiable at c if there exists a linear function df(c; ·): R^d → R such that, for all u ∈ R^d,

f(c + u) = f(c) + df(c; u) + Re_c(u)    (1)

with the remainder satisfying Re_c(u)/‖u‖ → 0 as u → 0, where ‖u‖ = (u^⊤u)^{1/2} denotes the norm of u. See, for example, Edwards (1973, p. 67). In this case, the function df(c; ·) is known as the differential of f at c, and the so-called first identification theorem (Magnus and Neudecker, 2019, Section 5.8) shows that df(c; u) = Df(c)^⊤u, where the vector Df(c) ∈ R^d is the gradient (or first derivative) of f at c. This vector is unique, in the sense that if there is another a ∈ R^d such that df(c; u) = a^⊤u for all u ∈ R^d, then necessarily a = Df(c). Moreover, omitting the remainder term of Equation (1), f(c + u) ≈ f(c) + Df(c)^⊤u is the best linear (or first-order) approximation of f in a neighborhood of c.

Analogously, if f is differentiable at every point in some neighborhood of c, and all its partial derivatives are also differentiable at c, then f is said to be twice differentiable at c. In this case, there exists a quadratic form d²f(c; ·): R^d → R, called the second

differential of f at c, such that

f(c + u) = f(c) + df(c; u) + (1/2) d²f(c; u) + Re_c(u)    (2)

with Re_c(u)/‖u‖² → 0 as u → 0. The second identification theorem (Magnus and Neudecker, 2019, Section 6.8) states that d²f(c; u) = u^⊤Hf(c)u, where the symmetric d × d matrix Hf(c) is called the Hessian matrix (or second derivative) of f at c. In this case, uniqueness is guaranteed up to symmetrization, in the sense that if B is another matrix such that d²f(c; u) = u^⊤Bu for all u ∈ R^d, then necessarily Hf(c) = (B + B^⊤)/2. Omitting the remainder term of Equation (2), f(c + u) ≈ f(c) + Df(c)^⊤u + (1/2)u^⊤Hf(c)u is the best quadratic (or second-order) approximation of f in a neighborhood of c.

First- and second-order Taylor approximations, along with their identification theorems, are well known. Indeed, for most applications a second-order approximation suffices, and so formal generalizations of Equations (1) and (2) to higher order degrees are often sparsely treated or even omitted in standard textbooks. Nonetheless, higher order Taylor approximations are well-established research tools, and it would be erroneous to suggest that higher order derivatives and higher order approximations of functions are mere intellectual curiosities, as there are numerous applicative contexts where they are required. For instance, third- and fourth-order approximations are involved in the analysis of local quadratic regression (Ruppert and Wand, 1994, Section 3), fourth- and sixth-order derivatives appear in the expansion of the bias and variance of density curvature matrix estimators (Chacón and Duong, 2010), and more generally, expansions of distribution and density functions (e.g. Edgeworth expansions) are defined for an arbitrary order depending on the regularity of the distributional moments (Kollo and von Rosen, 2005, Section 3.2).

A widespread and elegant way to express these higher order Taylor approximations is via multi-index notation. Let m = (m_1, …, m_d) be a multi-index, that is, a vector of d non-negative integers. Denote its modulus as |m| = Σ_{i=1}^d m_i, its generalized factorial as m! = ∏_{i=1}^d m_i!, the element-wise exponentiation of x = (x_1, …, x_d) as x^m = ∏_{i=1}^d x_i^{m_i}, and its induced partial derivative of f at c as ∂^m f(c) = (∂^{|m|}/∂x_1^{m_1} ··· ∂x_d^{m_d})f(c). For r ≥ 1, the recursive definition of the r-order differentiability of f states that f is r times differentiable at c if it is (r − 1) times differentiable in a neighborhood of c and all the partial derivatives ∂^m f, with |m| = r − 1, are differentiable at c. In this case, we are able to write the rth order Taylor approximation of f as

f(c + u) = Σ_{|m|≤r} [∂^m f(c)/m!] u^m + Re_c(u)    (3)

with Re_c(u)/‖u‖^r → 0 as u → 0. See Ash, Gatto and Vági (1990) for a thorough discussion about the minimal assumptions necessary for Equation (3) to hold.

Whilst multi-indices offer a concise expression of higher order Taylor approximations, they possess several disadvantages: (i) the Taylor approximation terms, expressed as a multi-index summation, lack an algebraic representation as a vector, a matrix or any other mathematical object which constitutes a basis for an algebra of differentials, (ii) the concision is not maintained if the infinitesimal element u has the form u = Uz for a fixed vector z and an infinitesimal matrix U, and (iii) perhaps most importantly, there are no results that guarantee the uniqueness of these expressions. We assert that, on the other hand, vectorized derivatives, due to their uniqueness and algebraic properties, are indeed a feasible candidate upon which to build a differential analysis framework.

In Section 2, we exhibit the mathematical preliminaries required for our investigations into vectorized higher order derivatives. In Section 3, our main results for the identification of differentials and derivatives for scalar- and vector-valued functions are presented. In Section 4, we extend the basic identification results to rules for the derivatives of the product and the composition of functions. In Section 5, we provide concrete examples to illustrate the results from the two previous sections. In Section 6, the connections with some existing results are elaborated. We end with some concluding remarks.

2 Mathematical preliminaries

In the exploration of the broader question of the appropriate form of higher order derivatives of matrix-valued functions, Magnus (2010) reiterates compelling reasons to define the derivatives of vector-valued functions as a matrix, over alternative forms as a tensor or as a vector, according to the examples presented. Nonetheless, Pollock (1985) asserts the advantages of the tensor form and Holmquist (1996) of the vectorized form in other situations. Whilst the form of the derivative may appear to be an inconsequential theoretical detail, it turns out that the matrix/tensor form of the derivative was one of the key obstacles to solving some important applicative problems; for example, the expression of explicit formulas for moments of arbitrary order of the multivariate Gaussian distribution (Holmquist, 1988) and the analysis of general kernel smoothers (Chacón and Duong, 2018) were solved using vectorized derivatives.

Our proposed approach for the analysis of higher order derivatives combines the vectorized form of Holmquist (1996) with the differential/derivative identification espoused by Magnus and Neudecker (2019). We begin with a definition of the required notations for our framework. For a matrix A, denote the rth Kronecker power of A as

A^{⊗r} = ⊗_{i=1}^r A = A ⊗ ··· ⊗ A    (r matrices).

If A ∈ M_{m×n} (i.e., A is a matrix of order m × n) then A^{⊗r} ∈ M_{m^r×n^r}; we adopt the convention A^{⊗1} = A and A^{⊗0} = 1 ∈ R. We also adopt the convention that a vector x = (x_1, …, x_d) is a column vector. So the differential operator with respect to the free vector variable x is denoted as

D = ∂/∂x = (∂/∂x_1, …, ∂/∂x_d)^⊤

and it is a column vector like x. Let f : R^d → R be a scalar-valued function of a d-dimensional vector variable. For an arbitrary non-negative integer r, we consider the object D^{⊗r}f(x) ∈ R^{d^r} as the rth derivative of f at x. This is a vector containing all the partial derivatives of order r of f at x, arranged in a convenient layout as defined by the formal Kronecker power of D. By the 'formal Kronecker power', we mean the product of the differential operator with itself which is obtained using the common notational convention that (∂/∂x_i)(∂/∂x_j) = ∂²/(∂x_i∂x_j) = ∂²/(∂x_j∂x_i) = (∂/∂x_j)(∂/∂x_i) for all i, j. This commutativity is always guaranteed for a sufficiently regular f. Then we are able to write formally

D^{⊗r}f(x) = ∂^r f(x)/(∂x)^{⊗r}.

Hence, the rth derivative of f is represented as a vector of length d^r, and not an r-fold tensor array or a matrix. The gradient of f is D^{⊗1}f = Df so there is no change from the usual derivative here. To observe a difference, we compute explicitly the second derivative. The vectorized Hessian operator is

D^{⊗2} = ∂²/(∂x)^{⊗2} = (∂/∂x_1, …, ∂/∂x_d)^⊤ ⊗ (∂/∂x_1, …, ∂/∂x_d)^⊤
= (∂²/∂x_1², …, ∂²/∂x_1∂x_d, …, ∂²/∂x_d∂x_1, …, ∂²/∂x_d²)^⊤
= (∂²/∂x_1², …, ∂²/∂x_d∂x_1, …, ∂²/∂x_1∂x_d, …, ∂²/∂x_d²)^⊤,

whereas the usual Hessian operator is

H = ∂²/(∂x ∂x^⊤) = DD^⊤ =
[ ∂²/∂x_1²      ···  ∂²/∂x_1∂x_d ]
[      ⋮          ⋱       ⋮       ]
[ ∂²/∂x_d∂x_1   ···  ∂²/∂x_d²    ].

Therefore the Hessian Hf is such that vec Hf = D^{⊗2}f, where the vec operator transforms a matrix into a vector by stacking its columns underneath each other.

A vectorized form can also be used to express the derivatives of a vector-valued function. If f : R^d → R^p is a vector-valued function of a vector variable with components f = (f_1, …, f_p), then we formally write the rth derivative of f at x as

D^{⊗r}f(x) = (D^{⊗r}f_1(x)^⊤, …, D^{⊗r}f_p(x)^⊤)^⊤.

Thus D^{⊗r}f is a pd^r-vector, i.e., we also arrange all partial derivatives in a vector form. For r = 1, this can be compared to a more traditional, matrix layout: the usual Jacobian matrix Jf is an arrangement of the gradients of the component functions where

Jf(x) = [D^⊤f_1(x); …; D^⊤f_p(x)] ∈ M_{p×d}.

This implies that the first vectorized derivative of f, as a column vector of stacked gradient functions, satisfies Df = vec J^⊤f, echoing the relationship between the vectorized second derivative of a scalar-valued function with its Hessian matrix, D^{⊗2}f = vec H^⊤f = vec Hf. If we restrict ourselves to examining the first and second derivatives, then there is little gain with the vectorized formulation over the traditional formulation of treating the gradient as a vector/matrix and the Hessian as a matrix. At first glance this configuration of vectorized derivatives may even appear to be a counter-productive arrangement since it breaks the structure of matrix/tensor form of the derivative by rearranging them into a vector. On the other hand, vectorization ensures that we can proceed from the first to the second and to subsequent derivatives without having to change from vector to matrix to tensor. Moreover, for an r-times differentiable vector-valued function f, vectorization leads to an intuitive, iterative formula for the evaluation of an increment in the derivative order as D^{⊗r}f = D(D^{⊗(r−1)}f).
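To make this ordering concrete, the following short Python sketch (our own illustration, with an arbitrary test function and step size, not from the paper) approximates the second partial derivatives of a scalar-valued function by central finite differences and confirms numerically that the Kronecker-ordered vector D^{⊗2}f coincides with vec Hf.

```python
import numpy as np

def f(x):
    # arbitrary test function of our own choosing
    return x[0] ** 2 * x[1] + np.exp(x[0] * x[2])

def second_partial(f, x, i, j, h=1e-5):
    # central finite-difference approximation of the mixed partial d^2 f / (dx_i dx_j)
    e = np.eye(len(x))
    return (f(x + h * e[i] + h * e[j]) - f(x + h * e[i] - h * e[j])
            - f(x - h * e[i] + h * e[j]) + f(x - h * e[i] - h * e[j])) / (4 * h ** 2)

d = 3
x = np.array([0.3, -1.2, 0.7])

# Hessian matrix and the Kronecker-ordered vector of second partials
H = np.array([[second_partial(f, x, i, j) for j in range(d)] for i in range(d)])
D2f = np.array([second_partial(f, x, i, j) for i in range(d) for j in range(d)])

vecH = H.reshape(-1, order="F")      # vec stacks the columns of H
print(np.allclose(D2f, vecH))        # True: D^{(x)2}f = vec Hf, since mixed partials commute
```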

This internal consistency affords us many conceptual simplifications which facilitate important advances in higher order differential analysis, which could not be treated using the multi-index representations of higher order derivatives. The most fundamental of these is the existence and uniqueness of the identification between differentials and derivatives.

3 Identification theorems for vectorized higher order differentials

3.1 Scalar-valued functions

We begin with a scalar-valued function f : R^d → R, which we suppose to be r-times differentiable at c ∈ R^d. A common compact notation for its kth order partial derivatives is

D^k_{i_1···i_k} f(c) = ∂^k f(c)/(∂x_{i_1} ··· ∂x_{i_k})

for any k ≤ r and i_1, …, i_k ∈ {1, …, d}. Using this, the rth order differential of f at c can be expressed as a symmetric r-linear form:

d^r f(c; u) = Σ_{i_1,…,i_r=1}^d u_{i_1} ··· u_{i_r} D^r_{i_1···i_r} f(c)    (4)

for u = (u_1, …, u_d) ∈ R^d (Flett, 1980, p. 193; Schott, 2017, p. 389). This is another instance of the multi-index notation, since we can rewrite Taylor's theorem in Equation (3) as

f(c + u) = f(c) + Σ_{k=1}^r Σ_{i_1,…,i_k=1}^d u_{i_1} ··· u_{i_k} D^k_{i_1···i_k} f(c) + Re_c(u).

Nevertheless, each term of the sum in Equation (4) involves multiplying a certain rth order partial derivative by the corresponding coordinates of u, so the whole sum can be more concisely expressed using a vectorized derivative as d^r f(c; u) = D^{⊗r}f(c)^⊤u^{⊗r}.

Apart from concision, a further advantage of this vectorized representation is that it allows for more general forms for the infinitesimal element to be treated easily by applying the usual algebraic properties of the Kronecker product. For example, if u = Uz for a fixed vector z and an infinitesimal matrix U, then d^r f(c; Uz) = D^{⊗r}f(c)^⊤(Uz)^{⊗r} = D^{⊗r}f(c)^⊤U^{⊗r}z^{⊗r}. The compactness of this expression, together with the explicit separation of the infinitesimal part from the other components, cannot be achieved by a multi-index representation. The ability to isolate the infinitesimals plays a key role in the development of new differential analysis results, such as formulating the uniqueness properties of the following identification theorems.

To establish uniqueness, the symmetrizer matrix plays a crucial role. It was introduced by Holmquist (1985a) with the aim of obtaining a symmetrization of the Kronecker product. The symmetrizer matrix S_{d,r} is implicitly defined by the property that, for any choice of r vectors v_1, …, v_r ∈ R^d, it holds that r! S_{d,r}(v_1 ⊗···⊗ v_r) = Σ_{σ∈P_r}(v_{σ(1)} ⊗···⊗ v_{σ(r)}), where P_r stands for the set of all permutations of {1, …, r}. It makes the Kronecker product symmetric in the sense that S_{d,r}(v_1 ⊗···⊗ v_r) = S_{d,r}(v_{τ(1)} ⊗···⊗ v_{τ(r)}) for any permutation τ ∈ P_r. Thus, S_{d,r} is a matrix of order d^r × d^r and it has the following explicit form (see, for example, Schott, 2003)

S_{d,r} = (1/r!) Σ_{i_1,…,i_r=1}^d Σ_{σ∈P_r} ⊗_{ℓ=1}^r e_{i_ℓ} e_{i_{σ(ℓ)}}^⊤ = (1/r!) Σ_{i_1,…,i_r=1}^d (⊗_{ℓ=1}^r e_{i_ℓ}) {Σ_{σ∈P_r} ⊗_{ℓ=1}^r e_{i_{σ(ℓ)}}}^⊤,

where e_i is the ith column of the d × d identity matrix I_d. This expression reveals that the computation of S_{d,r} can be a complex and time-consuming task in practice, especially for large values of d and/or r, though we note that Chacón and Duong (2015) developed efficient recursive algorithms to alleviate this problem.

Our first main result is an identification theorem for differentials of arbitrary order with respect to the vectorized derivative, with a corresponding level of uniqueness, for a scalar-valued function of a vector variable.
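Before stating this result, we note that the explicit formula above translates directly into a short computational sketch. The following Python code is our own brute-force illustration (it is not the recursive algorithm of Chacón and Duong, 2015, and all names are our own choices):

```python
import numpy as np
from math import factorial
from functools import reduce
from itertools import permutations

def symmetrizer(d, r):
    """Build the d^r x d^r symmetrizer matrix S_{d,r} from its explicit formula."""
    I = np.eye(d)
    S = np.zeros((d ** r, d ** r))
    perms = list(permutations(range(r)))
    for idx in np.ndindex(*([d] * r)):                      # all index tuples (i_1, ..., i_r)
        left = reduce(np.kron, (I[:, [i]] for i in idx))    # e_{i_1} x ... x e_{i_r}, a d^r column
        right = sum(reduce(np.kron, (I[:, [idx[s]]] for s in sigma)) for sigma in perms)
        S += left @ right.T
    return S / factorial(r)

# sanity checks of the defining symmetrization property for r = 2
d = 3
S = symmetrizer(d, 2)
v, w = np.random.randn(d), np.random.randn(d)
print(np.allclose(S @ np.kron(v, w), S @ np.kron(w, v)))                      # True
print(np.allclose(S @ np.kron(v, w), (np.kron(v, w) + np.kron(w, v)) / 2))    # True
```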

Theorem 1 (Scalar-valued identification). Let the function f : R^d → R be r-times differentiable at c.

(i) If u ∈ R^d then the rth order differential of f at c with increment u is given by d^r f(c; u) = D^{⊗r}f(c)^⊤u^{⊗r}.

(ii) If a ∈ R^{d^r} satisfies d^r f(c; u) = a^⊤u^{⊗r} for all u ∈ R^d, then D^{⊗r}f(c) = S_{d,r}a.

Although rarely expressed with vectorized derivatives, Theorem 1(i) is already known, as is shown above. For Theorem 1(ii), it suffices to establish that a^⊤u^{⊗r} = 0 for all u ∈ R^d if and only if S_{d,r}a = 0. This result, labeled as Lemma 1, is stated and proved in Appendix A.1.

For r = 1, since S_{d,1} = I_d, Theorem 1 agrees almost exactly with the first identification theorem of Magnus and Neudecker (2019, p. 96) except that the latter identifies the first derivative as a row vector (the Jacobian matrix) rather than a column vector as we do. For r = 2, Theorem 1 agrees with the second identification theorem of Magnus and Neudecker (2019, p. 119), except that we express it as a vectorized derivative whilst Magnus and Neudecker express it as a Hessian matrix. These authors state that if A ∈ M_{d×d} satisfies d²f(c; u) = u^⊤Au then the Hessian matrix is identified as Hf(c) = (A + A^⊤)/2. Using Theorem 1, since d²f(c; u) = u^⊤Au = a^⊤u^{⊗2} for a = vec A,

then that yields D^{⊗2}f(c) = S_{d,2}a = vec[(A + A^⊤)/2], by the properties of the symmetrizer matrix (Holmquist, 1996, Example 2.1); so exactly the same conclusion is reached, because D^{⊗2}f(c) = vec Hf(c). Importantly, for r > 2 Magnus and Neudecker (2019) contains no further identification results for these higher order differentials, whereas Theorem 1 is valid for differentials of an arbitrary order.
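As a small numerical illustration of this point (our own sketch; the matrices below are arbitrary), S_{d,2} can be written as (I_{d²} + K_{d,d})/2, so any two matrices sharing the same symmetric part identify the same second derivative:

```python
import numpy as np

d = 4
vec = lambda M: M.reshape(-1, order="F")     # vec stacks columns

# S_{d,2} = (I_{d^2} + K_{d,d}) / 2, where K_{d,d} is the commutation matrix
K = np.zeros((d * d, d * d))
for i in range(d):
    for j in range(d):
        K[i * d + j, j * d + i] = 1.0
S2 = 0.5 * (np.eye(d * d) + K)

A = np.random.randn(d, d)                                  # arbitrary, non-symmetric
print(np.allclose(S2 @ vec(A), vec((A + A.T) / 2)))        # True: S_{d,2} vec A = vec[(A + A^T)/2]

# a different matrix with the same symmetric part identifies the same derivative
C = np.triu(np.random.randn(d, d), 1)
B = (A + A.T) / 2 + C - C.T
print(np.allclose(S2 @ vec(A), S2 @ vec(B)))               # True
```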

In terms of uniqueness, since S_{d,1} = I_d, it is a strict uniqueness for the first order. On the other hand, it is uniqueness-after-symmetrization (i.e. pre-multiplication by the symmetrizer matrix S_{d,r}) for the second and higher orders. Magnus and Neudecker assert that the second derivative should be identified from d²f(c; u) = u^⊤Au with its symmetrized version (A + A^⊤)/2, rather than A on its own, even though there are infinitely many matrices that are different from A but which yield the same symmetrized sum. Their reasoning is equivalent to the pre-multiplication by S_{d,2} in Theorem 1, since S_{d,2} vec A = vec[(A + A^⊤)/2]. However, for r > 2 there is no simple sequence of elementary matrix operations that can reproduce the action of the symmetrizer matrix, and the ensuing combinatorial explosion means that keeping track of which mixed partial derivatives are identical by construction, say in the multi-index representation, quickly becomes unwieldy. So the explicit exclusion of the symmetrizer matrix, based on the behavior for the first and second order, explains in part the hitherto lack of identification results for higher order differentials.

3.2 Vector-valued functions

Our next goal is to extend the identification in Theorem 1 to a vector-valued function f : R^d → R^p. Recall that when f has components (f_1, …, f_p), the rth order differential at c is a function d^r f(c; ·): R^d → R^p defined as d^r f(c; ·) = (d^r f_1(c; ·), …, d^r f_p(c; ·)). A component-wise application of Theorem 1(i) yields

d^r f(c; u) = ((u^{⊗r})^⊤ D^{⊗r}f_1(c), …, (u^{⊗r})^⊤ D^{⊗r}f_p(c))^⊤ = {I_p ⊗ (u^⊤)^{⊗r}} D^{⊗r}f(c),

where the last equality follows from reasoning as in Chacón and Duong (2018, Section 5.9). An alternative expression of the rth order differential is

d^r f(c; u) = [D^⊤(⊗r)f_1(c); …; D^⊤(⊗r)f_p(c)] u^{⊗r} = {vec^{-1}_{d^r,p} D^{⊗r}f(c)}^⊤ u^{⊗r},

where the first factor is the p × d^r matrix whose rows are D^{⊗r}f_i(c)^⊤, and vec^{-1}_{m,n} denotes the inverse of the isomorphism vec: M_{m×n} → R^{mn}. Lemma 2 in Appendix A.2 provides an explicit formula for this inverse operator, which allows us to

write further

d^r f(c; u) = {vec^{-1}_{d^r,p} D^{⊗r}f(c)}^⊤ u^{⊗r} = {I_p ⊗ D^{⊗r}f(c)^⊤}{(vec I_p) ⊗ I_{d^r}} u^{⊗r}.

Along with the previous formula, these three expressions for the rth order differential serve different purposes. The first one, {I_p ⊗ (u^⊤)^{⊗r}}D^{⊗r}f(c), is minimal in the sense that it involves the least number of elementary operations. The second one, {vec^{-1}_{d^r,p} D^{⊗r}f(c)}^⊤ u^{⊗r}, separates out the infinitesimal u^{⊗r}, and is the most easily identifiable as the generalization of the differential for a scalar function, though this requires the introduction of the inverse vec operator. The third one, {I_p ⊗ D^{⊗r}f(c)^⊤}{(vec I_p) ⊗ I_{d^r}} u^{⊗r}, is a compromise of these two where a separation of the infinitesimal is attained without vec^{-1}, but with more involved operations.

Theorem 2 (Vector-valued identification). Let the function f : R^d → R^p be r-times differentiable at c.

(i) If u ∈ R^d then the rth order differential of f at c with increment u is given by d^r f(c; u) = {I_p ⊗ (u^⊤)^{⊗r}} D^{⊗r}f(c) = {vec^{-1}_{d^r,p} D^{⊗r}f(c)}^⊤ u^{⊗r}.

(ii) If a ∈ R^{pd^r} satisfies d^r f(c; u) = (vec^{-1}_{d^r,p} a)^⊤ u^{⊗r} for all u ∈ R^d, then D^{⊗r}f(c) = (I_p ⊗ S_{d,r})a. If A ∈ M_{d^r×p} satisfies d^r f(c; u) = A^⊤u^{⊗r} for all u ∈ R^d, then D^{⊗r}f(c) = (I_p ⊗ S_{d,r}) vec A = vec(S_{d,r}A).

Theorem 2(i) is shown above. The proof of Theorem 2(ii) is deferred to Appendix A.2. Observe that A = vec^{-1}_{d^r,p} a = a for p = 1, which ensures that, for the case of a scalar-valued function, Theorem 2 reduces to Theorem 1. Furthermore, observe that S_{d,r} = I_d for r = 1, which implies that the symmetrizer matrix in effect is not involved in the identification of the first derivative, since if df(c; u) = A^⊤u for some A ∈ M_{d×p} then Df(c) = a = vec A.

Theorems 1 and 2 are useful to obtain the rth order derivative by iterating from the first differential, which may require considerable matrix algebra to isolate the r-fold Kronecker product of the infinitesimal u^{⊗r}. The following theorem provides an alternative with the identification of the rth derivative from the differential of the (r − 1)th order derivative.

Theorem 3 (Iterative identification). Let f : R^d → R^p be a function that is r-times differentiable at c, for some r > 1. Further suppose that its (r − 1)th derivative, D^{⊗(r−1)}f, has been already obtained. If B ∈ M_{d×pd^{r−1}} satisfies d{D^{⊗(r−1)}f}(c; u) = B^⊤u for all u ∈ R^d, then D^{⊗r}f(c) = vec B.

The proof is in Appendix A.3.

If we have that d^r f(c; u) = A^⊤u^{⊗r} and d{D^{⊗(r−1)}f}(c; u) = B^⊤u for all u ∈ R^d, then Theorems 2 and 3 imply that vec(S_{d,r}A) = D^{⊗r}f(c) = vec B, although A ≠ B in general since A ∈ M_{d^r×p} and B ∈ M_{d×pd^{r−1}}. Therefore, A and B must contain the same elements but in a different layout. It is the joint action of the vectorization and the symmetrizer matrix that facilitates their re-arrangement into a common form D^{⊗r}f(c). From Theorem 3, to obtain the rth derivative we are only required to compute a first order differential as a product of a matrix and a single d-vector infinitesimal u at each iteration, which can be easier to compute than the rth order differential as a product of a matrix and an r-fold Kronecker product of the d-vector infinitesimal u^{⊗r} required in Theorem 2.

3.3 Matrix-valued functions and functions of a matrix variable

We end with a discussion on our proposition for the derivative of a matrix-valued function and a function of a matrix variable. Let F: R^d → M_{p×q} be a matrix-valued function of a vector variable. Following on from our treatment of vector-valued functions, it is straightforward to apply the identification in Theorem 2 to vec F: R^d → R^{pq}, since it is a vector-valued function of a vector variable.

Thus the outstanding question is the analysis of functions of a matrix variable X. Even if it appears initially to be most intuitive to define derivatives with respect to X, e.g. as exposited in Kollo and von Rosen (2005, Section 1.4) and Magnus (2010), these same authors in their respective papers subsequently argue that this is not desirable for many reasons. Instead, they propose to also vectorize the free variable, that is to analyze vec F(X) with respect to vec X. Whilst they restrict themselves to the first and second order derivatives, in our case, we can appeal to Theorems 1 and 2 for arbitrary order derivatives. Although it is out of scope of this paper to settle definitively this difficult question of derivatives with respect to matrix variables, we highlight that our vectorizing approach offers systematic solutions to the key questions of how (i) to define the dimensions of the derivatives and (ii) to identify higher order differentials with their derivatives. Let the

(i, j)th component function of F be fij for i = 1,...,p, j = 1,...,q, and X ∈ Mc×d, then the rth derivative of vec F with respect to vec X is defined to be the vector

D^{⊗r} vec F(X) = (D^{⊗r}f_{11}(X)^⊤, …, D^{⊗r}f_{pq}(X)^⊤)^⊤ ∈ R^{pqc^r d^r},    (5)

where D^{⊗r}f_{ij}(X) = ∂^r f_{ij}(X)/(∂ vec X)^{⊗r} ∈ R^{c^r d^r} for each i, j.

Thus by enumerating the possible combinations in Equation (5), our answers to the former two questions are summarized in Table 1. This table contains the identifications for an arbitrary order r for all the combinations of a scalar f ∈ R, vector f ∈ R^p and matrix-valued F ∈ M_{p×q} function of a scalar x ∈ R, vector x ∈ R^d and matrix X ∈ M_{c×d} variable. Following the notational convention of Magnus and Neudecker (2019), we denote the infinitesimal as dx etc. in Table 1 rather than u as in the theorem statements.

Function   Differential                                               Derivative                                   Dimension
f(x)       d^r f(x) = a (dx)^r                                        D^{⊗r}f(x) = a                               R
f(x)       d^r f(x) = a (dx)^r                                        D^{⊗r}f(x) = a                               R^p
F(x)       d^r vec F(x) = a (dx)^r                                    D^{⊗r} vec F(x) = a                          R^{pq}
f(x)       d^r f(x) = a^⊤ (dx)^{⊗r}                                   D^{⊗r}f(x) = S_{d,r} a                       R^{d^r}
f(x)       d^r f(x) = (vec^{-1}_{d^r,p} a)^⊤ (dx)^{⊗r}                D^{⊗r}f(x) = (I_p ⊗ S_{d,r}) a               R^{p d^r}
F(x)       d^r vec F(x) = (vec^{-1}_{d^r,pq} a)^⊤ (dx)^{⊗r}           D^{⊗r} vec F(x) = (I_{pq} ⊗ S_{d,r}) a       R^{pq d^r}
f(X)       d^r f(X) = a^⊤ (d vec X)^{⊗r}                              D^{⊗r}f(X) = S_{cd,r} a                      R^{c^r d^r}
f(X)       d^r f(X) = (vec^{-1}_{c^r d^r,p} a)^⊤ (d vec X)^{⊗r}       D^{⊗r}f(X) = (I_p ⊗ S_{cd,r}) a              R^{p c^r d^r}
F(X)       d^r vec F(X) = (vec^{-1}_{c^r d^r,pq} a)^⊤ (d vec X)^{⊗r}  D^{⊗r} vec F(X) = (I_{pq} ⊗ S_{cd,r}) a      R^{pq c^r d^r}

Table 1: Higher order identifications, for the functions f ∈ R, f ∈ R^p, F ∈ M_{p×q}, and the variables x ∈ R, x ∈ R^d, X ∈ M_{c×d}. The first column is the function, the second is the rth order differential, the third is the rth derivative and the fourth is the dimension of the vectorized derivative.

The differentials and derivatives of the vector- and matrix-valued functions in the second and third columns in Table 1 are vectors. In contrast, whilst Kollo and von Rosen (2005) and Magnus (2010) also define the differentials as vectors, they insist that matrix-valued derivatives be identified with these vector-valued differentials. For instance, Kollo and von Rosen

(2005, p. 126) express their preference to define the (first) derivative of f = (f_1, …, f_p) as

∂f^⊤/∂ vec X = f^⊤ ⊗ ∂/∂ vec X = [∂f_1/∂ vec X  ···  ∂f_p/∂ vec X] ∈ M_{cd×p},

whereas Magnus (2010) employs the transpose of this arrangement

∂f/∂ vec^⊤ X = f ⊗ ∂/∂ vec^⊤ X = {∂f^⊤/∂ vec X}^⊤ ∈ M_{p×cd}.

Our vectorized derivative in Equation (5) is related to the Kollo and von Rosen arrangement because it is the vectorization of the former:

Df = ∂f/∂ vec X = f ⊗ ∂/∂ vec X = vec(∂f^⊤/∂ vec X) ∈ R^{pcd},

since we have a ⊗ b = vec(a^⊤ ⊗ b) for any vectors a, b. On the other hand, since Df is the vectorization of the transpose of the Magnus arrangement, it also retains the relationship with the vectorization of the transposed Jacobian for a function of a vector variable. Hence, Table 1 with r = 1 is essentially the same as Table 9.2 in Magnus and Neudecker (2019) for the first order identification. For the second order identification, since these matrix-valued derivative forms are composed of arrangements of blocks of matrices, these authors have only been able to define an identification with a second order derivative for a scalar-valued function f as ∂²f/[(∂ vec X)(∂ vec^⊤ X)] or ∂²f/[(∂ vec^⊤ X)(∂ vec X)]. Thus for the second order identification, Table 10.3 in Magnus and Neudecker (2019) contains only the rows of Table 1 with scalar f for r = 2. Whilst these authors establish further important properties for an algebra for their first derivatives, e.g. rules for the product and the composition of two functions, we demonstrate next that vectorized derivatives facilitate a systematic definition of these rules for arbitrary order derivatives.

4 Product and chain rules for vectorized higher order derivatives

We examine the product and chain rules for higher order derivatives. The rule for the multiplication by constants is the easiest to establish since the differential operator is a linear operator. The product rule for the product of two scalar-valued functions, also known as the general Leibniz rule, is well-known, though the case for the product of vector-valued functions remains largely unexamined. The chain rule for the composition of two functions is also known as Faà di Bruno's formula, see Avez (1997, Chapter 4) and Constantine and Savits (1996) for their treatment of scalar-valued functions. Again, the case for vector-valued functions remains largely unknown. With our vectorized differential analysis framework, we re-cast the existing results within our framework, and develop the hitherto unestablished results. For brevity, we provide the results for scalar- and vector-valued functions of vector variables, since the results (i) for scalar variables can be immediately inferred from the results for vector variables, and (ii) for matrix-valued functions and matrix variables, if they are vectorized beforehand, can be immediately inferred from those for vector-valued functions and vector variables. We begin with some rules for the derivative of a function multiplied by a constant coefficient (i.e. the latter does not involve the free variable). The proofs of all the results in this section, which are given in Appendix A.4, make an extensive use of Theorems 1 and 2 to identify the derivatives from the differentials but, again for brevity, we state these

results only in terms of derivatives.

Theorem 4 (Constant multiplication). (i) Let the function f : R^d → R be r-times differentiable at c. If a ∈ R^q is a constant vector, then the rth derivative of af at c is D^{⊗r}(af)(c) = a ⊗ D^{⊗r}f(c) ∈ R^{qd^r}.

(ii) Let the function f : R^d → R^p be r-times differentiable at c. If a ∈ R^q is a constant vector, then the rth derivative of a ⊗ f at c is D^{⊗r}(a ⊗ f)(c) = a ⊗ D^{⊗r}f(c) ∈ R^{pqd^r}. If A ∈ M_{q×p} is a constant matrix, then the rth derivative of Af at c is D^{⊗r}(Af)(c) = (A ⊗ I_{d^r})D^{⊗r}f(c) ∈ R^{qd^r}.

Theorem 4 verifies that D^{⊗r} demonstrates an expected behavior under constant multiplication. From this, we next move on to the derivative of the product of two functions. Whilst the product rule for higher order derivatives of the product of two scalar-valued functions is well-established as the general Leibniz rule, we establish it for the Kronecker product of two vector-valued functions.

Theorem 5 (General Leibniz rule). (i) Let the functions f, g : R^d → R be r-times differentiable at c. Then the rth derivative of f · g at c is

D^{⊗r}(f · g)(c) = S_{d,r} Σ_{j=0}^r \binom{r}{j} D^{⊗(r−j)}f(c) ⊗ D^{⊗j}g(c) ∈ R^{d^r}.

(ii) Let the functions f : Rd → Rp, g : Rd → Rq be r-times differentiable at c. Then the rth derivative of f ⊗ g at c is

D^{⊗r}(f ⊗ g)(c) = (I_{pq} ⊗ S_{d,r}) Σ_{j=0}^r \binom{r}{j} vec{vec^{-1}_{d^{r−j},p} D^{⊗(r−j)}f(c) ⊗ vec^{-1}_{d^j,q} D^{⊗j}g(c)} ∈ R^{pqd^r}.

The terms in the latter summation can be simplified by introducing commutation matrices to express the vec of a Kronecker product of matrices in terms of the Kronecker product of the vectorized matrices (Magnus and Neudecker, 2019, Theorem 3.10), leading to

vec{vec^{-1}_{d^{r−j},p} D^{⊗(r−j)}f(c) ⊗ vec^{-1}_{d^j,q} D^{⊗j}g(c)} = (I_p ⊗ K_{q,d^{r−j}} ⊗ I_{d^j}){D^{⊗(r−j)}f(c) ⊗ D^{⊗j}g(c)}.

The statement of the general Leibniz rule for scalar functions of a vector variable is usually expressed for each partial derivative singly with a multi-index notation, e.g. Constantine and Savits (1996, Lemma 2.6) or Hardy (2006), whereas Theorem 5(i) offers a concise, global expression containing all the rth order partial derivatives. Theorem 5(ii) extends the Leibniz rule to the Kronecker product of two vector-valued functions.
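A short numerical check of this vec-of-a-Kronecker-product identity (our own Python sketch, with arbitrary dimensions) may help to fix the index pattern of the commutation matrices:

```python
import numpy as np

def commutation(m, n):
    """K_{m,n}: the mn x mn matrix satisfying K_{m,n} vec(A) = vec(A^T) for A in M_{m x n}."""
    K = np.zeros((m * n, m * n))
    for i in range(m):
        for j in range(n):
            K[i * n + j, j * m + i] = 1.0
    return K

vec = lambda M: M.reshape(-1, order="F")   # vec stacks columns

m, n, p, q = 2, 3, 4, 2
A, B = np.random.randn(m, n), np.random.randn(p, q)

lhs = vec(np.kron(A, B))
rhs = np.kron(np.eye(n), np.kron(commutation(q, m), np.eye(p))) @ np.kron(vec(A), vec(B))
print(np.allclose(lhs, rhs))   # True: vec(A x B) = (I_n x K_{q,m} x I_p)(vec A x vec B)
```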

The last situation that we consider in this section concerns a formula for the higher order derivatives of the composition of two functions, where the composition is defined by (g ∘ f)(x) = g{f(x)}. From the usual chain rule for Jacobian matrices it immediately follows that, for a scalar-valued function g : R^p → R, the first derivative of the composition with f : R^d → R^p at c is D(g ∘ f)(c) = (Dg{f(c)}^⊤ ⊗ I_d) Df(c), and for a vector-valued function g : R^p → R^q the former can be generalized to D(g ∘ f)(c) = ([vec^{-1}_{p,q} Dg{f(c)}]^⊤ ⊗ I_d) Df(c).

The goal is to derive a formula for the rth derivative of g ∘ f for an arbitrary r. The computation of this derivative involves the set J_r = {m = (m_1, …, m_r) ∈ N_0^r : Σ_{ℓ=1}^r ℓ m_ℓ = r} containing all the non-negative integer solutions of 1·m_1 + 2·m_2 + ··· + r·m_r = r, which can also be expressed as J_r = ∪_{k=1}^r J_{k,r}, with J_{k,r} = {m ∈ J_r : |m| = k}. Note that Voinov and Nikulin (1997, Equation (11.18)) supply a computationally efficient algorithm

for enumerating all elements of J_r. Further, let us denote π_m = r!/∏_{ℓ=1}^r [m_ℓ!(ℓ!)^{m_ℓ}] for any m ∈ J_r.
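As an aside, the set J_r and the coefficients π_m can also be generated by a simple brute-force enumeration; the following Python sketch is our own illustration and is not the Voinov and Nikulin (1997) algorithm mentioned above:

```python
from math import factorial
from itertools import product

def J(r):
    """All m = (m_1, ..., m_r) of non-negative integers with 1*m_1 + 2*m_2 + ... + r*m_r = r."""
    ranges = [range(r // l + 1) for l in range(1, r + 1)]
    return [m for m in product(*ranges)
            if sum(l * m_l for l, m_l in enumerate(m, start=1)) == r]

def pi(m):
    """The coefficient pi_m = r! / prod_l [m_l! (l!)^m_l]."""
    r = sum(l * m_l for l, m_l in enumerate(m, start=1))
    denom = 1
    for l, m_l in enumerate(m, start=1):
        denom *= factorial(m_l) * factorial(l) ** m_l
    return factorial(r) // denom

for m in J(3):
    print(m, pi(m))
# (0, 0, 1) 1, (1, 1, 0) 3, (3, 0, 0) 1: the classical third-order Faa di Bruno coefficients
```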

Theorem 6 (Faà di Bruno's formula). Let the function f : R^d → R^p be r-times differentiable at c.

(i) Let g : R^p → R be r-times differentiable at f(c). Then the rth derivative of g ∘ f at c is

D^{⊗r}(g ∘ f)(c) = Σ_{m∈J_r} π_m (D^{⊗|m|}g{f(c)}^⊤ ⊗ S_{d,r}) ⊗_{ℓ=1}^r {D^{⊗ℓ}f(c)}^{⊗m_ℓ} ∈ R^{d^r},

where D^{⊗|m|}g{f(c)} = D^{⊗|m|}g(c′)|_{c′=f(c)} denotes the |m|th derivative of g evaluated at f(c).

(ii) Let g : R^p → R^q be r-times differentiable at f(c). Then the rth derivative of g ∘ f at c is

D^{⊗r}(g ∘ f)(c) = Σ_{m∈J_r} π_m ({vec^{-1}_{p^{|m|},q} D^{⊗|m|}g{f(c)}}^⊤ ⊗ S_{d,r}) ⊗_{ℓ=1}^r {D^{⊗ℓ}f(c)}^{⊗m_ℓ} ∈ R^{qd^r}.

Theorem 2.1 in Constantine and Savits (1996) also provides a higher order chain formula, but only for individual partial derivatives using multi-indices, whereas Theorem 6(i) offers a concise, global expression containing all the rth order partial derivatives. Moreover, Constantine and Savits noted that it is highly difficult to obtain their results for the scalar-valued composition g ∘ f, let alone for the vector-valued composition, and so did not supply the latter, whereas Theorem 6(ii) follows naturally from Theorem 6(i).

5 Calculation examples for the Gaussian density function

Higher order differential analysis is of intense interest for the special case of the infinitely differentiable Gaussian density function, due to its numerous statistical applications (see Chacón and Duong, 2015). Let φ_Σ be the Gaussian density with mean 0 and variance Σ, i.e. φ_Σ(x) = (2π)^{−d/2}|Σ|^{−1/2} exp(−x^⊤Σ^{−1}x/2). Then, Equation (2.1) in the pioneering paper of Holmquist (1996) states that

D^{⊗r}φ_Σ(x) = (−1)^r (Σ^{−1})^{⊗r} H_r(x; Σ) φ_Σ(x),    (6)

where H_r is the rth order vector Hermite polynomial, defined in Equation (3.3) in the same paper, as

H_r(x; Σ) = r! Σ_{j=0}^{⌊r/2⌋} [(−1)^j / (j!(r − 2j)!2^j)] S_{d,r} {x^{⊗(r−2j)} ⊗ (vec Σ)^{⊗j}}.

This vectorized representation of Hermite polynomials, and hence of Gaussian density derivatives, is by far the most elegant and useful representation over multi-index, matrix or tensor ones, since Holmquist (1996) provides a comprehensive analysis of these vector Hermite polynomials. This includes principally their representations, orthogonality properties, differential relations, recurrence relations, and averaging relations, that is, the equivalents to all the well-established properties of scalar Hermite polynomials. Whilst Equation (6) provides a concise expression for the derivatives of the Gaussian density, we are also able to calculate them directly from our results. Since the first two differentials are well-covered by other authors, and our approach does not differ much, we focus on the third differential for brevity. Following the notational convention, we denote the infinitesimal as dx when carrying out these calculations, rather than u in the above theorem statements.
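As a computational companion (our own sketch, not code from Holmquist's paper), the displayed formula can be evaluated directly; it reuses the symmetrizer() helper sketched in Section 2, and the test values below are arbitrary choices.

```python
import numpy as np
from math import factorial
from functools import reduce

def kron_pow(v, k):
    """k-fold Kronecker power of a vector; the empty power is the scalar 1."""
    return reduce(np.kron, [v] * k) if k > 0 else np.ones(1)

def vector_hermite(r, x, Sigma):
    """H_r(x; Sigma) from the displayed formula, using symmetrizer() from the earlier sketch."""
    d = len(x)
    S = symmetrizer(d, r)
    vecSigma = Sigma.reshape(-1, order="F")
    out = np.zeros(d ** r)
    for j in range(r // 2 + 1):
        coef = (-1) ** j * factorial(r) / (factorial(j) * factorial(r - 2 * j) * 2 ** j)
        out += coef * np.kron(kron_pow(x, r - 2 * j), kron_pow(vecSigma, j))
    return S @ out

def gaussian_derivative(r, x, Sigma):
    """D^{(x)r} phi_Sigma(x) via Equation (6)."""
    d = len(x)
    Sinv = np.linalg.inv(Sigma)
    phi = np.exp(-0.5 * x @ Sinv @ x) / np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))
    Sinv_r = reduce(np.kron, [Sinv] * r) if r > 0 else np.eye(1)
    return (-1) ** r * phi * Sinv_r @ vector_hermite(r, x, Sigma)

# check against the well-known first derivative D phi_Sigma(x) = -phi_Sigma(x) Sigma^{-1} x
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
x = np.array([0.3, -0.7])
Sinv = np.linalg.inv(Sigma)
phi = np.exp(-0.5 * x @ Sinv @ x) / np.sqrt((2 * np.pi) ** 2 * np.linalg.det(Sigma))
print(np.allclose(gaussian_derivative(1, x, Sigma), -phi * Sinv @ x))   # True
```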

5.1 Third order derivative from the third order differential

Let g(y) = (2π)^{−d/2}|Σ|^{−1/2} exp(y) and f(x) = −x^⊤Σ^{−1}x/2, so that (g ∘ f)(x) = φ_Σ(x). Then, D^{⊗r}g(y) = g(y) for all r, and therefore D^{⊗r}g{f(x)} = φ_Σ(x). On the other hand, the differentials and derivatives of f are df(x) = −(1/2)(dx^⊤Σ^{−1}x + x^⊤Σ^{−1}dx) = −x^⊤Σ^{−1}dx so that Df(x) = −Σ^{−1}x, and d²f(x) = −(dx^⊤)Σ^{−1}dx = −(vec Σ^{−1})^⊤(dx)^{⊗2} so that D^{⊗2}f(x) = −S_{d,2} vec Σ^{−1} = −vec Σ^{−1} from Theorem 1 (because Σ^{−1} is symmetric), and D^{⊗r}f(x) = 0 for all r ≥ 3. Hence, according to Theorem 6,

dφ_Σ(x) = {(Dg{f(x)}^⊤ ⊗ I_d)Df(x)}^⊤ dx = −φ_Σ(x) x^⊤Σ^{−1}dx.

16 Applying Theorem 5, its second differential is

d²φ_Σ(x) = −d{φ_Σ(x) x^⊤Σ^{−1}dx} = −{dφ_Σ(x)} x^⊤Σ^{−1}dx − φ_Σ(x)(dx^⊤)Σ^{−1}dx
= φ_Σ(x){(x^⊤Σ^{−1}dx)² − (dx^⊤)Σ^{−1}dx}
= φ_Σ(x){(Σ^{−1}x ⊗ Σ^{−1}x)^⊤ − vec^⊤ Σ^{−1}}(dx)^{⊗2},

and its third differential is

d³φ_Σ(x) = d[φ_Σ(x){(Σ^{−1}x ⊗ Σ^{−1}x)^⊤ − vec^⊤ Σ^{−1}}(dx)^{⊗2}]
= −φ_Σ(x)(x^⊤Σ^{−1}dx)[(Σ^{−1}x ⊗ Σ^{−1}x)^⊤ − vec^⊤ Σ^{−1}](dx)^{⊗2}
+ φ_Σ(x)[(Σ^{−1}dx ⊗ Σ^{−1}x)^⊤ + (Σ^{−1}x ⊗ Σ^{−1}dx)^⊤](dx)^{⊗2}.

In order to apply Theorem 1 to obtain the third derivative, we need to isolate (dx)⊗3 in the previous expression. Omitting the φΣ(x) factor, its first term is

(x⊤Σ−1dx){(Σ−1x ⊗ Σ−1x)⊤ − vec⊤ Σ−1}(dx)⊗2 = dx⊤(Σ−1x)(Σ−1x ⊗ Σ−1x − vec Σ−1)⊤(dx)⊗2 = {(dx)⊗3}⊤ vec{(Σ−1x)(Σ−1x ⊗ Σ−1x − vec Σ−1)⊤} = {(dx)⊗3}⊤{(Σ−1x)⊗3 − vec Σ−1 ⊗ Σ−1x} and its second term is

[(Σ−1dx ⊗ Σ−1x)⊤ + (Σ−1x ⊗ Σ−1dx)⊤](dx)⊗2 = dx⊤(Σ−1 ⊗ x⊤Σ−1 + x⊤Σ−1 ⊗ Σ−1)(dx)⊗2 = {(dx)⊗3}⊤ vec(Σ−1 ⊗ x⊤Σ−1 + x⊤Σ−1 ⊗ Σ−1) ⊗3 ⊤ −1 −1 −1 −1 = {(dx) } {(Id ⊗ Kdd)(vec Σ ⊗ Σ x)+ Σ x ⊗ vec Σ } using the decompositions of the vectorized matrix product vec(ABC) = (C⊤ ⊗ A) vec B, and of the vectorized Kronecker product vec(A ⊗ B) in terms of vec A ⊗ vec B with the help of the commutation matrix Kdd. We leave it to the interested reader to consult Magnus and Neudecker (2019) for these details. Therefore, the third order differential is

d³φ_Σ(x) = φ_Σ(x){vec Σ^{−1} ⊗ Σ^{−1}x + (I_d ⊗ K_{dd})(vec Σ^{−1} ⊗ Σ^{−1}x) + Σ^{−1}x ⊗ vec Σ^{−1} − (Σ^{−1}x)^{⊗3}}^⊤(dx)^{⊗3}
= φ_Σ(x)[{K_{d²,d} + (I_d ⊗ K_{dd})K_{d²,d} + I_{d³}}(Σ^{−1}x ⊗ vec Σ^{−1}) − (Σ^{−1}x)^{⊗3}]^⊤(dx)^{⊗3}.

If we apply Theorem 1(ii), the derivatives are

Dφ_Σ(x) = −φ_Σ(x)S_{d,1}Σ^{−1}x = −φ_Σ(x)Σ^{−1}x,
D^{⊗2}φ_Σ(x) = φ_Σ(x)S_{d,2}(Σ^{−1}x ⊗ Σ^{−1}x − vec Σ^{−1}) = φ_Σ(x)(Σ^{−1}x ⊗ Σ^{−1}x − vec Σ^{−1}),
D^{⊗3}φ_Σ(x) = φ_Σ(x)S_{d,3}[{K_{d²,d} + (I_d ⊗ K_{dd})K_{d²,d} + I_{d³}}(Σ^{−1}x ⊗ vec Σ^{−1}) − (Σ^{−1}x)^{⊗3}]
= −φ_Σ(x)S_{d,3}{(Σ^{−1}x)^{⊗3} − 3Σ^{−1}x ⊗ vec Σ^{−1}},

since S_{d,3}{K_{d²,d} + (I_d ⊗ K_{dd})K_{d²,d} + I_{d³}} = S_{d,3}{(I_d ⊗ K_{dd})(K_{dd} ⊗ I_d) + K_{dd} ⊗ I_d + I_{d³}} = 3S_{d,3} from a combination of Schott (2017, Theorem 8.29) and Schott (2003, Theorem 1(iv)).

5.2 Third order derivative from the differential of the second order derivative

We compare the calculations of the derivatives of a Gaussian density directly from higher order differentials above to the iterative approach in Theorem 3. From above, the second derivative is D⊗2φΣ(x)= φΣ(x)(Σ−1x ⊗ Σ−1x − vec Σ−1), and its differential is

d{D^{⊗2}φ_Σ(x)} = d{φ_Σ(x)(Σ^{−1}x ⊗ Σ^{−1}x − vec Σ^{−1})}
= {dφ_Σ(x)}(Σ^{−1}x ⊗ Σ^{−1}x − vec Σ^{−1}) + φ_Σ(x)(Σ^{−1}dx ⊗ Σ^{−1}x + Σ^{−1}x ⊗ Σ^{−1}dx)
= −φ_Σ(x)(Σ^{−1}x ⊗ Σ^{−1}x − vec Σ^{−1})(Σ^{−1}x)^⊤dx + φ_Σ(x)(Σ^{−1} ⊗ Σ^{−1}x + Σ^{−1}x ⊗ Σ^{−1})dx
= −φ_Σ(x){(Σ^{−1}x ⊗ Σ^{−1}x − vec Σ^{−1})(Σ^{−1}x)^⊤ − Σ^{−1} ⊗ Σ^{−1}x − Σ^{−1}x ⊗ Σ^{−1}}dx.

This is a d2-vector and involves only a single first order infinitesimal dx, in comparison to the third order differential d3φΣ, which is a scalar and involves the 3-fold Kronecker product of the infinitesimal (dx)⊗3. Applying Theorem 3, the third derivative is

D^{⊗3}φ_Σ(x) = −φ_Σ(x) vec[{(Σ^{−1}x ⊗ Σ^{−1}x − vec Σ^{−1})(Σ^{−1}x)^⊤ − Σ^{−1} ⊗ Σ^{−1}x − Σ^{−1}x ⊗ Σ^{−1}}^⊤]
= −φ_Σ(x){(Σ^{−1}x)^{⊗3} − vec Σ^{−1} ⊗ Σ^{−1}x − (I_d ⊗ K_{dd})(vec Σ^{−1} ⊗ Σ^{−1}x) − Σ^{−1}x ⊗ vec Σ^{−1}}.

This derivative is, of course, identical to that obtained in the previous section since S_{d,3}(Σ^{−1}x)^{⊗3} = (Σ^{−1}x)^{⊗3} and also 3S_{d,3}(Σ^{−1}x ⊗ vec Σ^{−1}) = vec Σ^{−1} ⊗ Σ^{−1}x + (I_d ⊗ K_{dd})(vec Σ^{−1} ⊗ Σ^{−1}x) + Σ^{−1}x ⊗ vec Σ^{−1} from the explicit form of S_{d,3} given in Holmquist (1996, Example 2.1), after noting a minor typographical error in the latter (the subscript d², d in the fourth term should read d, d²).

5.3 Third and higher order derivatives from Faà di Bruno's formula

In another calculation, we can compute the third order differential of the Gaussian density directly from Faà di Bruno's formula. Writing φ_Σ(x) = (g ∘ f)(x) as in Section 5.1 and taking into account the derivatives of f and g obtained there, from Theorem 6 we therefore have

d³φ_Σ(x) = d³(g ∘ f)(x)
= [(D^{⊗3}g{f(x)}^⊤ ⊗ I_{d³}){Df(x)}^{⊗3} + 3(D^{⊗2}g{f(x)}^⊤ ⊗ I_{d³}){Df(x) ⊗ D^{⊗2}f(x)}
+ (Dg{f(x)}^⊤ ⊗ I_{d³})D^{⊗3}f(x)]^⊤(dx)^{⊗3}
= −φ_Σ(x){(Σ^{−1}x)^{⊗3} − 3Σ^{−1}x ⊗ vec Σ^{−1}}^⊤(dx)^{⊗3},

and hence the third derivative is D^{⊗3}φ_Σ(x) = −φ_Σ(x)S_{d,3}{(Σ^{−1}x)^{⊗3} − 3Σ^{−1}x ⊗ vec Σ^{−1}}, which is again identical to the previous calculations.

It is even possible to derive a simple proof for the rth derivative of the Gaussian density from Faà di Bruno's formula. Start with

D^{⊗r}φ_Σ(x) = Σ_{m∈J_r} π_m D^{⊗|m|}g{f(x)} S_{d,r} ⊗_{ℓ=1}^r {D^{⊗ℓ}f(x)}^{⊗m_ℓ}    (7)

and recall that m = (m_1, …, m_r) ∈ J_r are the non-negative solutions to the linear Diophantine equation 1·m_1 + 2·m_2 + ··· + r·m_r = r. Since D^{⊗r}f(x) = 0 for all r ≥ 3, the terms in (7) will be identically zero whenever m_ℓ > 0 for some ℓ ≥ 3. So it suffices to consider m ∈ J_r with m_3 = ··· = m_r = 0, which simplifies the Diophantine equation to m_1 + 2m_2 = r. Since m_1 ≥ 0 and m_2 ∈ N_0, the former equation implies that m_2 ≤ ⌊r/2⌋, so all its solutions are given by m_1 = r − 2j, m_2 = j for j = 0, 1, …, ⌊r/2⌋. Then the coefficient π_m = r!/∏_{ℓ=1}^r{m_ℓ!(ℓ!)^{m_ℓ}} has the simpler form π_m = r!/{(r − 2j)!j!2^j}. Combining these with D^{⊗r}g{f(x)} = φ_Σ(x) for all r, Df(x) = −Σ^{−1}x, and D^{⊗2}f(x) = −vec Σ^{−1}, we have

D^{⊗r}φ_Σ(x) = φ_Σ(x) Σ_{j=0}^{⌊r/2⌋} [r!/((r − 2j)!j!2^j)] S_{d,r}[{Df(x)}^{⊗(r−2j)} ⊗ {D^{⊗2}f(x)}^{⊗j}]
= φ_Σ(x) Σ_{j=0}^{⌊r/2⌋} [r!/((r − 2j)!j!2^j)] S_{d,r}[(−Σ^{−1}x)^{⊗(r−2j)} ⊗ (−vec Σ^{−1})^{⊗j}]
= (−1)^r (Σ^{−1})^{⊗r} φ_Σ(x) Σ_{j=0}^{⌊r/2⌋} [(−1)^j r!/((r − 2j)!j!2^j)] S_{d,r}{x^{⊗(r−2j)} ⊗ (vec Σ)^{⊗j}}.

The summation is identical to the vector Hermite polynomial H_r(x; Σ) introduced previously. This derivation is an alternative to the one based on a formal Taylor expansion provided by Holmquist (1996).

6 Connections to existing results

In Sections 3 and 4, we laid the foundations for a rigorous framework for a differential calculus for vector-valued functions of vector variables. In this section, we continue to elaborate it by contextualizing existing results within this framework.

6.1 Taylor’s theorem with vectorized derivatives

We return to our motivating example of Taylor approximations in the Introduction. Whilst Taylor polynomials are well-known for scalar functions of a vector variable, via their characterization with multi-indices in Equation (3), we observe that Theorem 2 allows for their characterization with vectorized derivatives. If f : R^d → R is a function such that every element in D^{⊗j}f(x), 0 ≤ j ≤ r, is piecewise continuous, then Theorem 3.11.10 from Baxandall and Liebeck (1986) states that the rth order Taylor polynomial approximation is given by

f(x + u) = Σ_{j=0}^r (1/j!)(u^⊤D)^j f(x) + Re_c(u),

where u ∈ R^d and Re_c(u)/‖u‖^r → 0 as u → 0. This form is not amenable for our purposes since it combines the infinitesimal u with the action of the differential operator D. Using the identity (a^⊤b)^j = (a^⊤b)^{⊗j} = (a^⊤)^{⊗j}b^{⊗j} for vectors a, b of the same length, we can extricate the role of D from u to obtain an alternative expansion

f(x + u) = Σ_{j=0}^r (1/j!)(u^⊤)^{⊗j} D^{⊗j}f(x) + Re_c(u).    (8)
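A tiny Python sketch (our own; the test function, point and increment are arbitrary choices) illustrates Equation (8) for f(x) = exp(a^⊤x), for which D^{⊗j}f(x) = a^{⊗j} exp(a^⊤x):

```python
import numpy as np
from math import factorial
from functools import reduce

def kron_pow(v, j):
    return reduce(np.kron, [v] * j) if j > 0 else np.ones(1)

a = np.array([0.5, -1.0, 0.25])
x = np.array([0.2, 0.1, -0.3])
u = np.array([0.05, -0.02, 0.04])

exact = np.exp(a @ (x + u))
for r in range(5):
    # partial sum of Equation (8): sum_j (1/j!) (u^T)^{(x)j} D^{(x)j} f(x)
    approx = sum(kron_pow(u, j) @ kron_pow(a, j) * np.exp(a @ x) / factorial(j)
                 for j in range(r + 1))
    print(r, abs(exact - approx))      # the error shrinks as the order r grows
```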

Equation (8) is a stepping stone to the development of Taylor polynomials for a vector- valued function.

Theorem 7 (Vector-valued Taylor approximation). Let x and x + u be distinct points in an open subset Ω ⊆ Rd such that the straight line segment joining x and x + u lies in Ω. Let f : Rd → Rp be a vector-valued function that is r times continuously differentiable on Ω. The rth order Taylor polynomial approximation of f is given by

f(x + u) = Σ_{j=0}^r (1/j!){I_p ⊗ (u^⊤)^{⊗j}} D^{⊗j}f(x) + Re_c(u),

where Re_c(u) is such that Re_c(u)/‖u‖^r → 0 as u → 0.

The proof of Theorem 7 was demonstrated in Chac´on and Duong (2018, Theorem 5.8).

For the special case where the vector-valued function f can be expressed as the sth derivative of a scalar-valued function g, i.e. g : R^d → R, f : R^d → R^{d^s} and f = D^{⊗s}g, then we have

f(x + u) = Σ_{j=0}^r (1/j!)[I_{d^s} ⊗ (u^⊤)^{⊗j}]D^{⊗(s+j)}g(x) + Re_c(u)
= Σ_{j=0}^r (1/j!)[I_d^{⊗s} ⊗ (u^⊤)^{⊗j}]S_{d,s+j}D^{⊗(s+j)}g(x) + Re_c(u).

The action of the symmetrizer matrix in the matrix product in the summand can therefore be also interpreted as permuting the order of each of the s identity matrices I_d and the j infinitesimals u, rather than on D^{⊗(s+j)}g(x). The latter are usually less easily expressed as an (s + j)-fold Kronecker product.

6.2 General approximation of the identity with vectorized derivatives

We analyze further the role of Taylor expansions in the approximation of the identity. Let a kernel K : R^d → R be an integrable scalar-valued function with unit integral, and f : R^d → R be another integrable function. The convolution of K and f is defined by K ∗ f(x) = ∫_{R^d} K(x − y)f(y) dy, and it inherits the differentiability properties of K (Wheeden and Zygmund, 1977, Theorem 9.3). Therefore K ∗ f can be interpreted as a kernel-smoothed version of f. Moreover, if a rescaled version K_h(x) = K(x/h)/h^d is considered, with a smoothing parameter h > 0, then K_h ∗ f(x) → f(x) in various senses as h → 0 (Wheeden and Zygmund, 1977, Section 9.2). When this convergence holds, then the family of functions {K_h}_{h>0} is known as an approximation of the identity, or a mollifier. These convergence properties of K_h ∗ f rely on Taylor expansions where h is an infinitesimal element.

The rescaling K_h is commonly referred to as a spherical rescaling, since it applies the same scaling factor for all coordinate directions x_1, …, x_d. An elliptical rescaling, K_h(x) =

K(x_1/h_1, …, x_d/h_d)/(h_1 ··· h_d), where h = (h_1, …, h_d) is a vector of possibly different positive scaling factors, allows for a different rescaling for each coordinate direction. But the most general rescaling is obtained using K_H(x) = |H|^{−1/2}K(H^{−1/2}x), where H is a symmetric positive-definite matrix (i.e., H > 0) and H^{−1/2} is such that H^{−1/2}H^{−1/2} = H^{−1}. This general form, which subsumes the spherical and elliptical rescalings as special cases, additionally allows an arbitrary rotation before the elliptical rescaling, so the rescaling is no longer restricted to follow the coordinate directions. In the context where f is a multivariate probability density function, Wand and Jones (1993) showed that approximations using this unconstrained scaling can lead to substantial gains in accuracy in statistical estimation.

The proof of the convergence of the spherical mollifiers K_h ∗ f carries over with minor adjustments to the general, unconstrained case of K_H ∗ f, so that K_H ∗ f(x) → f(x) as vec H → 0 in the same senses as previously. However quantifying the rate of convergence of K_H ∗ f(x) − f(x) to zero requires a more involved analysis of the general approximation of the identity {K_H}_{H>0}. We begin with

K_H ∗ f(x) − f(x) = ∫_{R^d} {f(x − y) − f(x)}K_H(y) dy = ∫_{R^d} {f(x − H^{1/2}z) − f(x)}K(z) dz.

If f is 2k times differentiable at x, then we can use a Taylor expansion with vectorized derivatives to approximate f(x − H^{1/2}z) − f(x) ≃ Σ_{j=1}^{2k} (−1)^j (1/j!) D^{⊗j}f(x)^⊤ (H^{1/2})^{⊗j} z^{⊗j}. Furthermore, if K is a kernel of order 2k, meaning that µ_j(K) = ∫_{R^d} z^{⊗j}K(z) dz = 0 for j = 1, …, 2k − 1 and µ_{2k}(K) = ∫_{R^d} z^{⊗2k}K(z) dz ≠ 0, then it follows that

K_H ∗ f(x) − f(x) ≃ (1/(2k)!) D^{⊗2k}f(x)^⊤ (H^{1/2})^{⊗2k} µ_{2k}(K).    (9)

Equation (9) shows that vectorized derivatives in the Taylor polynomial allow for the separation of a matrix-valued infinitesimal H1/2 and the free variable z, so that KH ∗ f(x)−f(x) can be separated into a vectorized derivative of f, an infinitesimal element, and a vectorized moment of K. So it is straightforward to assert, for a general approximation of the identity {KH}H>0 with a (2k)th-order kernel, that KH ∗ f(x) − f(x) converges to zero at the same rate as (H1/2)⊗2k.

6.3 Individual partial derivatives within vectorized derivatives

Our proposed derivative consists of a systematic ordering of all the possible higher order partial derivatives as a single vectorized derivative. This is a basic property in building our proposed algebra of differentials. Nonetheless, there are situations where explicit knowledge of the location of certain mixed partial derivatives D^r_{i_1···i_r}f is important, e.g., (i) to diagonalize the derivative, which involves the extraction of elements on the main diagonal ∂^r/∂x_i^r, i = 1, …, d; (ii) to obtain the Laplacian △ = Σ_{i=1}^d ∂²/∂x_i²; (iii) to express a multivariate density function f in terms of its distribution function F as f = ∂^d F/(∂x_1 ··· ∂x_d). Whilst this is trivial for multi-index or matrix or tensor representations of higher order derivatives, for vectorized representations it requires a separate procedure.

Given the indices i_1, …, i_r ∈ {1, …, d}, the problem is to locate the position p = p(i_1, …, i_r) ∈ {1, …, d^r} where the partial derivative D^r_{i_1···i_r} lies within the vector D^{⊗r}, i.e., such that D^r_{i_1···i_r} = (D^{⊗r})_p. But starting from D_i = ∂/∂x_i it is clear that D^r_{i_1···i_r} = (e_{i_1} ⊗···⊗ e_{i_r})^⊤ D^{⊗r}, so the problem reduces to locating the only nonzero element of

⊗_{ℓ=1}^r e_{i_ℓ} and, reasoning as in Lemma 1.3.1 in Kollo and von Rosen (2005), it follows that such an element is at position p = p(i_1, …, i_r) = 1 + Σ_{j=1}^r (i_j − 1)d^{r−j}. Hence, locating an individual partial derivative within the derivative vector is trivial. Moreover, since our regularity conditions ensure that D^r_{i_1···i_r} = D^r_{i_{σ(1)}···i_{σ(r)}} for any permutation σ ∈ P_r, then in fact any map p_σ(i_1, …, i_r) = p(i_{σ(1)}, …, i_{σ(r)}) = 1 + Σ_{j=1}^r (i_{σ(j)} − 1)d^{r−j} is also valid to locate D^r_{i_1···i_r} within D^{⊗r}. For instance, Chacón and Duong (2015) used p_σ for the permutation σ such that σ^{−1}(j) = r − j + 1. The map p: {1, …, d}^r → {1, …, d^r}, defined between these two sets of the same cardinality, can be shown to be a bijection. Hence, its inverse function p^{−1} is also useful to find out the multi-index form of a partial derivative located at a given coordinate of D^{⊗r}, as detailed in Appendix 2 in Chacón and Duong (2015). Intuitively, p^{−1} is in effect a change of base of an integer ∈ {1, …, d^r} from base-10 to base-d, though using the numerals drawn from {1, …, d} instead of the usual {0, …, d − 1}.
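In Python, the map p and its inverse can be written in a few lines (our own sketch, using the same 1-based index convention as above):

```python
def position(indices, d):
    """p(i_1, ..., i_r) = 1 + sum_j (i_j - 1) d^(r - j)."""
    r = len(indices)
    return 1 + sum((i - 1) * d ** (r - j) for j, i in enumerate(indices, start=1))

def indices_from_position(p, d, r):
    """Inverse map: the multi-index (i_1, ..., i_r) stored at coordinate p of D^{(x)r}."""
    q = p - 1
    out = []
    for j in range(r):
        out.append(q // d ** (r - 1 - j) + 1)
        q %= d ** (r - 1 - j)
    return tuple(out)

d, r = 3, 2
for i1 in range(1, d + 1):
    for i2 in range(1, d + 1):
        p = position((i1, i2), d)
        assert indices_from_position(p, d, r) == (i1, i2)
print("all positions consistent")
```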

6.4 Vectorized higher order moments and cumulants

Let X be a d-variate random vector. For a multi-index i_1, …, i_r ∈ {1, …, d}, it is common to refer to the expected value E(X_{i_1} ··· X_{i_r}) as a mixed moment of order r. There are many of these individual real-valued mixed moments, and they are all contained in the rth order vectorized moment µ_r = E(X^{⊗r}) ∈ R^{d^r}. As is the case for the individual mixed partial derivatives, even if particular mixed moments can be of interest in some situations, the whole vectorized moment is needed for the expansion of the characteristic or the moment-generating function of X.

Indeed, the moments of a random variable and derivatives are closely related since the former can be obtained via the derivative of the moment-generating function at zero. Hence, the arrangement of the rth order moment is immediately inherited from the corresponding layout of the rth order derivative. In our case, if we compute the expectations and derivatives from first principles, then we can demonstrate in Lemma 3 that

µ_r = E(X^{⊗r}) = D^{⊗r}M_X(t)|_{t=0},    (10)

where M_X(t) = E{exp(t^⊤X)} denotes the moment-generating function of X. This approach via vectorized derivatives was the only successful tool to find the moments of arbitrary order of a multivariate normal vector (Holmquist, 1988), after several authors previously focused only on finding moments of certain particular orders.

Kollo and von Rosen (2005, p. 173) entertained the possibility of defining the rth moment as E(X^{⊗r}), though they eventually argued against this configuration because "it is complicated to show where the moments of interest are situated in the vector". With the map p introduced in Section 6.3, carrying out this localization no longer involves

any difficult procedures. Moreover, these authors define a derivative that leads to the rth moment of X as E{X(X^⊤)^{⊗(r−1)}} ∈ M_{d×d^{r−1}} (Kollo and von Rosen, 2005, Theorem 2.1.1). This indeed exhibits the nice feature that the covariance is a d × d matrix (analogously to the Hessian matrix), but it does not represent a conceptual advantage over µ_r = E(X^{⊗r}) = vec(E{X(X^⊤)^{⊗(r−1)}}) in terms of higher order moments or the location of individual mixed moments, so we maintain our preference for the vectorized form µ_r.

The cumulants of the random vector X provide an alternative to moments, which is particularly useful for Edgeworth and related expansions of distributions (Kollo and von Rosen, 2005, Section 3.2). The cumulant-generating function of X is given by C_X(t) = log M_X(t) for t ∈ R^d and, by analogy with the moments, the rth order vectorized cumulant is defined as κ_r = D^{⊗r}C_X(t)|_{t=0}. Cumulants and moments are closely connected and it is useful to express the former in terms of the latter and vice versa (see Holmquist, 1985b). These relationships are usually difficult to describe; however, they follow easily from the higher order chain rule in Theorem 6. By expressing M_X(t) = (g ∘ C_X)(t) with g(y) = exp(y), it readily follows from Theorem 6 that

µ_r = Σ_{m∈J_r} π_m S_{d,r} ⊗_{ℓ=1}^r κ_ℓ^{⊗m_ℓ},

which corresponds to Theorem 4.1(i) in Holmquist (1985b). Reciprocally, by writing C_X(t) = (g ∘ M_X)(t) with g(y) = log(y), Theorem 6 immediately gives

κ_r = Σ_{m∈J_r} π_m (−1)^{|m|−1}(|m| − 1)! S_{d,r} ⊗_{ℓ=1}^r µ_ℓ^{⊗m_ℓ},

which agrees with Theorem 4.1(iii) in Holmquist (1985b).
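For completeness, a Python sketch (our own) that assembles µ_r from κ_1, …, κ_r via the displayed formula; it reuses the symmetrizer(), J() and pi() helpers sketched earlier, and the example mean and covariance are arbitrary.

```python
import numpy as np
from functools import reduce

def moment_from_cumulants(r, kappas, d):
    """mu_r from the vectorized cumulants kappas[0] = kappa_1, ..., kappas[r-1] = kappa_r."""
    S = symmetrizer(d, r)
    mu = np.zeros(d ** r)
    for m in J(r):
        factors = []
        for l, m_l in enumerate(m, start=1):
            factors.extend([kappas[l - 1]] * m_l)            # kappa_l repeated m_l times
        mu += pi(m) * (S @ reduce(np.kron, factors))
    return mu

# check for r = 2: E(X (x) X) = kappa_1 (x) kappa_1 + vec(Cov X)
d = 3
mean = np.array([0.5, -1.0, 2.0])
Cov = np.array([[2.0, 0.3, 0.0], [0.3, 1.0, -0.2], [0.0, -0.2, 0.5]])
mu2 = moment_from_cumulants(2, [mean, Cov.reshape(-1, order="F")], d)
print(np.allclose(mu2, np.kron(mean, mean) + Cov.reshape(-1, order="F")))   # True
```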

6.5 Unique vectorized moments and partial derivatives

If f : R^d → R is r times differentiable then D^{⊗r}f is a vector of length d^r, but not all of its entries are distinct. The same occurs for the vectorized moment µ_r = E(X^{⊗r}), which includes many redundant mixed moments. This is not a problem for theoretical developments; on the contrary, having a neat configuration has proved to be beneficial (and even essential) for building an algebra of differentials and unveiling general results that are valid for any arbitrary order r. However, for computational purposes, it can be more efficient to first calculate only the distinct elements and, if required, then redistribute them to form the full derivative vector of length d^r.

The number of distinct partial derivatives is the same as the number of distinct mixed moments, and they equal the number of monomials of degree r in d variables, i.e. \binom{d+r−1}{r}; see Feller (1968, Section II.5). As d increases, the proportion of distinct elements in either D^{⊗r}f or µ_r approaches 1/r!, so the time savings from computing only the distinct elements of these vectors can be considerable, even for moderate values of r. The vector containing all the distinct rth order mixed moments is called the rth order minimal moment in Kollo and von Rosen (2005, Section 2.1.6), and Theorem 2.1.10 in this reference shows how to obtain µ_r from this minimal representation. The distinct partial derivatives of a function are called the unique partial derivatives in Chacón and Duong (2015, Section 5), where an efficient recursive algorithm to compute these unique partial derivatives (and the subsequent entire derivative vector) of the multivariate Gaussian density is exhibited.
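A quick count (our own sketch) illustrates the savings: the number of distinct entries is \binom{d+r−1}{r}, against a full length of d^r, with the ratio approaching 1/r! as d grows.

```python
from math import comb, factorial

for d, r in [(2, 4), (100, 3), (100, 4), (1000, 3)]:
    distinct, full = comb(d + r - 1, r), d ** r
    print(f"d={d}, r={r}: {distinct} distinct of {full} entries "
          f"(ratio {distinct / full:.4f}, 1/r! = {1 / factorial(r):.4f})")
```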

7 Conclusion

We have introduced a rigorous, comprehensive framework for the differential analysis of vector-valued functions of vector variables. The foundations of this analytic framework are the existence and uniqueness of the identifications between the differentials and the derivatives of any order; these existence and uniqueness properties have hitherto resisted a sufficiently complete characterization. The identifications in turn facilitate the construction of an algebra of differentials/derivatives that is an intuitive generalization of the one which exists for scalar-valued functions of scalar/vector variables. We established two fundamental rules of this algebra in order to compute higher order derivatives: (i) a Leibniz rule for the product of two functions and (ii) Faà di Bruno's rule for the composition of two functions. In addition to these foundational results, we demonstrated that well-known analytic results (such as Taylor's theorem, Hermite polynomials, and the relationship between moments and cumulants) can be re-cast within this framework, often with a considerable simplification of their development, and crucially with the ability to be generalized in an intuitive manner to any dimension and to any derivative order.

Acknowledgements. The authors are grateful to Professor Björn Holmquist for kindly sharing a copy of his unpublished research report. J. E. Chacón has been partially supported by the Spanish Ministerio de Ciencia e Innovación grant PID2019-109387GB-I00 and the Junta de Extremadura grant GR18016.

A Appendix: Supporting lemmas and proofs

A.1 Scalar-valued identification

Lemma 1. Let $a\in\mathbb{R}^{d^r}$. Then $a^\top x^{\otimes r} = 0$ for all $x\in\mathbb{R}^d$ if and only if $S_{d,r}a = 0$.

Proof. If $S_{d,r}a = 0$ then, using the properties of the symmetrizer matrix (Schott, 2003),
\[
a^\top x^{\otimes r} = a^\top S_{d,r} x^{\otimes r} = a^\top S_{d,r}^\top x^{\otimes r} = (S_{d,r}a)^\top x^{\otimes r} = 0.
\]

To show the reverse implication, assume that $a^\top x^{\otimes r} = 0$ for all $x\in\mathbb{R}^d$ and note, from the explicit representation of $S_{d,r}$, that it suffices to show that for any choice of $i_1,\dots,i_r\in\{1,\dots,d\}$ we have

\[
\sum_{\sigma\in P_r} a^\top \bigl(e_{i_{\sigma(1)}}\otimes\cdots\otimes e_{i_{\sigma(r)}}\bigr) = 0. \tag{11}
\]

In order to prove Equation (11), we introduce some notation. For a vector j = (j1,...,jk) of indices, let |{j}| denote the number of its distinct coordinates; that is, the cardinality of the set {j1,...,jk}. Given k different indices i1,...,ik ∈ {1,...,d} with k ≤ r, let

Ir,p = Ir,p(i1,...,ik) be the set of r-dimensional vectors of indices in {i1,...,ik} having exactly p different coordinates (p ≤ k); that is,

\[
I_{r,p} = I_{r,p}(i_1,\dots,i_k) = \bigl\{\, j = (j_1,\dots,j_r) \in \{i_1,\dots,i_k\}^r : |\{j\}| = p \,\bigr\}.
\]
First we claim that, from the fact that $a^\top x^{\otimes r} = 0$ for all $x\in\mathbb{R}^d$, it follows that

\[
\sum_{j\in I_{r,k}} a^\top \bigl(e_{j_1}\otimes\cdots\otimes e_{j_r}\bigr) = 0. \tag{12}
\]

To assert Equation (12), we proceed by induction on $k$. For $k = 1$, this statement affirms that for any $i\in\{1,\dots,d\}$ we have $a^\top e_i^{\otimes r} = 0$, which is trivially true by taking $x = e_i$ in the hypothesis. So assume by induction that the result is true for any set of $k-1$ different indices; we will demonstrate that Equation (12) holds for any $k\le r$ different indices $i_1,\dots,i_k\in\{1,\dots,d\}$. Notice that $\{i_1,\dots,i_k\}^r = \bigcup_{p=1}^{k} I_{r,p}$ with $I_{r,p}\cap I_{r,q} = \emptyset$ for $p\neq q$, so that taking $x = e_{i_1} + \cdots + e_{i_k}$ it follows that

\[
0 = a^\top (e_{i_1}+\cdots+e_{i_k})^{\otimes r}
= \sum_{j\in\{i_1,\dots,i_k\}^r} a^\top \bigl(e_{j_1}\otimes\cdots\otimes e_{j_r}\bigr)
= \sum_{p=1}^{k}\,\sum_{j\in I_{r,p}} a^\top \bigl(e_{j_1}\otimes\cdots\otimes e_{j_r}\bigr).
\]

By the induction hypothesis the right-hand side reduces to $\sum_{j\in I_{r,k}} a^\top (e_{j_1}\otimes\cdots\otimes e_{j_r})$, so this yields Equation (12).

Finally, we use Equation (12) to assert Equation (11). Consider any indices i1,...,ir ∈

{1,...,d} and denote |{i1,...,ir}| = k, with k ≤ r. If all the indices are different, then k = r and, moreover,

$\{(i_{\sigma(1)},\dots,i_{\sigma(r)}) : \sigma\in P_r\} = I_{r,r}(i_1,\dots,i_r)$, so in this case Equation (11) follows directly from Equation (12). When $k < r$, it is sufficient to demonstrate that the sum on the left hand side of Equation (11) is proportional to the sum on the left hand side of Equation (12). More precisely, denote

{i1,...,ir} = {ı1,...,ık}, with ı1,...,ık ∈ {1,...,d}, to represent the distinct coordinates of (i1,...,ir). For any ℓ ∈ {1,...,k}, write rℓ for the number of times that ıℓ appears in

(i1,...,ir), so that 1 ≤ rℓ ≤ r for all ℓ = 1,...,k and r1 + · · · + rk = r. Then,

\[
\sum_{\sigma\in P_r} \bigl(e_{i_{\sigma(1)}}\otimes\cdots\otimes e_{i_{\sigma(r)}}\bigr)
= r_1!\cdots r_k! \sum_{j\in I_{r,k}(\imath_1,\dots,\imath_k)} \bigl(e_{j_1}\otimes\cdots\otimes e_{j_r}\bigr).
\]
The former equation can be explicitly shown in the same way as in the combinatorial proof that $r!/(r_1!\cdots r_k!)$ is the number of permutations with repetition of the elements of

{ı1,...,ık} with the element ıℓ repeated rℓ times, ℓ = 1,...,k (see Brualdi, 2010, Theorem

2.4.2). The collection of all $r!$ permutations of $(i_1,\dots,i_r)$ is, in fact, a multiset (it contains repeated elements), whose distinct elements are precisely the aforementioned permutations with repetition, so that each of these appears exactly $r_1!\cdots r_k!$ times among the permutations of $(i_1,\dots,i_r)$. This establishes the proportionality of the summations on the left hand sides of Equations (11) and (12).
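Although the symmetrizer matrix $S_{d,r}$ is treated here as a purely theoretical device, its defining property is easy to check numerically. The following Python sketch (our own illustration, building $S_{d,r}$ by brute force as the average of all Kronecker-factor permutation matrices) verifies that $S_{d,r}x^{\otimes r} = x^{\otimes r}$, and hence that $a^\top x^{\otimes r} = (S_{d,r}a)^\top x^{\otimes r}$, which is the identity underlying the first part of the proof:

    import itertools
    from math import factorial
    import numpy as np

    def factor_permutation_matrix(d, r, sigma):
        # Maps x_1 (x) ... (x) x_r to x_{sigma(1)} (x) ... (x) x_{sigma(r)},
        # where (x) denotes the Kronecker product.
        n = d ** r
        P = np.zeros((n, n))
        for idx in itertools.product(range(d), repeat=r):
            col = np.ravel_multi_index(idx, (d,) * r)
            row = np.ravel_multi_index(tuple(idx[s] for s in sigma), (d,) * r)
            P[row, col] = 1.0
        return P

    def symmetrizer(d, r):
        # S_{d,r}: average of the r! Kronecker-factor permutation matrices.
        perms = itertools.permutations(range(r))
        return sum(factor_permutation_matrix(d, r, s) for s in perms) / factorial(r)

    rng = np.random.default_rng(0)
    d, r = 3, 3
    S = symmetrizer(d, r)
    x = rng.standard_normal(d)
    a = rng.standard_normal(d ** r)
    x_r = x
    for _ in range(r - 1):
        x_r = np.kron(x_r, x)                     # x^{(x) r}
    print(np.allclose(S @ x_r, x_r))              # symmetrizer leaves x^{(x) r} unchanged
    print(np.isclose(a @ x_r, (S @ a) @ x_r))     # a' x^{(x) r} = (S a)' x^{(x) r}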

A.2 Vector-valued identification

Lemma 2 (Inverse vector operator). The inverse of the isomorphism $\operatorname{vec}\colon M_{m\times n}\to\mathbb{R}^{mn}$ is given by $\operatorname{vec}_{m,n}^{-1}(x) = \{(\operatorname{vec} I_n)^\top\otimes I_m\}(I_n\otimes x)$ for any $x\in\mathbb{R}^{mn}$.

Proof. This result elaborates on the entry 'What is the inverse of the vec operator?' of the webpage math.stackexchange.com, which contains a slightly incomplete (and different) proof.

We begin by showing that $\operatorname{vec}_{m,n}^{-1}(\operatorname{vec} A) = A$ for any $A\in M_{m\times n}$. Let $e_i\in\mathbb{R}^n$ be the $i$th column of $I_n$. Then $a_i = Ae_i$ is the $i$th column of $A$ and we can write $A = \sum_{i=1}^{n} a_i e_i^\top$ and $I_n = \sum_{i=1}^{n} e_i e_i^\top$. By making use of the usual properties of the vec operator and the Kronecker product we have
\[
\operatorname{vec}_{m,n}^{-1}(\operatorname{vec} A)
= \{(\operatorname{vec} I_n)^\top\otimes I_m\}(I_n\otimes\operatorname{vec} A)
= \sum_{i=1}^{n} (e_i^\top\otimes e_i^\top\otimes I_m)(I_n\otimes\operatorname{vec} A)
= \sum_{i=1}^{n} (e_i^\top I_n)\otimes\{(e_i^\top\otimes I_m)\operatorname{vec} A\}
= \sum_{i=1}^{n} e_i^\top\otimes(Ae_i)
= \sum_{i=1}^{n} a_i e_i^\top = A.
\]

On the other hand, we need to show that $\operatorname{vec}\{\operatorname{vec}_{m,n}^{-1}(x)\} = x$ for any $x\in\mathbb{R}^{mn}$. Using Theorem 3.10 in Magnus and Neudecker (2019),

\[
\operatorname{vec}(I_n\otimes x) = \bigl[\{(I_n\otimes K_{1,n})\operatorname{vec} I_n\}\otimes I_{mn}\bigr]x = \{(\operatorname{vec} I_n)\otimes I_{mn}\}x,
\]
since the commutation matrix satisfies $K_{1,n} = I_n$. Therefore,

\[
\operatorname{vec}\{\operatorname{vec}_{m,n}^{-1}(x)\}
= \operatorname{vec}\bigl[\{(\operatorname{vec} I_n)^\top\otimes I_m\}(I_n\otimes x)\bigr]
= \{I_n\otimes(\operatorname{vec} I_n)^\top\otimes I_m\}\operatorname{vec}(I_n\otimes x)
\]
\[
= \{I_n\otimes(\operatorname{vec} I_n)^\top\otimes I_m\}\{(\operatorname{vec} I_n)\otimes I_{mn}\}x
= \bigl(\bigl[\{I_n\otimes(\operatorname{vec} I_n)^\top\}\{(\operatorname{vec} I_n)\otimes I_n\}\bigr]\otimes I_m\bigr)x.
\]

To finish the proof, it suffices to establish the identity

\[
\{I_n\otimes(\operatorname{vec} I_n)^\top\}\{(\operatorname{vec} I_n)\otimes I_n\} = I_n. \tag{13}
\]

This can be shown by writing $I_n = \sum_{i=1}^{n} e_i e_i^\top$ as above, so that
\[
\{I_n\otimes(\operatorname{vec} I_n)^\top\}\{(\operatorname{vec} I_n)\otimes I_n\}
= \sum_{i,j=1}^{n} (I_n\otimes e_i^\top\otimes e_i^\top)(e_j\otimes e_j\otimes I_n)
= \sum_{i,j=1}^{n} e_j\otimes(e_i^\top e_j)\otimes e_i^\top
= \sum_{i=1}^{n} e_i\otimes e_i^\top = I_n,
\]
thus yielding Equation (13).
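The formula in Lemma 2 can also be verified numerically. A minimal Python sketch of ours (it uses column-major reshaping to implement the column-stacking vec operator):

    import numpy as np

    def vec(A):
        # Column-stacking vectorization.
        return A.reshape(-1, order="F")

    def vec_inv(x, m, n):
        # Inverse vec operator via Lemma 2: {(vec I_n)' kron I_m}(I_n kron x).
        left = np.kron(vec(np.eye(n))[None, :], np.eye(m))
        return left @ np.kron(np.eye(n), x[:, None])

    rng = np.random.default_rng(1)
    m, n = 3, 4
    A = rng.standard_normal((m, n))
    x = rng.standard_normal(m * n)
    print(np.allclose(vec_inv(vec(A), m, n), A))   # vec^{-1}(vec A) = A
    print(np.allclose(vec(vec_inv(x, m, n)), x))   # vec{vec^{-1}(x)} = x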

Proof of Theorem 2. (i) The text preceding the theorem statement establishes this.

(ii) Let $a_i$ be the $i$th column of $A\in M_{d^r\times p}$, $i = 1,\dots,p$. Since $D^{\otimes r}f_i(c)^\top u^{\otimes r} = d^r f_i(c;u) = a_i^\top u^{\otimes r}$ for all $u\in\mathbb{R}^d$, by a component-wise application of Theorem 1(ii) we have $D^{\otimes r}f_i(c) = S_{d,r}a_i$. That is,

\[
D^{\otimes r}f(c) = \begin{pmatrix} S_{d,r}a_1 \\ \vdots \\ S_{d,r}a_p \end{pmatrix}
= (I_p\otimes S_{d,r})\begin{pmatrix} a_1 \\ \vdots \\ a_p \end{pmatrix}
= (I_p\otimes S_{d,r})\operatorname{vec} A.
\]
The second part of (ii) follows immediately if we set $A = \operatorname{vec}_{d^r,p}^{-1}\, a$, since $a = \operatorname{vec} A$.

A.3 Iterative identification

Proof of Theorem 3. First, consider a real-valued function $f\colon\mathbb{R}^d\to\mathbb{R}$. Denote $g = D^{\otimes(r-1)}f$. Since $g\colon\mathbb{R}^d\to\mathbb{R}^{d^{r-1}}$ is a vector-valued function with $Dg = D^{\otimes r}f$, applying Theorem 2(ii) to the first differential of $g$ we have that, if $B\in M_{d\times d^{r-1}}$ satisfies $dg(c;u) = B^\top u$ for all $u\in\mathbb{R}^d$, then it must be that $D^{\otimes r}f(c) = Dg(c) = \operatorname{vec} B$.

For a vector-valued $f\colon\mathbb{R}^d\to\mathbb{R}^p$, with components $f = (f_1,\dots,f_p)$, suppose that $B\in M_{d\times pd^{r-1}}$ satisfies $d\{D^{\otimes(r-1)}f\}(c;u) = B^\top u$ for all $u\in\mathbb{R}^d$. If we write $B^\top$ as $p$ stacked block matrices $B_1^\top,\dots,B_p^\top\in M_{d^{r-1}\times d}$, then the previous assumption entails that

\[
\begin{pmatrix} d\{D^{\otimes(r-1)}f_1\}(c;u) \\ \vdots \\ d\{D^{\otimes(r-1)}f_p\}(c;u) \end{pmatrix}
= d\{D^{\otimes(r-1)}f\}(c;u) = B^\top u
= \begin{pmatrix} B_1^\top \\ \vdots \\ B_p^\top \end{pmatrix} u
= \begin{pmatrix} B_1^\top u \\ \vdots \\ B_p^\top u \end{pmatrix},
\]
so that $d\{D^{\otimes(r-1)}f_i\}(c;u) = B_i^\top u$ for all $u\in\mathbb{R}^d$, for $i = 1,\dots,p$. By the above argument for scalar-valued functions, this implies that $D^{\otimes r}f_i(c) = \operatorname{vec} B_i$ for $i = 1,\dots,p$; that is,

\[
D^{\otimes r}f(c) = \begin{pmatrix} D^{\otimes r}f_1(c) \\ \vdots \\ D^{\otimes r}f_p(c) \end{pmatrix}
= \begin{pmatrix} \operatorname{vec} B_1 \\ \vdots \\ \operatorname{vec} B_p \end{pmatrix}
= \operatorname{vec}\begin{pmatrix} B_1 & \cdots & B_p \end{pmatrix} = \operatorname{vec} B.
\]

A.4 Product and chain rules

Proof of Theorem 4. (i) Since $d^r f(c;u) = D^{\otimes r}f(c)^\top u^{\otimes r}$ for all $u\in\mathbb{R}^d$, it immediately follows that $d^r(af)(c;u) = a\,d^r f(c;u) = a\,D^{\otimes r}f(c)^\top u^{\otimes r}$ for all $u\in\mathbb{R}^d$. So from Theorem 2 its derivative is $D^{\otimes r}(af)(c) = \operatorname{vec}\{S_{d,r}D^{\otimes r}f(c)\,a^\top\} = \operatorname{vec}\{D^{\otimes r}f(c)\,a^\top\} = a\otimes D^{\otimes r}f(c)$.

(ii) Let us write $B = \operatorname{vec}_{d^r,p}^{-1} D^{\otimes r}f(c)\in M_{d^r\times p}$, so that $d^r f(c;u) = B^\top u^{\otimes r}$ for all $u\in\mathbb{R}^d$, which by Theorem 2 implies $D^{\otimes r}f(c) = \operatorname{vec}(S_{d,r}B)$. For a vector $a\in\mathbb{R}^q$, it is easy to check that $d^r(a\otimes f)(c;u) = a\otimes d^r f(c;u) = a\otimes(B^\top u^{\otimes r}) = (a\otimes B^\top)u^{\otimes r}$ for all $u\in\mathbb{R}^d$. Therefore, $D^{\otimes r}(a\otimes f)(c) = \operatorname{vec}\{S_{d,r}(a^\top\otimes B)\} = \operatorname{vec}\{a^\top\otimes(S_{d,r}B)\} = a\otimes\operatorname{vec}(S_{d,r}B) = a\otimes D^{\otimes r}f(c)$.

For a matrix $A\in M_{q\times p}$, it can be shown that $d^r(Af)(c;u) = A\,d^r f(c;u) = AB^\top u^{\otimes r}$ for all $u\in\mathbb{R}^d$. Therefore, $D^{\otimes r}(Af)(c) = \operatorname{vec}(S_{d,r}BA^\top) = (A\otimes I_{d^r})\operatorname{vec}(S_{d,r}B) = (A\otimes I_{d^r})D^{\otimes r}f(c)$.

Proof of Theorem 5. (i) Note that for p = q = 1 the Kronecker product coincides with the usual product. Hence, part (i) immediately follows from part (ii). (ii) The first goal is to show that the differential of the Kronecker product satisfies

\[
d^r(f\otimes g)(c;u) = \sum_{j=0}^{r}\binom{r}{j}\, d^{r-j}f(c;u)\otimes d^{j}g(c;u), \tag{14}
\]
where it is understood that $d^0 f(c;u) = f(c)$ and $d^0 g(c;u) = g(c)$. The proof of this fact follows closely the commonly exhibited reasoning for binomial expansions: it is true for $r = 1$ since the usual Kronecker product rule for the first differential yields

29 d(f ⊗ g)(c; u)= df(c; u) ⊗ g(c)+ f(c) ⊗ dg(c; u) (Magnus and Neudecker, 2019, p. 164). Then, Equation (14) follows by induction on r. From Equation (14) and Theorem 2(i) we obtain

\[
d^r(f\otimes g)(c;u) = \sum_{j=0}^{r}\binom{r}{j}\bigl[\{\operatorname{vec}_{d^{r-j},p}^{-1} D^{\otimes(r-j)}f(c)\}^\top u^{\otimes(r-j)}\otimes\{\operatorname{vec}_{d^{j},q}^{-1} D^{\otimes j}g(c)\}^\top u^{\otimes j}\bigr]
\]
\[
= \sum_{j=0}^{r}\binom{r}{j}\bigl[\{\operatorname{vec}_{d^{r-j},p}^{-1} D^{\otimes(r-j)}f(c)\}\otimes\{\operatorname{vec}_{d^{j},q}^{-1} D^{\otimes j}g(c)\}\bigr]^\top u^{\otimes r}.
\]

The desired formula then follows from Theorem 2(ii).
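The key algebraic step in the last display is the mixed-product property $(A^\top u)\otimes(B^\top v) = (A\otimes B)^\top(u\otimes v)$ of the Kronecker product; a quick numerical confirmation (our own sketch) is:

    import numpy as np

    rng = np.random.default_rng(3)
    A = rng.standard_normal((5, 2))
    B = rng.standard_normal((4, 3))
    u = rng.standard_normal(5)
    v = rng.standard_normal(4)

    lhs = np.kron(A.T @ u, B.T @ v)
    rhs = np.kron(A, B).T @ np.kron(u, v)
    print(np.allclose(lhs, rhs))   # (A'u) kron (B'v) = (A kron B)'(u kron v)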

Proof of Theorem 6. (i) Reasoning as in Spindler (2005), it can be shown that $D^{\otimes r}(g\circ f)(c)$ depends only on the vectors $D^{\otimes k}g\{f(c)\}$ and $D^{\otimes k}f(c)$ for $k = 1,\dots,r$, so that it suffices to show the theorem statement for any two functions $\tilde f$ and $\tilde g$ that share these derivatives with $f$ and $g$, respectively.

Without loss of generality we let $c = 0$ and $f(c) = 0$, and we consider $\tilde f(x) = \sum_{\ell=1}^{r}\{I_p\otimes(x^\top)^{\otimes\ell}\}v_\ell$ and $\tilde g(y) = \sum_{k=1}^{r} w_k^\top y^{\otimes k}$, for the given vectors $v_\ell = D^{\otimes\ell}f(0)/\ell!\in\mathbb{R}^{pd^\ell}$ and $w_k = D^{\otimes k}g(0)/k!\in\mathbb{R}^{p^k}$. Then, by Corollary 3.1 in Holmquist (1985b), the $k$th derivative of $\tilde f$ at $x = 0$ is $D^{\otimes k}\tilde f(0) = k!\,(I_p\otimes S_{d,k})v_k = D^{\otimes k}f(0)$, where the last equality is due to the fact that $S_{d,k}D^{\otimes k}f_i(0) = D^{\otimes k}f_i(0)$ for $i = 1,\dots,p$. Similarly, by Theorem 3.1 in Holmquist (1985b) we have $D^{\otimes k}\tilde g(0) = k!\,S_{p,k}w_k = D^{\otimes k}g(0)$ for $k = 1,\dots,r$.

If we could express $\tilde g\circ\tilde f$ as an $r$th order polynomial $(\tilde g\circ\tilde f)(x) = \sum_{\ell=1}^{r}(x^\top)^{\otimes\ell}b_\ell$ for vectors $b_\ell\in\mathbb{R}^{d^\ell}$, then we would have $D^{\otimes r}(\tilde g\circ\tilde f)(0) = r!\,S_{d,r}b_r$. So the proof would be complete if we could show that we can take the $r$th coefficient as

\[
b_r = \frac{1}{r!}\sum_{k=1}^{r}\sum_{m\in J_{k,r}}\pi_m\, k!\,(w_k^\top\otimes I_{d^r})\bigotimes_{\ell=1}^{r}(\ell!\,v_\ell)^{\otimes m_\ell}
= \sum_{k=1}^{r}\sum_{m\in J_{k,r}}\frac{k!}{m_1!\cdots m_r!}\,(w_k^\top\otimes I_{d^r})\bigotimes_{\ell=1}^{r} v_\ell^{\otimes m_\ell}\ \in\mathbb{R}^{d^r}. \tag{15}
\]

In order to show Equation (15), denote by $Q_{k,r}$ the set of partitions of $k$ into $r$ parts; that is, $Q_{k,r} = \{m\in\mathbb{N}_0^r : |m| = k\}$. Define the symmetrizer matrix in dimension $p$ with respect to some $m\in Q_{k,r}$ as the only matrix $S_{p,m}\in M_{p^k\times p^k}$ such that, for any $x_1,\dots,x_r\in\mathbb{R}^p$, the product $\frac{k!}{m_1!\cdots m_r!}\,S_{p,m}\bigotimes_{\ell=1}^{r} x_\ell^{\otimes m_\ell}$ equals the sum over all distinct terms of type $\bigotimes_{\ell=1}^{k} y_\ell$ for which $m_1$ of the factors are equal to $x_1$, $m_2$ of the factors are equal to $x_2$, etc. Then, using the multinomial expansion in Equation (2.6) of Holmquist

30 (1985b), we obtain

\[
(\tilde g\circ\tilde f)(x) = \sum_{k=1}^{r} w_k^\top\{\tilde f(x)\}^{\otimes k}
= \sum_{k=1}^{r} w_k^\top\Bigl[\sum_{\ell=1}^{r}\{I_p\otimes(x^\top)^{\otimes\ell}\}v_\ell\Bigr]^{\otimes k}
\]
\[
= \sum_{k=1}^{r}\sum_{m\in Q_{k,r}}\frac{k!}{m_1!\cdots m_r!}\, w_k^\top S_{p,m}\bigotimes_{\ell=1}^{r}\bigl[\{I_p\otimes(x^\top)^{\otimes\ell}\}v_\ell\bigr]^{\otimes m_\ell}
\]
\[
= \sum_{k=1}^{r}\sum_{m\in Q_{k,r}}\frac{k!}{m_1!\cdots m_r!}\, w_k^\top\bigl\{I_{p^k}\otimes(x^\top)^{\otimes\sum_{\ell=1}^{r}\ell m_\ell}\bigr\}\bigotimes_{\ell=1}^{r} v_\ell^{\otimes m_\ell}
\]
\[
= \sum_{k=1}^{r}\sum_{m\in Q_{k,r}}\frac{k!}{m_1!\cdots m_r!}\,(x^\top)^{\otimes\sum_{\ell=1}^{r}\ell m_\ell}\bigl(w_k^\top\otimes I_{d^{\sum_{\ell=1}^{r}\ell m_\ell}}\bigr)\bigotimes_{\ell=1}^{r} v_\ell^{\otimes m_\ell},
\]
where we used that $S_{p,m}w_k = w_k$ by the definition of $w_k$. This last equation shows that $\tilde g\circ\tilde f$ is an $r$th order polynomial in $x$, where the vector that multiplies $(x^\top)^{\otimes r}$ is precisely $b_r$ in Equation (15), as desired.

(ii) If $g$ has components $(g_1,\dots,g_q)$, then $g\circ f$ has components $(g_1\circ f,\dots,g_q\circ f)$, so the vector $D^{\otimes r}(g\circ f)(c)\in\mathbb{R}^{qd^r}$ is formed by stacking $D^{\otimes r}(g_1\circ f)(c),\dots,D^{\otimes r}(g_q\circ f)(c)$. By part (i),

\[
D^{\otimes r}(g_j\circ f)(c) = \sum_{m\in J_r}\pi_m\bigl[D^{\otimes|m|}g_j\{f(c)\}^\top\otimes S_{d,r}\bigr]\bigotimes_{\ell=1}^{r}\{D^{\otimes\ell}f(c)\}^{\otimes m_\ell}\ \in\mathbb{R}^{d^r}
\]
for all $j = 1,\dots,q$, so

\[
D^{\otimes r}(g\circ f)(c) = \sum_{m\in J_r}\pi_m\left[\begin{pmatrix} D^{\otimes|m|}g_1\{f(c)\}^\top \\ \vdots \\ D^{\otimes|m|}g_q\{f(c)\}^\top \end{pmatrix}\otimes S_{d,r}\right]\bigotimes_{\ell=1}^{r}\{D^{\otimes\ell}f(c)\}^{\otimes m_\ell}.
\]
The proof is completed by noting that

\[
\begin{pmatrix} D^{\otimes|m|}g_1\{f(c)\}^\top \\ \vdots \\ D^{\otimes|m|}g_q\{f(c)\}^\top \end{pmatrix}
= \bigl(D^{\otimes|m|}g_1\{f(c)\}\ \cdots\ D^{\otimes|m|}g_q\{f(c)\}\bigr)^\top
= \bigl\{\operatorname{vec}_{p^{|m|},q}^{-1} D^{\otimes|m|}g\{f(c)\}\bigr\}^\top.
\]

A.5 Vectorized moments

Lemma 3. Suppose that $\operatorname{E}(|X_{i_1}\cdots X_{i_r}|) < \infty$ for any choice of $i_1,\dots,i_r\in\{1,\dots,d\}$. Then $D^{\otimes r}M_X(t) = \operatorname{E}\{\exp(t^\top X)X^{\otimes r}\}$.

Proof. The condition on the absolute mixed moments implies that we can change the order of differentiation and expectation (see Billingsley, 2012, Section 26). So it suffices to find the derivative of $\alpha(t) = \exp(t^\top x) = (g\circ f)(t)$, where $g(y) = \exp(y)$ and $f(t) = t^\top x$. We have $D^{\otimes r}g(y) = g(y)$ and $D^{\otimes r}g\{f(t)\} = \alpha(t)$ for all $r$. Besides, $Df(t) = x$ and $D^{\otimes r}f(t) = 0$ for all $r\ge 2$. Reasoning as for the Hermite polynomial calculation, the only required multi-index, which is a solution of $1\cdot m_1 + \cdots + r\cdot m_r = r$, is $m_1 = r$, $m_2 =$

$\cdots = m_r = 0$. Since the $\pi$ coefficient is $\pi_m = r!/r! = 1$, Theorem 6(i) implies that $D^{\otimes r}\alpha(t) = \alpha(t)\,\pi_m S_{d,r}\{Df(t)\}^{\otimes m_1} = \exp(t^\top x)\,S_{d,r}x^{\otimes r} = \exp(t^\top x)\,x^{\otimes r}$, as required.
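As a final numerical illustration (a Monte Carlo sketch of ours, assuming a bivariate normal $X$ for which $\mu_2 = \operatorname{vec}\{\operatorname{Var}(X) + \operatorname{E}(X)\operatorname{E}(X)^\top\}$ is known in closed form), Equation (10) and Lemma 3 can be checked for $r = 2$ as follows:

    import numpy as np

    rng = np.random.default_rng(2)
    mu = np.array([1.0, -0.5])
    Sigma = np.array([[2.0, 0.3], [0.3, 1.0]])

    # Monte Carlo estimate of mu_2 = E(X kron X) for X ~ N(mu, Sigma).
    X = rng.multivariate_normal(mu, Sigma, size=200_000)
    M2 = np.einsum("ni,nj->ij", X, X) / len(X)    # E(X X')
    mu2_mc = M2.reshape(-1, order="F")            # vec(E(X X')) = E(X kron X)

    # Closed form: mu_2 = vec(Sigma + mu mu').
    mu2_exact = (Sigma + np.outer(mu, mu)).reshape(-1, order="F")
    print(np.round(mu2_mc, 2), np.round(mu2_exact, 2))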

References

Ash, J.M., Gatto, A.E. and Vági, S. (1990) A multidimensional Taylor's theorem with minimal hypothesis. Colloquium Mathematicum, 50/51, 245–252.

Avez, A. (1997). Calcul Différentiel. Masson, Paris.

Basilevsky, A. (1983). Applied Matrix Algebra in the Statistical Sciences. Elsevier, New York.

Baxandall, P. and Liebeck, H. (1986). Vector Calculus. Clarendon Press, Oxford.

Billingsley, P. (2012) Probability and Measure (Anniversary edition). John Wiley & Sons, Inc., Hoboken.

Brualdi, R.A. (2010) Introductory Combinatorics (5th edition). Prentice-Hall, New Jersey.

Chacón, J.E. and Duong, T. (2010) Multivariate plug-in bandwidth selection with unconstrained pilot bandwidth matrices. Test, 19, 375–398.

Chacón, J.E., Duong, T. and Wand, M.P. (2010) Asymptotics for general multivariate kernel density derivative estimators. Statistica Sinica, 21, 807–840.

Chacón, J.E. and Duong, T. (2015) Efficient recursive algorithms for functionals based on higher order derivatives of the multivariate Gaussian density. Statistics and Computing, 25, 959–974.

Chacón, J.E. and Duong, T. (2018) Multivariate Kernel Smoothing and Its Applications. Chapman & Hall, Boca Raton.

Constantine, G.M. and Savits, T.H. (1996) A multivariate Faà di Bruno formula with applications. Transactions of the American Mathematical Society, 348, 503–520.

Edwards, C. H. (1973) Advanced Calculus of Several Variables. Academic Press, New York.

Feller, W. (1968) An Introduction to Probability Theory and Its Applications, Volume I (3rd edition). Wiley & Sons, New York.

Flett, T. M. (1980) Differential Analysis: Differentiation, Differential Equations and Dif- ferential Inequalities. Cambridge University Press, Cambridge.

Graybill, F.A. (1983) Matrices with Applications in Statistics. Wadsworth, Belmont.

Hardy, M. (2006) Combinatorics of partial derivatives. The Electronic Journal of Combinatorics, 13, R1.

Holmquist, B. (1985a) The direct product permuting matrices. Linear Multilinear Algebra, 17, 117–141.

Holmquist, B. (1985b) Moments and cumulants from generating functions of Hilbert space-valued random variables and an application to the Wishart distribution. Statistical Research Report 1985:3, University of Lund.

Holmquist, B. (1988) Moments and cumulants of the multivariate normal distribution. Stochastic Analysis and Applications, 6, 273–278.

Holmquist, B. (1996) The d-variate vector Hermite polynomial of order k. Linear Algebra and Its Applications 237/238, 155–190.

Kollo, T. and von Rosen, D. (2005) Advanced Multivariate Statistics with Matrices. Springer, Dordrecht.

Magnus, J. (2010) On the concept of matrix derivative. Journal of Multivariate Analysis 101, 2200–2206.

Magnus, J.R. and Neudecker, H. (2019). Matrix Differential Calculus with Applications in Statistics and Econometrics (3rd edition). John Wiley & Sons, Chichester.

Pollock, D.S.G. (1985). Tensor products and matrix differential calculus. Linear Algebra and Its Applications 67, 169–193.

Ruppert, D. and Wand, M.P. (1994) Multivariate locally weighted least squares regression. Annals of Statistics, 22, 1346–1370.

Schott, J.R. (2003). Kronecker product permutation matrices and their application to moment matrices of the normal distribution. Journal of Multivariate Analysis 87, 177– 190.

Schott, J.R. (2017). Matrix Analysis for Statistics (3rd edition). John Wiley & Sons, New Jersey.

Searle, S.R. (1982) Matrix Algebra Useful for Statistics. Wiley, New York.

Spindler, K. (2005). A short proof of the formula of Faà di Bruno. Elemente der Mathematik, 60, 33–35.

Voinov, V.G. and Nikulin, M.S. (1997) On a subset sum algorithm and its probabilistic and other applications. In: Balakrishnan, N. (ed) Advances in Combinatorial Methods and Applications to Probability and Statistics, pp. 153–163. Birkhäuser, Boston.

Wand, M.P. (2002) Vector differential calculus in statistics. American Statistician, 56, 55–62.

Wand, M.P. and Jones, M.C. (1993) Comparison of smoothing parametrizations in bivariate kernel density estimation. Journal of the American Statistical Association, 88, 520–528.

Wand, M.P. and Jones, M.C. (1995) Kernel Smoothing. Chapman & Hall, London.

Wheeden, R.L. and Zygmund, A. (1977) Measure and Integral: An Introduction to Real Analysis. Marcel Dekker, New York.
