
Efficient Automatic Differentiation of Matrix Functions

Peder A. Olsen, Steven J. Rennie, and Vaibhava Goel IBM T.J. Watson Research Center

Automatic Differentiation 2012 Fort Collins, CO

July 25, 2012

How can we compute the derivative?

Numerical: Methods that do numeric differentiation by formulas like the finite difference

f'(x) ≈ (f(x + h) − f(x)) / h.

These methods are simple to program, but lose half of all significant digits.

Symbolic: Symbolic differentiation gives the symbolic representation of the derivative without saying how best to implement the derivative. In general, symbolic derivatives are as accurate as they come.

Automatic: Something between numeric and symbolic differentiation. The derivative implementation is computed from the code for f(x).


Finite Difference:  f'(x) ≈ (f(x + h) − f(x)) / h

Central Difference:  f'(x) ≈ (f(x + h) − f(x − h)) / (2h)

For h < εx, where ε is the machine precision, these approximations become zero. Thus h has to be chosen with care.


The error in the finite difference is bounded by

(h/2)‖f''‖∞ + (2ε‖f‖∞)/h,

whose minimum, 2·sqrt(ε‖f‖∞‖f''‖∞), is achieved for h = 2·sqrt(ε‖f‖∞/‖f''‖∞).

Numerical Differentiation Error

The graph shows the error in approximating the derivative of f(x) = e^x as a function of h.
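The plot itself is not reproduced here, but the behaviour is easy to recreate. A minimal numpy sketch (my own illustration, not from the slides) sweeps h for f(x) = e^x and prints the forward- and central-difference errors:

import numpy as np

# Error of forward and central differences for f(x) = exp(x) at x = 1.
# The true derivative is e; the error first shrinks with h (truncation error)
# and then grows again (cancellation error), as in the slide's plot.
f, x, true = np.exp, 1.0, np.e
for h in [1e-1, 1e-3, 1e-5, 1e-7, 1e-9, 1e-11, 1e-13]:
    fwd = (f(x + h) - f(x)) / h
    ctr = (f(x + h) - f(x - h)) / (2 * h)
    print(f"h={h:.0e}  forward error={abs(fwd - true):.2e}  central error={abs(ctr - true):.2e}")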

An imaginary derivative

The error stemming from the subtractive cancellation can be completely removed by using complex numbers. Let f : R → R; then

f(x + ih) ≈ f(x) + ih f'(x) + (ih)²/2! · f''(x) + (ih)³/3! · f⁽³⁾(x)
         = f(x) − (h²/2) f''(x) + ih (f'(x) − (h²/6) f⁽³⁾(x))

Thus the derivative can be approximated by

f'(x) ≈ Im(f(x + ih)) / h
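A minimal Python sketch of the complex-step derivative (my own illustration; it assumes f is written so that it accepts complex arguments, as numpy's ufuncs do):

import numpy as np

def complex_step_derivative(f, x, h=1e-30):
    # No subtraction of nearly equal numbers takes place, so h can be tiny.
    return np.imag(f(x + 1j * h)) / h

print(complex_step_derivative(np.exp, 1.0))   # ~2.718281828459045, accurate to machine precision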

Imaginary Differentiation Error

The graph shows the error in approximating the derivative of f(x) = e^x as a function of h.


By using dual numbers a + bε, where ε² = 0, we can further reduce the error. For second order derivatives we can use hyper-dual numbers (J. Fike, 2012). Complex numbers usually come with most programming languages; for dual or hyper-dual numbers we need to download or implement code.
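A minimal dual-number sketch in Python (my own illustration, not a particular AD library): arithmetic on a + bε with ε² = 0 carries the derivative along exactly.

class Dual:
    """A value a together with a derivative part b, representing a + b*eps with eps**2 = 0."""
    def __init__(self, a, b=0.0):
        self.a, self.b = a, b
    def _lift(self, other):
        return other if isinstance(other, Dual) else Dual(other)
    def __add__(self, other):
        other = self._lift(other)
        return Dual(self.a + other.a, self.b + other.b)
    def __sub__(self, other):
        other = self._lift(other)
        return Dual(self.a - other.a, self.b - other.b)
    def __mul__(self, other):
        other = self._lift(other)
        return Dual(self.a * other.a, self.a * other.b + self.b * other.a)
    __radd__, __rmul__ = __add__, __mul__

# Seeding the derivative part with 1 gives f(x) and f'(x) exactly:
x = Dual(2.0, 1.0)
y = x * x * x            # f(x) = x^3
print(y.a, y.b)          # 8.0 12.0  (f(2) = 8, f'(2) = 12)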

What is Automatic Differentiation?

Algorithmic Differentiation (often referred to as Automatic Differentiation or just AD) uses the software representation of a function to obtain an efficient method for calculating its derivatives. These derivatives can be of arbitrary order and are analytic in nature (do not have any truncation error). B. Bell – author of CppAD

Use of dual or complex numbers is a form of automatic differentiation. More common, though, are operator overloading implementations such as CppAD.

The Speelpenning function is

f(y) = ∏_{i=1}^n x_i(y).

We consider x_i = (y − t_i), so that f(y) = ∏_{i=1}^n (y − t_i). If we compute the symbolic derivative we get

f'(y) = ∑_{i=1}^n ∏_{j≠i} (y − t_j),

which requires n(n − 1) multiplications to compute with a straightforward implementation. Rewriting the expression as

f'(y) = ∑_{i=1}^n f(y)/(y − t_i)

reduces the computational cost, but is not numerically stable if y ≈ t_i.

Reverse Mode Differentiation

Define f_k = ∏_{i=1}^k (y − t_i) and r_k = ∏_{i=k}^n (y − t_i). Then

f'(y) = ∑_{i=1}^n ∏_{j≠i} (y − t_j) = ∑_{i=1}^n f_{i−1} r_{i+1},

and f_i = (y − t_i) f_{i−1}, f_1 = y − t_1;  r_i = (y − t_i) r_{i+1}, r_n = y − t_n. Note that f(y) = f_n = r_1. The cost of computing f(y) is n subtractions and n − 1 multiplies. At the overhead of storing f_1, ..., f_n the derivative can be computed with an additional n − 1 additions and 2n − 1 multiplications.
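A short numpy sketch of this forward/reverse product scheme (my own illustration of the recurrences above):

import numpy as np

def speelpenning(y, t):
    """Return f(y) = prod_i (y - t_i) and f'(y) using the forward products f_i
    and the reverse products r_i, at the cost of storing a few length-n arrays."""
    d = y - t
    f = np.cumprod(d)                # f[i] = prod_{j <= i} (y - t_j)
    r = np.cumprod(d[::-1])[::-1]    # r[i] = prod_{j >= i} (y - t_j)
    left = np.concatenate(([1.0], f[:-1]))    # f_{i-1}, empty product = 1
    right = np.concatenate((r[1:], [1.0]))    # r_{i+1}, empty product = 1
    return f[-1], np.sum(left * right)

t = np.linspace(0.0, 1.0, 6)
val, grad = speelpenning(0.37, t)
# check against the (numerically unstable) formula f'(y) = sum_i f(y)/(y - t_i)
print(np.isclose(grad, np.sum(val / (0.37 - t))))   # True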

Reverse Mode Differentiation

Under quite realistic assumptions the evaluation of a gradient requires never more than five times the effort of evaluating the underlying function by itself. Andreas Griewank (1988)


Reverse mode differentiation (1971) is essentially applying the chain rule to a function in the reverse order in which the function is computed. This is the same technique as applied in the backpropagation algorithm (1969) for neural networks and the forward-backward algorithm for training HMMs (1970). Reverse mode differentiation applies to very general classes of functions.


Reverse mode differentiation was independently invented by
G. M. Ostrovskii (1971)
S. Linnainmaa (1976)
John Larson and Ahmed Sameh (1978)
Bert Speelpenning (1980)
Not surprisingly, the fame went to B. Speelpenning, who admittedly told the full story at AD 2012. A. Sameh was on B. Speelpenning’s thesis committee. The Speelpenning function was suggested as a motivation for Speelpenning’s thesis by his Canadian interim advisor, Prof. Art Sedgwick, who passed away on July 5th.

Why worry about matrix differentiation?

The differentiation of functions of a matrix argument is still not well understood.
No simple “calculus” for computing matrix derivatives.
Matrix derivatives can be complicated to compute even for quite simple expressions.
Goal: Build on known results to define a matrix-derivative calculus.

Quiz

The Tikhonov regularized maximum (log-)likelihood for learning the covariance Σ given an empirical covariance S is

f(Σ) = −log det(Σ) − trace(SΣ⁻¹) − ‖Σ⁻¹‖²_F.

What are f'(Σ) and f''(Σ)? Statisticians can compute it. Can you? Can a calculus be reverse engineered?

Terminology

Classical calculus: scalar-scalar functions (f : R → R) and scalar-vector functions (f : R^d → R).
Here: scalar-matrix functions (f : R^{m×n} → R) and matrix-matrix functions (f : R^{m×n} → R^{k×l}).
Note:
1 Derivatives of scalar-matrix functions require derivatives of matrix-matrix functions (why?)
2 Matrix-matrix derivatives should be computed implicitly wherever possible (why?)

Optimizing Scalar-Matrix Functions

Useful quantities for optimizing a scalar-matrix function

f(X) : R^{d×d} → R. The gradient is a matrix–matrix function:

f'(X) : R^{d×d} → R^{d×d}.

The Hessian is also a matrix-matrix function:

f''(X) : R^{d×d} → R^{d²×d²}.

Desiderata: The derivative of a matrix–matrix function should be a matrix, so that we can directly use it to compute the Hessian.

Optimizing Scalar-Matrix Functions (continued)

Taking the scalar–matrix derivative of

f (G(X))

will require the information in the matrix–matrix derivative ∂G/∂X. Desiderata: The derivative of a matrix-matrix function should be a matrix, so that a convenient chain rule can be established.

The Matrix-Matrix Derivative
We define the matrix–matrix derivative to be

∂F/∂X := ∂vec(F^⊤)/∂vec^⊤(X^⊤) =

[ ∂f11/∂x11  ∂f11/∂x12  ···  ∂f11/∂xmn ]
[ ∂f12/∂x11  ∂f12/∂x12  ···  ∂f12/∂xmn ]
[     ⋮          ⋮       ⋱       ⋮     ]
[ ∂fkl/∂x11  ∂fkl/∂x12  ···  ∂fkl/∂xmn ]

Note that the matrix–matrix derivative of a scalar–matrix function is not the same as the scalar–matrix derivative:

∂mat(f)/∂X = vec^⊤( (∂f/∂X)^⊤ ).

Derivatives of Linear Matrix Functions

1 A matrix–matrix derivative is a matrix outer-product:  F(X) = trace(AX)B,  ∂F/∂X = vec(B^⊤) vec^⊤(A).

2 A matrix–matrix derivative is a Kronecker product:  F(X) = AXB,  ∂F/∂X = A ⊗ B^⊤.

3 A matrix–matrix derivative cannot be expressed with standard operations. We call the new operation the box product (written ⊠):  F(X) = AX^⊤B,  ∂F/∂X = A ⊠ B^⊤.
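These closed forms can be checked numerically. The sketch below (my own illustration) builds the matrix–matrix derivative by finite differences, using row-major flattening on both axes to match the ∂vec(F^⊤)/∂vec^⊤(X^⊤) layout reconstructed above — this convention is an assumption on my part — and compares case 2 against numpy's Kronecker product; case 3 can be checked the same way once the box product is constructed, as in the next sketch.

import numpy as np

def matder(F, X, eps=1e-6):
    """Finite-difference matrix-matrix derivative dvec(F^T)/dvec^T(X^T):
    rows follow F in row-major order, columns follow X in row-major order."""
    F0 = F(X)
    D = np.zeros((F0.size, X.size))
    for idx in range(X.size):
        dX = np.zeros_like(X)
        dX.flat[idx] = eps               # .flat enumerates X in row-major order
        D[:, idx] = ((F(X + dX) - F0) / eps).ravel()   # .ravel() is row-major
    return D

rng = np.random.default_rng(0)
k, m, n, l = 2, 3, 4, 5
A = rng.standard_normal((k, m))
B = rng.standard_normal((n, l))
X = rng.standard_normal((m, n))

# Case 2: F(X) = A X B  should give  dF/dX = A kron B^T
print(np.allclose(matder(lambda Y: A @ Y @ B, X), np.kron(A, B.T), atol=1e-5))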

Direct Matrix Products

A Direct Matrix Product, X = A ~ B, is a matrix whose elements are

x_{(i1 i2)(i3 i4)} = a_{i_σ(1) i_σ(2)} · b_{i_σ(3) i_σ(4)}.

(i1 i2) is shorthand for (i1 − 1)n2 + i2, where i2 ∈ {1, 2, ..., n2}. σ is a permutation of the four indices, so there are 4! = 24 direct matrix products. The direct matrix products are central to matrix–matrix differentiation. Eight can be expressed using Kronecker products:

A ⊗ B,  A^⊤ ⊗ B,  A ⊗ B^⊤,  A^⊤ ⊗ B^⊤,  B ⊗ A,  B^⊤ ⊗ A,  B ⊗ A^⊤,  B^⊤ ⊗ A^⊤

Another 8 can be expressed using matrix outer products, and the final 8 using the box product.

Definition

Let A ∈ R^{m1×n1} and B ∈ R^{m2×n2}.

Definition (Kronecker Product)

A ⊗ B ∈ R^{(m1 m2)×(n1 n2)} is defined by (A ⊗ B)_{(i−1)m2+j, (k−1)n2+l} = a_{ik} b_{jl} = (A ⊗ B)_{(ij)(kl)}.

Definition (Box Product)

A ⊠ B ∈ R^{(m1 m2)×(n1 n2)} is defined by (A ⊠ B)_{(i−1)m2+j, (k−1)n1+l} = a_{il} b_{jk} = (A ⊠ B)_{(ij)(kl)}.
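A direct translation of the two definitions into numpy (my own sketch; indices are shifted to 0-based):

import numpy as np

def kron(A, B):
    # (A kron B)[i*m2 + j, k*n2 + l] = A[i, k] * B[j, l]
    m1, n1 = A.shape
    m2, n2 = B.shape
    out = np.zeros((m1 * m2, n1 * n2))
    for i in range(m1):
        for j in range(m2):
            for k in range(n1):
                for l in range(n2):
                    out[i * m2 + j, k * n2 + l] = A[i, k] * B[j, l]
    return out

def box(A, B):
    # (A box B)[i*m2 + j, k*n1 + l] = A[i, l] * B[j, k]
    m1, n1 = A.shape
    m2, n2 = B.shape
    out = np.zeros((m1 * m2, n1 * n2))
    for i in range(m1):
        for j in range(m2):
            for k in range(n2):
                for l in range(n1):
                    out[i * m2 + j, k * n1 + l] = A[i, l] * B[j, k]
    return out

rng = np.random.default_rng(0)
A, B = rng.standard_normal((2, 3)), rng.standard_normal((4, 5))
print(np.allclose(kron(A, B), np.kron(A, B)))   # the Kronecker definition matches numpy's
print(box(A, B).shape)                          # (8, 15), same shape as A kron B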

An example Kronecker and box product

To see the structure consider the box product of two 2 × 2 matrices, A and B:

A ⊗ B =                                A ⊠ B =
[ a11b11  a11b12  a12b11  a12b12 ]     [ a11b11  a12b11  a11b12  a12b12 ]
[ a11b21  a11b22  a12b21  a12b22 ]     [ a11b21  a12b21  a11b22  a12b22 ]
[ a21b11  a21b12  a22b11  a22b12 ]     [ a21b11  a22b11  a21b12  a22b12 ]
[ a21b21  a21b22  a22b21  a22b22 ]     [ a21b21  a22b21  a21b22  a22b22 ]


1 2 3 ··· 10  1 1 1 ··· 1  1 2 3 ··· 10  2 2 2 ··· 2      1 2 3 ··· 10  3 3 3 ··· 3  B =   A =   . . . .   . . . .  . . . .   . . . .  . . . .  10 10 10 ··· 10 1 2 3 ··· 10

The slides show heatmap figures (not reproduced here) of A ⊗ B, A ⊠ B, and vec(A)vec^⊤(B) for these example matrices.

The box product behaves similarly to the Kronecker product:
1 Vector Multiplication:

(B^⊤ ⊗ A) vec(X) = vec(AXB)        (B^⊤ ⊠ A) vec(X) = vec(AX^⊤B)

2 Matrix Multiplication:

(A ⊗ B)(C ⊗ D) = (AC) ⊗ (BD)        (A ⊠ B)(C ⊠ D) = (AD) ⊗ (BC)

3 Inverse and Transpose:
(A ⊗ B)⁻¹ = A⁻¹ ⊗ B⁻¹        (A ⊠ B)⁻¹ = B⁻¹ ⊠ A⁻¹
(A ⊗ B)^⊤ = A^⊤ ⊗ B^⊤        (A ⊠ B)^⊤ = B^⊤ ⊠ A^⊤

4 Mixed Products:

(A ⊗ B)(C ⊠ D) = (AC) ⊠ (BD)        (A ⊠ B)(C ⊗ D) = (AD) ⊠ (BC)
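A numpy sketch (my own illustration) that spot-checks these identities; the box product is built with einsum from the definition, and the vec identities are interpreted with column-stacking vec, which is an assumption on my part:

import numpy as np

def box(A, B):
    # (A box B)[i*m2 + j, k*n1 + l] = A[i, l] * B[j, k]
    m1, n1 = A.shape
    m2, n2 = B.shape
    return np.einsum('il,jk->ijkl', A, B).reshape(m1 * m2, n1 * n2)

vec = lambda M: M.flatten(order='F')           # column-stacking vec (assumed)

rng = np.random.default_rng(1)
A, B, C, D, X = (rng.standard_normal((3, 3)) for _ in range(5))

print(np.allclose(np.kron(B.T, A) @ vec(X), vec(A @ X @ B)))       # vector multiplication
print(np.allclose(box(B.T, A) @ vec(X), vec(A @ X.T @ B)))         # ... box version
print(np.allclose(box(A, B) @ box(C, D), np.kron(A @ D, B @ C)))   # matrix multiplication
print(np.allclose(np.linalg.inv(box(A, B)), box(np.linalg.inv(B), np.linalg.inv(A))))  # inverse
print(np.allclose(np.kron(A, B) @ box(C, D), box(A @ C, B @ D)))   # mixed product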

Let A ∈ R^{m1×n1}, B ∈ R^{m2×n2}. Then

Trace:

trace(A ⊗ B) = trace(A) trace(B)        trace(A ⊠ B) = trace(AB)

Determinant (here m1 = n1 and m2 = n2 is required):
det(A ⊗ B) = (det(A))^{m2} (det(B))^{m1}
det(A ⊠ B) = (−1)^{C(m1,2)·C(m2,2)} (det(A))^{m2} (det(B))^{m1}, where C(n,2) = n(n−1)/2.

Associativity:

(A ⊗ B) ⊗ C = A ⊗ (B ⊗ C)        (A ⊠ B) ⊠ C = A ⊠ (B ⊠ C),
but not for mixed products. In general we have

(A ⊗ B) ⊠ C ≠ A ⊗ (B ⊠ C)        (A ⊠ B) ⊗ C ≠ A ⊠ (B ⊗ C).

Identity Box Products

The identity box product I_m ⊠ I_n is a permutation matrix with interesting properties. Let A ∈ R^{m1×n1}, B ∈ R^{m2×n2}.
Orthonormal: (I_m ⊠ I_n)^⊤ (I_m ⊠ I_n) = I_{mn}
Transposition: (I_{m1} ⊠ I_{n1}) vec(A) = vec(A^⊤)
Connector: Converting a Kronecker product to a box product:

(A ⊗ B)(I_{n1} ⊠ I_{n2}) = A ⊠ B.
Converting a box product to a Kronecker product:

(A ⊠ B)(I_{n2} ⊠ I_{n1}) = A ⊗ B.
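Again a small numpy check (my own sketch; column-stacking vec is assumed):

import numpy as np

def box(A, B):
    m1, n1 = A.shape
    m2, n2 = B.shape
    return np.einsum('il,jk->ijkl', A, B).reshape(m1 * m2, n1 * n2)

vec = lambda M: M.flatten(order='F')

rng = np.random.default_rng(2)
m1, n1, m2, n2 = 3, 4, 2, 5
A = rng.standard_normal((m1, n1))
B = rng.standard_normal((m2, n2))

K = box(np.eye(m1), np.eye(n1))
print(np.allclose(K.T @ K, np.eye(m1 * n1)))                                 # orthonormal
print(np.allclose(K @ vec(A), vec(A.T)))                                     # transposition
print(np.allclose(np.kron(A, B) @ box(np.eye(n1), np.eye(n2)), box(A, B)))   # connector
print(np.allclose(box(A, B) @ box(np.eye(n2), np.eye(n1)), np.kron(A, B)))   # ... and back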

Box products are old things in a new wrapping

Although the notation for box products is new, I_m ⊠ I_n has long been known as T_{m,n} by physicists, or as the stride permutation L^{mn}_m by others. These objects are identical, but the box product allows us to express more complex identities more compactly, e.g. let A ∈ R^{m1×n1}, B ∈ R^{m2×n2}; then

(I_{m1} ⊠ I_{n1 m2} ⊠ I_{n2}) vec((A ⊗ B)^⊤) = vec(vec(A) vec^⊤(B)).
The initial permutation matrix would have to be written

I_{m1} ⊠ I_{n1 m2} ⊠ I_{n2} = (I_{m1} ⊗ T_{n1 m2, m1}) T_{m1, n1 m1 m2},
if the notation of box products wasn’t being used.

The Fast Fourier Transform

Direct matrix products are important because such matrices can be multiplied fast with each other and with vectors. Let us show how the FFT can be done in terms of Kronecker and box products. Recall that the DFT matrix of order n is given by

F_n = [e^{−2πikl/n}]_{0≤k,l<n}

1 1 1 1    1 1 1 −i −1 i  F2 = , F4 =   1 −1 1 −1 1 −1 1 i −1 −i


We can factor the matrix F4 as follows:

1 1  1  1 1  1   1 1   1  1 −1   1  F4 =         1 −1   1   1 1   1  1 −1 −i 1 −1 1

or more compactly

F_4 = (F_2 ⊗ I_2) · diag( vec [ 1   1 ] ) · (I_2 ⊠ F_2)
                              [ 1  −i ]


In general if we define the matrix

1 1 1 ... 1  1 α α2 . . . αM−1    1 α2 α4 . . . α2(M−1)  VN,M (α) =   , ......  . . . . .  1 αN−1 α2(N−1) . . . α(N−1)(M−1)

For N = km we have the following factorizations of the DFT matrix:

F_N = (F_k ⊗ I_m) · diag(vec(V_{m,k}(ω_N))) · (I_k ⊠ F_m)
F_N = (F_m ⊠ I_k) · diag(vec(V_{m,k}(ω_N))) · (F_k ⊗ I_m)
where ω_N = e^{−2πi/N}.


This allows FFT_N(x) = y = F_N x to be computed as

F_N x = vec((V_{m,k}(ω_N) ∘ (F_m X^⊤)) F_k^⊤),
where x = vec(X) and ∘ denotes elementwise multiplication (.* in Matlab). The direct computation has a cost of O(k²m²), whereas the formula above does the job in O(km(k + m)) operations. Repeated use of the identity for N = 2^n leads to the Cooley-Tukey FFT algorithm. (James Cooley was at T. J. Watson from 1962 to 1991.) The fastest FFT library in the world as of 2012 (Spiral) uses knowledge of several such factorizations to automatically optimize the FFT implementation for an arbitrary value of n to a given platform!
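A numpy sketch of this split (my own illustration; it assumes column-stacking vec and x = vec(X) with X of size k × m, which is my reading of the reconstructed formula):

import numpy as np

def dft(n):
    idx = np.arange(n)
    return np.exp(-2j * np.pi * np.outer(idx, idx) / n)

k, m = 4, 6
N = k * m
V = np.exp(-2j * np.pi * np.outer(np.arange(m), np.arange(k)) / N)   # V_{m,k}(omega_N)

x = np.random.default_rng(3).standard_normal(N) + 0j
X = x.reshape(k, m, order='F')                       # x = vec(X), column-stacking

y_fast = ((V * (dft(m) @ X.T)) @ dft(k).T).flatten(order='F')
print(np.allclose(y_fast, dft(N) @ x))               # True: same result, fewer operations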


Here we show the role of Kronecker and box products in the matrix–matrix differentiation framework.

Basic Differentiation Identities

Let X ∈ R^{m×n}.
Identity: F(X) = X,  F'(X) = I_{mn}
Transpose: F(X) = X^⊤,  F'(X) = I_m ⊠ I_n
Chain Rule:

F(X) = G(H(X)),  F'(X) = G'(H(X)) H'(X)

Product Rule:

F(X) = G(X)H(X),  F'(X) = (I ⊗ H^⊤(X)) G'(X) + (G(X) ⊗ I) H'(X)
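These rules can be sanity-checked with the same finite-difference construction as before (my own sketch, again assuming the row-major dvec(F^⊤)/dvec^⊤(X^⊤) layout):

import numpy as np

def box(A, B):
    m1, n1 = A.shape
    m2, n2 = B.shape
    return np.einsum('il,jk->ijkl', A, B).reshape(m1 * m2, n1 * n2)

def matder(F, X, eps=1e-6):
    F0 = F(X)
    D = np.zeros((F0.size, X.size))
    for idx in range(X.size):
        dX = np.zeros_like(X)
        dX.flat[idx] = eps
        D[:, idx] = ((F(X + dX) - F0) / eps).ravel()
    return D

rng = np.random.default_rng(4)

# Transpose rule: F(X) = X^T  =>  F'(X) = I_m box I_n
m, n = 3, 4
X = rng.standard_normal((m, n))
print(np.allclose(matder(lambda Y: Y.T, X), box(np.eye(m), np.eye(n)), atol=1e-5))

# Product rule for G(X) = A X and H(X) = X B (X square so G(X)H(X) is defined)
n = 3
X = rng.standard_normal((n, n))
A = rng.standard_normal((2, n))
B = rng.standard_normal((n, 5))
G = lambda Y: A @ Y
H = lambda Y: Y @ B
lhs = matder(lambda Y: G(Y) @ H(Y), X)
rhs = np.kron(np.eye(2), H(X).T) @ matder(G, X) + np.kron(G(X), np.eye(5)) @ matder(H, X)
print(np.allclose(lhs, rhs, atol=1e-4))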

More Derivative Identities

Assume X ∈ R^{m×m} is a square matrix.
Square: F(X) = X²,  F'(X) = I_m ⊗ X^⊤ + X ⊗ I_m.
Inverse: F(X) = X⁻¹,  F'(X) = −X⁻¹ ⊗ X^{−⊤}.
Inverse + Transpose: F(X) = X^{−⊤},  F'(X) = −X^{−⊤} ⊠ X⁻¹.
Square Root: F(X) = X^{1/2},  F'(X) = ( I ⊗ (X^{1/2})^⊤ + X^{1/2} ⊗ I )⁻¹.
Integer Power: F(X) = X^k,  F'(X) = ∑_{i=0}^{k−1} X^i ⊗ (X^{k−1−i})^⊤.


First some simple identities:

f(X) = trace(AX)        f'(X) = A^⊤
f(X) = trace(AX^⊤)      f'(X) = A
f(X) = log det(X)       f'(X) = X^{−⊤}

Then the chain rule for more general expressions:

vec^⊤( (∂/∂X log det(G(X)))^⊤ ) = vec^⊤( (G(X))⁻¹ ) · ∂G/∂X

vec^⊤( (∂/∂X trace(G(X)))^⊤ ) = vec^⊤(I) · ∂G/∂X
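The three simple identities are easy to confirm numerically (my own sketch):

import numpy as np

def num_grad(f, X, eps=1e-6):
    G = np.zeros_like(X)
    for i in range(X.shape[0]):
        for j in range(X.shape[1]):
            dX = np.zeros_like(X)
            dX[i, j] = eps
            G[i, j] = (f(X + dX) - f(X)) / eps
    return G

rng = np.random.default_rng(7)
n = 4
A = rng.standard_normal((n, n))
M = rng.standard_normal((n, n))
X = M @ M.T + np.eye(n)          # positive definite, so log det is well defined

print(np.allclose(num_grad(lambda Y: np.trace(A @ Y), X), A.T, atol=1e-5))
print(np.allclose(num_grad(lambda Y: np.trace(A @ Y.T), X), A, atol=1e-5))
print(np.allclose(num_grad(lambda Y: np.log(np.linalg.det(Y)), X), np.linalg.inv(X).T, atol=1e-4))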


The chain rule to get scalar-matrix derivatives is awkward to use. Instead we have some short-cuts.

∂/∂X trace(G(X)H(X)) = ∂/∂X trace( H(Y)G(X) + H(X)G(Y) ) |_{Y=X} ,

∂/∂X trace(A F⁻¹(X)) = −∂/∂X trace( F⁻¹(Y) A F⁻¹(Y) F(X) ) |_{Y=X} ,
and

∂ log det(F(X)) / ∂X = ∂/∂X trace( F⁻¹(Y) F(X) ) |_{Y=X} .

Peder, Steven and Vaibhava Matrix Differentiation The answer is: (X − C)>(X − B)−> + (X − B)−>(X − A)> +(X − B)−>(X − A)>(X − C)>(X − B)−>

Introduction Product and Chain Rule The Kronecker and Box Product Matrix Powers Matrix Differentiation Trace and Log Determinant Optimization Forward and Reverse Mode Differentiation

Use the formulas on the previous page to compute ∂ trace((X − A)(X − B)−1(X − C)) ∂X

Peder, Steven and Vaibhava Matrix Differentiation Introduction Product and Chain Rule The Kronecker and Box Product Matrix Powers Matrix Differentiation Trace and Log Determinant Optimization Forward and Reverse Mode Differentiation

Use the formulas on the previous page to compute

∂/∂X trace((X − A)(X − B)⁻¹(X − C)).

The answer is:

(X − C)^⊤(X − B)^{−⊤} + (X − B)^{−⊤}(X − A)^⊤ − (X − B)^{−⊤}(X − A)^⊤(X − C)^⊤(X − B)^{−⊤}
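A finite-difference check of this expression (my own sketch; note the minus sign on the last term, which follows from d(X − B)⁻¹ = −(X − B)⁻¹ dX (X − B)⁻¹):

import numpy as np

rng = np.random.default_rng(6)
n = 4
A, B, C = (rng.standard_normal((n, n)) for _ in range(3))
X = rng.standard_normal((n, n)) + 3 * np.eye(n)       # keep X - B well conditioned

def f(X):
    return np.trace((X - A) @ np.linalg.solve(X - B, X - C))

iBT = np.linalg.inv(X - B).T
grad = (X - C).T @ iBT + iBT @ (X - A).T - iBT @ (X - A).T @ (X - C).T @ iBT

num = np.zeros_like(X)
eps = 1e-6
for i in range(n):
    for j in range(n):
        dX = np.zeros_like(X)
        dX[i, j] = eps
        num[i, j] = (f(X + dX) - f(X)) / eps
print(np.allclose(grad, num, atol=1e-4))   # True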


Finally, if

r(x) = q(x)/p(x) = ( ∑_{i=1}^n a_i x^i ) / ( ∑_{j=1}^n b_j x^j )

is a scalar-scalar function, we can form a matrix-matrix function by simply substituting a matrix X for the scalar x. Then

∂/∂X trace(r(X)) = r'(X)^⊤

and if h(x) = log(r(x)) then

∂/∂X log det(r(X)) = h'(X)^⊤.

Compute f(X) = trace(F(X)) = trace(((I + X)⁻¹X^⊤)X)

The slides build this up as an expression tree (shown as a sequence of tree figures, not reproduced here) with leaves I and X and internal nodes +, (·)⁻¹, (·)^⊤ and ×. Evaluating the tree bottom-up gives the intermediate quantities

T1 = I + X,   T2 = T1⁻¹,   T3 = X^⊤,   T4 = T2 T3,   T5 = T4 X,   f = trace(T5).

Forward mode computation of f'(X):

Walking up the same tree, each node's derivative is assembled from its children's derivatives; the leaves contribute X' = I ⊗ I and I' = 0.

(G + H)' = G' + H':                                  T1' = 0 + I ⊗ I
Chain rule, (T1⁻¹)' = −(T1⁻¹ ⊗ T1^{−⊤}) T1':         T2' = −(T2 ⊗ T2^⊤) T1'
(X^⊤)' = I ⊠ I:                                      T3' = I ⊠ I
Product rule, (GH)' = (I ⊗ H^⊤) G' + (G ⊗ I) H':
                                                     T4' = (I ⊗ T3^⊤) T2' + (T2 ⊗ I) T3'
                                                     T5' = (I ⊗ X^⊤) T4' + (T4 ⊗ I)(I ⊗ I)
((trace(F))')^⊤ = vec⁻¹(vec^⊤(I) F'):                (f')^⊤ = vec⁻¹(vec^⊤(I) T5')

Forward mode differentiation requires huge intermediate matrices. Only the last step reduces the size. Critical points: F'(X) is composed of Kronecker, box (and outer) products for a large class of functions (wow!). These can be “unwound” by multiplication with vectorized scalar-matrix derivatives (gasp!):

vec^⊤(C)(A ⊗ B) = vec^⊤(B^⊤CA)

Reverse mode differentiation: Evaluate the derivative from top to bottom. Small scalar-matrix derivatives are propagated down the tree.

Reverse Mode Differentiation: (f')^⊤ = vec⁻¹(vec^⊤(I) F')

Starting from R0 = I at the root, small scalar-matrix adjoints are pushed down the same expression tree (again drawn as a sequence of tree figures on the slides), one unwinding identity per edge:

vec^⊤(R0)(I ⊗ X^⊤) = vec^⊤(X R0 I)              R1 = X R0
vec^⊤(R0)(T4 ⊗ I) = vec^⊤(I R0 T4)              R2 = R0 T4
vec^⊤(R1)(I ⊗ T3^⊤) = vec^⊤(T3 R1 I)            R3 = T3 R1
vec^⊤(R1)(T2 ⊗ I) = vec^⊤(I R1 T2)              R4 = R1 T2
vec^⊤(R3)(−T2 ⊗ T2^⊤) = vec^⊤(−T2 R3 T2)        R5 = −T2 R3 T2
vec^⊤(R4)(I ⊠ I) = vec^⊤(I R4^⊤ I)              R6 = R4^⊤
vec^⊤(R5) · 0 = vec^⊤(0)                        R7 = 0
vec^⊤(R5)(I ⊗ I) = vec^⊤(R5)                    R8 = R5

Collecting the adjoints that reach the X leaves gives

f'(X) = R2^⊤ + R6^⊤ + R8^⊤.
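A numpy sketch of this reverse sweep for the example (my own illustration), checked against a finite difference:

import numpy as np

rng = np.random.default_rng(5)
n = 4
X = rng.standard_normal((n, n)) / n       # keep I + X well conditioned
I = np.eye(n)

def f(X):
    return np.trace(np.linalg.solve(I + X, X.T) @ X)

T1 = I + X
T2 = np.linalg.inv(T1)
T3 = X.T
T4 = T2 @ T3
R0 = I
R1 = X @ R0
R2 = R0 @ T4
R3 = T3 @ R1
R4 = R1 @ T2
R5 = -T2 @ R3 @ T2
R6 = R4.T
R8 = R5
grad = R2.T + R6.T + R8.T

num = np.zeros_like(X)
eps = 1e-6
for i in range(n):
    for j in range(n):
        dX = np.zeros_like(X)
        dX[i, j] = eps
        num[i, j] = (f(X + dX) - f(X)) / eps
print(np.allclose(grad, num, atol=1e-4))   # True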


There is growing interest in Machine Learning in scalar–matrix objective functions:
Probabilistic Graphical Models
Covariance Selection
Optimization in Graphs and Networks
Data-mining in social networks
Can we use this theory to help optimize such functions? We are on the look-out for interesting problems.

The Anatomy of a Matrix-Matrix Function

Theorem. Let R(X) be a rational matrix–matrix function formed from constant matrices and K occurrences of X using arithmetic matrix operations (+, −, ∗ and (·)⁻¹) and transposition ((·)^⊤). Then the derivative of the matrix–matrix function is of the form

∑_{i=1}^{k1} A_i ⊗ B_i + ∑_{i=k1+1}^{K} A_i ⊠ B_i.

The matrices Ai and Bi are computed as parts of R(X).

Hessian Forms

The derivative of a function f of the form trace(R(X)) or log det(R(X)) is a rational function. Therefore, the Hessian is of the form

∑_{i=1}^{k1} A_i ⊗ B_i + ∑_{i=k1+1}^{K} A_i ⊠ B_i,

where K is the number of times X occurs in the expression for the derivative. If A_i, B_i are d × d matrices then multiplication by H can be done with O(Kd³) operations. Generalizations of this result can be found in the paper.


Let f : R^{d×d} → R with f(X) = trace(R(X)) or f(X) = log det(R(X)).
Derivative: G = f'(X) ∈ R^{d×d} and Hessian: H = f''(X) ∈ R^{d²×d²}.
The Newton direction vec(V) = H⁻¹vec(G) can be computed efficiently if K ≪ d:

K      Algorithm                       Complexity       Storage
1      (A ⊗ B)⁻¹ = A⁻¹ ⊗ B⁻¹           O(d³)            O(Kd²)
2      Bartels-Stewart algorithm       O(d³)            O(Kd²)
≥ 3    Conjugate Gradient algorithm    O(Kd⁵)           O(Kd²)
≥ d    General matrix inversion        O(d⁶ + Kd⁴)      O(d⁴)

The (generalized) Bartels-Stewart algorithm solves the Sylvester-like equation AX + X^⊤B = C.
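The K = 1 row of the table is the easiest to illustrate. A numpy sketch (my own illustration, using numpy's standard kron and column-stacking vec, which need not match the slides' layout): solving (A ⊗ B) vec(V) = vec(G) costs two d × d solves instead of one d² × d² solve.

import numpy as np

rng = np.random.default_rng(8)
d = 30
A = rng.standard_normal((d, d)) + d * np.eye(d)
B = rng.standard_normal((d, d)) + d * np.eye(d)
G = rng.standard_normal((d, d))

# With numpy's kron and column-stacking vec, (A kron B) vec(V) = vec(B V A^T),
# so V = B^{-1} G A^{-T}: two O(d^3) solves and O(d^2) storage.
V = np.linalg.solve(B, np.linalg.solve(A, G.T).T)

# Reference: form the d^2 x d^2 system explicitly (O(d^6) time, O(d^4) storage).
v_direct = np.linalg.solve(np.kron(A, B), G.flatten(order='F'))
print(np.allclose(V.flatten(order='F'), v_direct))   # True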


What about optimizing functions of the form

f(X) + ‖vec(X)‖₁ ?

Common approaches:

Strategy          Method               Use Hessian structure?
Newton-Lasso      Coordinate descent   ✓
                  FISTA                ✓
Orthantwise ℓ1    CG                   ✓
                  L-BFGS               ✗

It is not obvious how to take advantage of the Hessian structure for all these methods. For orthantwise CG, for example, the sub-problem requires the Newton direction for a sub-matrix of the Hessian.

Future Work

1 We have applied this methodology to the covariance selection problem – a paper is forthcoming.
2 We are digging deeper into the theory of matrix differentiation and the properties of box products.
3 We are on the look-out for more interesting matrix optimization problems. All suggestions are appreciated!
