The Kronecker Product SVD
Charles Van Loan
October 19, 2009

The Kronecker Product
$B \otimes C$ is a block matrix whose $(i,j)$ block is $b_{ij}C$. E.g.,

$$
\begin{bmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \end{bmatrix} \otimes C
=
\begin{bmatrix} b_{11}C & b_{12}C \\ b_{21}C & b_{22}C \end{bmatrix}
$$
Replicated Block Structure

The KP-SVD
If
$$
A = \begin{bmatrix} A_{11} & \cdots & A_{1N} \\ \vdots & \ddots & \vdots \\ A_{M1} & \cdots & A_{MN} \end{bmatrix},
\qquad A_{ij} \in \mathbb{R}^{p \times q},
$$
then there exists a positive integer $r_A$ with $r_A \le MN$ so that

$$
A = \sum_{k=1}^{r_A} \sigma_k\, B_k \otimes C_k, \qquad r_A = \mathrm{rank}_{KP}(A).
$$

The KP-singular values: $\sigma_1 \ge \cdots \ge \sigma_{r_A} > 0$. The $B_k \in \mathbb{R}^{M \times N}$ and $C_k \in \mathbb{R}^{p \times q}$ satisfy $\langle B_i, B_j \rangle_F = \delta_{ij}$ and $\langle C_i, C_j \rangle_F = \delta_{ij}$ (they are reshaped singular vectors).
Let $r$ be a positive integer that satisfies $r \le r_A$. The problem
$$
\min \; \| A - X \|_F \quad \text{subject to} \quad \mathrm{rank}_{KP}(X) = r
$$
is solved by setting
$$
X^{(opt)} = \sum_{k=1}^{r} \sigma_k\, B_k \otimes C_k.
$$

Talk Outline
1. Survey of Essential KP Properties. Just enough to get through the talk.
2. Computing the KP-SVD. It's an SVD computation.
3. Nearest KP Preconditioners. Solving KP systems is fast.
4. Some Constrained Nearest KP Problems. Nearest (Markov) ⊗ (Markov).
5. Multilinear Connections. A low-rank approximation of a 4-dimensional tensor.
6. Off-The-Wall / Just-For-Fun. Computing log(det(A)) for large sparse positive definite A.

Essential KP Properties

Every $b_{ij}c_{k\ell}$ Shows Up
$$
\begin{bmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \end{bmatrix}
\otimes
\begin{bmatrix} c_{11} & c_{12} & c_{13} \\ c_{21} & c_{22} & c_{23} \\ c_{31} & c_{32} & c_{33} \end{bmatrix}
=
\begin{bmatrix}
b_{11}c_{11} & b_{11}c_{12} & b_{11}c_{13} & b_{12}c_{11} & b_{12}c_{12} & b_{12}c_{13} \\
b_{11}c_{21} & b_{11}c_{22} & b_{11}c_{23} & b_{12}c_{21} & b_{12}c_{22} & b_{12}c_{23} \\
b_{11}c_{31} & b_{11}c_{32} & b_{11}c_{33} & b_{12}c_{31} & b_{12}c_{32} & b_{12}c_{33} \\
b_{21}c_{11} & b_{21}c_{12} & b_{21}c_{13} & b_{22}c_{11} & b_{22}c_{12} & b_{22}c_{13} \\
b_{21}c_{21} & b_{21}c_{22} & b_{21}c_{23} & b_{22}c_{21} & b_{22}c_{22} & b_{22}c_{23} \\
b_{21}c_{31} & b_{21}c_{32} & b_{21}c_{33} & b_{22}c_{31} & b_{22}c_{32} & b_{22}c_{33}
\end{bmatrix}
$$

Hierarchical
$$
A = \begin{bmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \end{bmatrix}
\otimes
\begin{bmatrix} c_{11} & c_{12} & c_{13} & c_{14} \\ c_{21} & c_{22} & c_{23} & c_{24} \\ c_{31} & c_{32} & c_{33} & c_{34} \\ c_{41} & c_{42} & c_{43} & c_{44} \end{bmatrix}
\otimes
\begin{bmatrix} d_{11} & d_{12} & d_{13} \\ d_{21} & d_{22} & d_{23} \\ d_{31} & d_{32} & d_{33} \end{bmatrix}
$$

$A$ is a 2-by-2 block matrix whose entries are 4-by-4 block matrices whose entries are 3-by-3 matrices.

Algebra
$$
\begin{aligned}
(B \otimes C)^T &= B^T \otimes C^T \\
(B \otimes C)^{-1} &= B^{-1} \otimes C^{-1} \\
(B \otimes C)(D \otimes F) &= BD \otimes CF \\
B \otimes (C \otimes D) &= (B \otimes C) \otimes D
\end{aligned}
$$

No: $B \otimes C \ne C \otimes B$.

Yes: $B \otimes C = (\text{Perfect Shuffle})\,(C \otimes B)\,(\text{Perfect Shuffle})^T$

The vec Operation
Turns matrices into vectors by stacking columns:
$$
X = \begin{bmatrix} 1 & 10 \\ 2 & 20 \\ 3 & 30 \end{bmatrix}
\;\Rightarrow\;
\mathrm{vec}(X) = \begin{bmatrix} 1 \\ 2 \\ 3 \\ 10 \\ 20 \\ 30 \end{bmatrix}
$$
Important special case:
$$
\mathrm{vec}(\text{rank-1 matrix}) = \mathrm{vec}\!\left( \begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix} \begin{bmatrix} 1 & 10 \end{bmatrix} \right)
= \begin{bmatrix} 1 \\ 10 \end{bmatrix} \otimes \begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix}
$$

Reshaping
The matrix equation
$$
Y = CXB^T
$$
can be reshaped into a vector equation
$$
\mathrm{vec}(Y) = (B \otimes C)\,\mathrm{vec}(X).
$$
This implies fast linear equation solving and fast matrix-vector multiplication. (More later.)
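A quick way to convince yourself of this identity is a few lines of NumPy. This sketch (mine, not from the talk) assumes the column-major vec defined above:

    import numpy as np

    def vec(X):
        # Stack the columns of X (column-major), matching vec() on the slides.
        return X.reshape(-1, order="F")

    rng = np.random.default_rng(0)
    B = rng.standard_normal((4, 3))
    C = rng.standard_normal((5, 2))
    X = rng.standard_normal((2, 3))

    Y = C @ X @ B.T                                     # Y = C X B^T
    assert np.allclose(vec(Y), np.kron(B, C) @ vec(X))  # vec(Y) = (B kron C) vec(X)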
Inheriting Structure

If B and C are ...            then B ⊗ C is ...
--------------------------    ------------------------
nonsingular                   nonsingular
lower (upper) triangular      lower (upper) triangular
banded                        block banded
symmetric                     symmetric
positive definite             positive definite
stochastic                    stochastic
Toeplitz                      block Toeplitz
permutations                  a permutation
orthogonal                    orthogonal

Computing the KP-SVD

Warm-Up: The Nearest KP Problem
Given $A \in \mathbb{R}^{m \times n}$ with $m = m_1 m_2$ and $n = n_1 n_2$.

Find $B \in \mathbb{R}^{m_1 \times n_1}$ and $C \in \mathbb{R}^{m_2 \times n_2}$ so that
$$
\phi(B, C) = \| A - B \otimes C \|_F = \min.
$$

A bilinear least squares problem. Fix B (or C) and it becomes linear in C (or B).

Reshaping the Nearest KP Problem
$$
\phi(B, C) = \left\| \begin{bmatrix}
a_{11} & a_{12} & a_{13} & a_{14} \\
a_{21} & a_{22} & a_{23} & a_{24} \\
a_{31} & a_{32} & a_{33} & a_{34} \\
a_{41} & a_{42} & a_{43} & a_{44} \\
a_{51} & a_{52} & a_{53} & a_{54} \\
a_{61} & a_{62} & a_{63} & a_{64}
\end{bmatrix}
-
\begin{bmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \\ b_{31} & b_{32} \end{bmatrix}
\otimes
\begin{bmatrix} c_{11} & c_{12} \\ c_{21} & c_{22} \end{bmatrix}
\right\|_F
$$

$$
= \left\| \begin{bmatrix}
a_{11} & a_{21} & a_{12} & a_{22} \\
a_{31} & a_{41} & a_{32} & a_{42} \\
a_{51} & a_{61} & a_{52} & a_{62} \\
a_{13} & a_{23} & a_{14} & a_{24} \\
a_{33} & a_{43} & a_{34} & a_{44} \\
a_{53} & a_{63} & a_{54} & a_{64}
\end{bmatrix}
-
\begin{bmatrix} b_{11} \\ b_{21} \\ b_{31} \\ b_{12} \\ b_{22} \\ b_{32} \end{bmatrix}
\begin{bmatrix} c_{11} & c_{21} & c_{12} & c_{22} \end{bmatrix}
\right\|_F
$$

!!! Finding the nearest rank-1 matrix is an SVD problem !!!

SVD Primer
$$
A \in \mathbb{R}^{m \times n} \;\Rightarrow\; U^T A V = \Sigma = \mathrm{diag}(\sigma_1, \ldots, \sigma_n)
$$
If $U = [u_1 \,|\, u_2 \,|\, \cdots \,|\, u_m]$ and $V = [v_1 \,|\, v_2 \,|\, \cdots \,|\, v_n]$ then:

• The rank-1 matrix $\sigma_1 u_1 v_1^T$ solves
$$
\min_{\mathrm{rank}(\tilde A) = 1} \| A - \tilde A \|_F.
$$

• $v_1$ is the dominant eigenvector of $A^T A$:
$$
A^T A v_1 = \sigma_1^2 v_1, \qquad A v_1 = \sigma_1 u_1, \qquad \sigma_1 = u_1^T A v_1.
$$

• $u_1$ is the dominant eigenvector of $AA^T$:
$$
AA^T u_1 = \sigma_1^2 u_1, \qquad A^T u_1 = \sigma_1 v_1, \qquad \sigma_1 = v_1^T A^T u_1.
$$

Sol'n: SVD of Permuted A + Reshaping
$$
\phi(B, C) = \left\| \begin{bmatrix}
a_{11} & a_{12} & a_{13} & a_{14} \\
a_{21} & a_{22} & a_{23} & a_{24} \\
a_{31} & a_{32} & a_{33} & a_{34} \\
a_{41} & a_{42} & a_{43} & a_{44} \\
a_{51} & a_{52} & a_{53} & a_{54} \\
a_{61} & a_{62} & a_{63} & a_{64}
\end{bmatrix}
-
\begin{bmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \\ b_{31} & b_{32} \end{bmatrix}
\otimes
\begin{bmatrix} c_{11} & c_{12} \\ c_{21} & c_{22} \end{bmatrix}
\right\|_F
$$

$$
= \left\| \begin{bmatrix}
a_{11} & a_{21} & a_{12} & a_{22} \\
a_{31} & a_{41} & a_{32} & a_{42} \\
a_{51} & a_{61} & a_{52} & a_{62} \\
a_{13} & a_{23} & a_{14} & a_{24} \\
a_{33} & a_{43} & a_{34} & a_{44} \\
a_{53} & a_{63} & a_{54} & a_{64}
\end{bmatrix}
-
\begin{bmatrix} b_{11} \\ b_{21} \\ b_{31} \\ b_{12} \\ b_{22} \\ b_{32} \end{bmatrix}
\begin{bmatrix} c_{11} & c_{21} & c_{12} & c_{22} \end{bmatrix}
\right\|_F
$$
General Solution Procedure
Minimize
$$
\phi(B, C) = \| A - B \otimes C \|_F = \left\| \tilde A - \mathrm{vec}(B)\,\mathrm{vec}(C)^T \right\|_F
$$
where
$$
\tilde A = \begin{bmatrix}
\mathrm{vec}(A_{11})^T \\
\mathrm{vec}(A_{21})^T \\
\mathrm{vec}(A_{31})^T \\
\mathrm{vec}(A_{12})^T \\
\mathrm{vec}(A_{22})^T \\
\mathrm{vec}(A_{32})^T
\end{bmatrix}.
$$
Solution: Compute the SVD $U^T \tilde A V = \Sigma$ and set
$$
\mathrm{vec}(B^{(opt)}) = \sqrt{\sigma_1}\, U(:, 1), \qquad \mathrm{vec}(C^{(opt)}) = \sqrt{\sigma_1}\, V(:, 1).
$$
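This procedure is a few lines of NumPy. The sketch below (mine, not from the talk) builds the rearranged matrix with the block ordering shown above and reads B and C off the leading singular triplet:

    import numpy as np

    def rearrange(A, M, N, p, q):
        # Rearrangement: one row per block, ordered (1,1), (2,1), ..., (M,1),
        # (1,2), ... to match vec(B); each row is vec(A_ij)^T for the
        # p-by-q block A_ij.
        Atilde = np.empty((M * N, p * q))
        r = 0
        for j in range(N):
            for i in range(M):
                Atilde[r] = A[i*p:(i+1)*p, j*q:(j+1)*q].reshape(-1, order="F")
                r += 1
        return Atilde

    def nearest_kron(A, M, N, p, q):
        # Best B (M-by-N), C (p-by-q) minimizing ||A - kron(B, C)||_F:
        # the nearest rank-1 matrix to rearrange(A), reshaped back.
        U, s, Vt = np.linalg.svd(rearrange(A, M, N, p, q), full_matrices=False)
        B = np.sqrt(s[0]) * U[:, 0].reshape(M, N, order="F")
        C = np.sqrt(s[0]) * Vt[0].reshape(p, q, order="F")
        return B, C

    # Sanity check: an exact Kronecker product is recovered exactly.
    rng = np.random.default_rng(1)
    B0, C0 = rng.standard_normal((3, 2)), rng.standard_normal((2, 2))
    B, C = nearest_kron(np.kron(B0, C0), 3, 2, 2, 2)
    assert np.allclose(np.kron(B, C), np.kron(B0, C0))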
Lanczos SVD Algorithm

Need to compute the dominant eigenvector $v_1$ of $A^T A$ and the dominant eigenvector $u_1$ of $AA^T$. The power method approach...

    b = initial guess of v1;  c = initial guess of u1;  s = cᵀAb
    while ( ‖Ab − sc‖₂ ≈ ‖Av1 − σ1u1‖₂ is too big )
        c = Ab;   c = c/‖c‖₂
        b = Aᵀc;  b = b/‖b‖₂
        s = cᵀAb
    end
The Lanczos method is better than this because it uses more than just the most recent b and c vectors. It too lives off of matrix-vector products, i.e., is "sparse friendly."
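A Python rendering of the loop above, hedged: it is a bare sketch with a fixed iteration count standing in for the residual test, and it touches A only through matrix-vector products, which is what makes the approach sparse friendly. A serious code would use Lanczos bidiagonalization (e.g., PROPACK):

    import numpy as np

    def dominant_triplet(matvec, rmatvec, n, iters=200):
        # Power iteration for the leading singular triplet (sigma1, u1, v1).
        # Only products with A (matvec) and A^T (rmatvec) are required.
        b = np.random.default_rng(2).standard_normal(n)
        b /= np.linalg.norm(b)
        for _ in range(iters):
            c = matvec(b)
            c /= np.linalg.norm(c)
            b = rmatvec(c)
            b /= np.linalg.norm(b)
        return c @ matvec(b), c, b   # sigma1, u1, v1 (approximately)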
The Nearest KP-rank r Problem

Use block Lanczos. E.g., to minimize
$$
\| A - B_1 \otimes C_1 - B_2 \otimes C_2 - B_3 \otimes C_3 \|_F,
$$
use block Lanczos SVD with block width 3 and set
$$
\mathrm{vec}(B_i^{(opt)}) = \sqrt{\sigma_i}\, U(:, i), \qquad \mathrm{vec}(C_i^{(opt)}) = \sqrt{\sigma_i}\, V(:, i), \qquad i = 1{:}3.
$$

The Complete KP-SVD
Given:
$$
A = \begin{bmatrix} A_{11} & \cdots & A_{1N} \\ \vdots & \ddots & \vdots \\ A_{M1} & \cdots & A_{MN} \end{bmatrix},
\qquad A_{ij} \in \mathbb{R}^{p \times q}.
$$
Form $\tilde A$ ($MN$-by-$pq$) and apply the LAPACK SVD:
$$
\tilde A = \sum_{i=1}^{r_A} \sigma_i u_i v_i^T.
$$
Then:
$$
A = \sum_{i=1}^{r_A} \sigma_i \cdot \mathrm{reshape}(u_i, M, N) \otimes \mathrm{reshape}(v_i, p, q).
$$
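In code, the complete KP-SVD is one dense SVD plus reshaping. A sketch (mine), reusing the rearrange helper from the earlier nearest-KP sketch:

    import numpy as np

    def kp_svd(A, M, N, p, q, r):
        # Leading r terms of the KP-SVD, A ~ sum_k sigma_k * kron(B_k, C_k),
        # via the SVD of the rearranged matrix.
        U, s, Vt = np.linalg.svd(rearrange(A, M, N, p, q), full_matrices=False)
        return [(s[k],
                 U[:, k].reshape(M, N, order="F"),
                 Vt[k].reshape(p, q, order="F")) for k in range(r)]

    # X_opt = sum(sigma * np.kron(Bk, Ck) for sigma, Bk, Ck in kp_svd(...))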
The Theorems Follow From This

$$
A \iff \tilde A
$$
$$
A = \sum_{i=1}^{r_A} \sigma_i\, B_i \otimes C_i \iff \tilde A = \sum_{i=1}^{r_A} \sigma_i u_i v_i^T
$$

A Related Problem
Problem. Find X and Y to minimize
$$
\| A - (X \otimes Y - Y \otimes X) \|_F.
$$

Solution. Find vectors x and y so that
$$
\| \tilde A - (x y^T - y x^T) \|_F
$$
is minimized and reshape x and y to get $X^{(opt)}$ and $Y^{(opt)}$.

The Schur decomposition of $\tilde A - \tilde A^T$ is involved.

Another Related Problem
Problem. Find X to minimize
$$
\| A - X \otimes X \|_F.
$$

Solution. Find a vector x so that
$$
\| \tilde A - x x^T \|_F
$$
is minimized and reshape to get $X^{(opt)}$.

The Schur decomposition of $\tilde A + \tilde A^T$ is involved.

A Much More Difficult Problem
$$
\min_{B, C, D} \; \| A - B \otimes C \otimes D \|_F
$$

Computational multilinear algebra is filled with problems like this.

Nearest KP Preconditioners

Main Idea
(i) Suppose A is an N-by-N block matrix with p-by-p blocks.

(ii) We need to solve $Ax = b$. Ordinarily this costs $O(N^3 p^3)$.

(iii) A system of the form
$$
(B_1 \otimes C_1 + B_2 \otimes C_2)\, z = r
$$
can be solved in $O(N^3 + p^3)$ time. Hint: it reshapes to $C_1 Z B_1^T + C_2 Z B_2^T = R$. (A one-term code sketch follows this list.)

(iv) If
$$
B_1 \otimes C_1 + B_2 \otimes C_2 \approx A,
$$
we have a potential preconditioner.
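To make (iii) concrete, here is a one-term sketch in NumPy (mine, not the talk's code). A single KP system reshapes to a matrix equation costing two small solves; the two-term system is the generalized Sylvester equation in the hint, solvable at the same cost with a QZ decomposition:

    import numpy as np

    def solve_kron(B, C, r):
        # (B kron C) z = r  <=>  C Z B^T = R, with z = vec(Z), r = vec(R).
        # Two small solves: O(N^3 + p^3) instead of O(N^3 p^3).
        N, p = B.shape[0], C.shape[0]
        R = r.reshape(p, N, order="F")
        Z = np.linalg.solve(C, np.linalg.solve(B, R.T).T)   # Z = C^{-1} R B^{-T}
        return Z.reshape(-1, order="F")

    rng = np.random.default_rng(3)
    B, C = rng.standard_normal((6, 6)), rng.standard_normal((4, 4))
    r = rng.standard_normal(24)
    assert np.allclose(np.kron(B, C) @ solve_kron(B, C, r), r)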
A Block Toeplitz De-Blurring Problem

(Nagy, O'Leary, Kamm (1998))

Need to solve a large block Toeplitz system $Tx = b$.

Preconditioner:
$$
T \approx T_1 \otimes T_2
$$

Can solve the nearest KP problem with the constraint that the factor matrices $T_1$ and $T_2$ are Toeplitz.

A Poisson-Related Problem
Poisson's equation on a rectangle with a regular (M+1)-by-(N+1) grid discretizes to
$$
Au = (I_M \otimes T_N + T_M \otimes I_N)\, u = f,
$$
where the T's are 1-2-1 tridiagonals. Can be solved very fast.
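A NumPy sketch of the fast solve (mine; a production code would use fast sine transforms rather than dense eigenvector matrices). Diagonalizing both tridiagonals reduces the system to a pointwise division:

    import numpy as np

    def second_diff(n):
        # The (-1, 2, -1) second-difference tridiagonal from the discrete
        # Laplacian (symmetric positive definite).
        return 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)

    def poisson_solve(TM, TN, F):
        # Solve (I_M kron T_N + T_M kron I_N) vec(U) = vec(F), i.e. the
        # matrix equation T_N U + U T_M = F, for symmetric T_M, T_N.
        lamN, QN = np.linalg.eigh(TN)
        lamM, QM = np.linalg.eigh(TM)
        Ft = QN.T @ F @ QM
        return QN @ (Ft / (lamN[:, None] + lamM[None, :])) @ QM.T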
A new method for the Navier-Stokes problem being developed by Diamessis and Escobar-Vargas leads to a linear system whose highly structured A-matrix has KP-rank $r_A = 16$.

Looking for a KP-preconditioner M of the form
$$
M = B_1 \otimes C_1 + B_2 \otimes C_2.
$$

Some Constrained Nearest KP Problems
Joint with Stefan Ragnarsson

NOT Inheriting Structure
In the problem
$$
\min_{B, C} \; \| A - B \otimes C \|_F,
$$
sometimes B and C fail to inherit A's special attributes:

If A is stochastic (orthogonal), then B and C are not quite stochastic (orthogonal).

KP Approximation of Stochastic Matrices
If $A \in \mathbb{R}^{n \times n}$, $B \in \mathbb{R}^{n_1 \times n_1}$, and $C \in \mathbb{R}^{n_2 \times n_2}$ are stochastic and
$$
A = B \otimes C,
$$
then each A-entry has the form $b_{ij} c_{pq}$. The states are clustered into groups $G_1, \ldots, G_{n_1}$, each of size $n_2$, and
$$
b_{ij} = \mathrm{prob}(G_j \to G_i), \qquad c_{pq} = \mathrm{prob}(\text{state } q \to \text{state } p \text{ within any group}).
$$

References:
"Aggregation of Stochastic Automata Networks with Replicas" (A. Benoit, L. Brenner, P. Fernandes, B. Plateau)

"Analyzing Markov Chains Using Kronecker Products" (T. Dayar)

A Bilinear Optimization Strategy
Given an initial guess C...

Repeat Until Converged:
$$
\min_{B \;\text{stochastic}} \; \| A - B \otimes C \|_F \qquad (C \text{ fixed})
$$
$$
\min_{C \;\text{stochastic}} \; \| A - B \otimes C \|_F \qquad (B \text{ fixed})
$$
end

These are linear, constrained least squares problems.

Reshaping
The problem
$$
\min_{C \;\text{stochastic}} \; \| A - B \otimes C \|_F \qquad (B \text{ fixed})
$$
is equivalent to
$$
\min \; \| Mx - f \|_2 \qquad \text{subject to} \quad x \ge 0, \;\; Ex = e,
$$
where $x = \mathrm{vec}(C)$, $f = \mathrm{vec}(\tilde A)$, $M = I \otimes \mathrm{vec}(B)$, $e = \mathrm{ones}(m, 1)$ with $m = n_2$, and $E = I_m \otimes e^T$. The linear constraint forces C (a.k.a. x) to have unit column sums.
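A hedged SciPy sketch of this constrained step (the function name and the SLSQP solver choice are mine; SLSQP is simple but not the fastest option at scale):

    import numpy as np
    from scipy.optimize import minimize

    def stochastic_ls(M, f, n2):
        # min ||M x - f||_2 subject to x >= 0 and Ex = e (unit column sums),
        # with x = vec(C) for an n2-by-n2 matrix C.
        E = np.kron(np.eye(n2), np.ones((1, n2)))     # E = I kron e^T
        x0 = np.full(n2 * n2, 1.0 / n2)               # uniform stochastic start
        res = minimize(lambda x: 0.5 * np.sum((M @ x - f) ** 2), x0,
                       jac=lambda x: M.T @ (M @ x - f),
                       bounds=[(0.0, None)] * (n2 * n2),
                       constraints=[{"type": "eq", "fun": lambda x: E @ x - 1.0}],
                       method="SLSQP")
        return res.x.reshape(n2, n2, order="F")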
Example

If
$$
A = \begin{bmatrix}
0.2444 & 0.1950 & 0.2129 & 0.1850 & 0.1202 & 0.1682 \\
0.2367 & 0.2712 & 0.2526 & 0.2573 & 0.1857 & 0.2249 \\
0.1811 & 0.2348 & 0.2236 & 0.1415 & 0.2900 & 0.1481 \\
0.1198 & 0.0949 & 0.1105 & 0.1147 & 0.0822 & 0.1802 \\
0.1422 & 0.1091 & 0.0938 & 0.1709 & 0.1405 & 0.1570 \\
0.0757 & 0.0949 & 0.1065 & 0.1306 & 0.1813 & 0.1217
\end{bmatrix},
$$
the matrices B and C obtained by the unconstrained SVD minimization of $\| A - B \otimes C \|_F$ are approximately stochastic:
$$
B = \begin{bmatrix} 0.3301 & 0.2449 & 0.3246 \\ 0.3925 & 0.3542 & 0.3657 \\ 0.2611 & 0.3993 & 0.2963 \end{bmatrix},
\qquad
C = \begin{bmatrix} 0.6842 & 0.5890 \\ 0.3158 & 0.4320 \end{bmatrix}.
$$

Example (Cont'd)

Using the B and C above as the initial guess for the successive nonnegative least squares iteration, we get the "exactly" stochastic matrices
$$
B_{LS} = \begin{bmatrix} 0.3359 & 0.2449 & 0.3289 \\ 0.3984 & 0.3552 & 0.3704 \\ 0.2658 & 0.3998 & 0.3008 \end{bmatrix},
\qquad
C_{LS} = \begin{bmatrix} 0.6823 & 0.5776 \\ 0.3177 & 0.4224 \end{bmatrix}.
$$

Work per iteration is roughly quadratic in the dimension of A. (MathWorks Optimization Toolbox and PROPACK (R.M. Larsen).)

A Note on Ordering
This problem assumes that we know how to group the states:
$$
\min_{B,\, C \;\text{stochastic}} \; \| A - B \otimes C \|_F
$$

This one doesn't:
$$
\min_{\substack{B,\, C \;\text{stochastic} \\ P \;\text{permutation}}} \; \| P A P^T - B \otimes C \|_F
$$

The Inverse Times Table Problem
Suppose we have the stationary vector $x_A$ for A, i.e.,
$$
A x_A = x_A, \qquad x_A > 0, \qquad \mathrm{sum}(x_A) = 1.
$$
Then
$$
P A P^T = B \otimes C, \quad B x_B = x_B, \quad C x_C = x_C \;\Rightarrow\; P x_A = x_B \otimes x_C.
$$
If we know $x_A$, can we figure out P so that $P x_A$ is the Kronecker product of two smaller vectors?

Inverse TT Cont'd
Suppose
$$
x_A^T = [\; 2 \;\; 3 \;\; 4 \;\; 6 \;\; 7 \;\; 9 \;\; 10 \;\; 12 \;\; 16 \;\; 18 \;\; 21 \;\; 24 \;\; 27 \;\; 30 \;\; 32 \;\; 36 \;\; 56 \;\; 63 \;\; 80 \;\; 90 \;]
$$
and we seek a permutation $P \in \mathbb{R}^{20 \times 20}$ so that
$$
P x_A = \begin{bmatrix} b_1 \\ b_2 \\ b_3 \\ b_4 \end{bmatrix} \otimes \begin{bmatrix} c_1 \\ c_2 \\ c_3 \\ c_4 \\ c_5 \end{bmatrix}.
$$
What are $x_B$ and $x_C$?

Inverse TT Cont'd
$$
x_A^T = [\; 2 \;\; 3 \;\; 4 \;\; 6 \;\; 7 \;\; 9 \;\; 10 \;\; 12 \;\; 16 \;\; 18 \;\; 21 \;\; 24 \;\; 27 \;\; 30 \;\; 32 \;\; 36 \;\; 56 \;\; 63 \;\; 80 \;\; 90 \;]
$$
$$
\mathrm{reshape}(P x_A, 5, 4) =
\begin{bmatrix} c_1 \\ c_2 \\ c_3 \\ c_4 \\ c_5 \end{bmatrix}
\begin{bmatrix} b_1 & b_2 & b_3 & b_4 \end{bmatrix}
=
\begin{bmatrix}
24 & 9 & 3 & 27 \\
56 & 21 & 7 & 63 \\
80 & 30 & 10 & 90 \\
16 & 6 & 2 & 18 \\
32 & 12 & 4 & 36
\end{bmatrix},
$$
so $x_C = [\,3 \;\; 7 \;\; 10 \;\; 2 \;\; 4\,]^T$ and $x_B = [\,8 \;\; 3 \;\; 1 \;\; 9\,]^T$.

Quick Aside: Nearest Orthogonal KP
$$
A = \begin{bmatrix}
-.447 & -.032 & -.528 & .384 & .031 & .406 & .308 & .006 & .330 \\
-.497 & -.243 & .464 & .367 & .183 & -.308 & .320 & .187 & -.274 \\
-.205 & .654 & .150 & .105 & -.494 & -.134 & .138 & -.442 & -.132 \\
-.381 & -.021 & -.406 & -.006 & -.004 & -.019 & -.562 & -.022 & -.609 \\
-.404 & -.167 & .327 & -.021 & -.003 & .001 & -.562 & -.290 & .548 \\
-.107 & .530 & .131 & -.003 & .024 & .011 & -.218 & .777 & .191 \\
-.299 & -.022 & -.342 & -.559 & .000 & -.608 & .236 & .048 & .225 \\
-.298 & -.157 & .254 & -.581 & -.274 & .568 & .208 & .105 & -.175 \\
-.104 & .419 & .097 & -.233 & .802 & .165 & .059 & -.257 & -.087
\end{bmatrix}
$$

$$
\approx (\text{3-by-3 orthogonal } B) \otimes (\text{3-by-3 orthogonal } C)
$$

Nearest Orthogonal KP (Cont'd)
The unconstrained KP-SVD minimization gives
$$
B_0 = \begin{bmatrix} 0.7042 & -0.5335 & -0.2713 \\ 0.5563 & 0.0030 & 0.4618 \\ 0.4412 & 0.8460 & -0.1679 \end{bmatrix},
\qquad
C_0 = \begin{bmatrix} -0.7433 & -0.0069 & -0.4633 \\ -0.7671 & -0.3743 & 0.3931 \\ -0.2822 & 1.0394 & 0.1272 \end{bmatrix},
$$
but
$$
\| B_0^T B_0 - I_3 \|_2 \approx \| C_0^T C_0 - I_3 \|_2 \approx .643.
$$

After 2 iterations of alternating bilinear clean-up:
$$
B_{LS} = \begin{bmatrix} 0.7025 & -0.5305 & -0.4745 \\ 0.5607 & 0.0019 & 0.8280 \\ 0.4383 & 0.8477 & -0.2988 \end{bmatrix},
\qquad
C_{LS} = \begin{bmatrix} -0.6701 & -0.0123 & -0.7422 \\ -0.6962 & -0.3365 & 0.6341 \\ -0.2576 & 0.9416 & 0.2169 \end{bmatrix},
$$
giving
$$
\| B_{LS}^T B_{LS} - I_3 \|_2 \approx \| C_{LS}^T C_{LS} - I_3 \|_2 \approx 10^{-4}.
$$

Nearest Orthogonal KP (Cont'd)
The problem
$$
\min_{C \;\text{orthogonal}} \; \| A - B \otimes C \|_F \qquad (B \text{ fixed})
$$
is equivalent to an orthogonal Procrustes problem with a simple SVD solution:
$$
U^T \left( \sum_i \sum_j b_{ij} A_{ij} \right) V = \Sigma, \qquad C_{opt} = U V^T.
$$

Bojanczyk and Lutoborski (2003) solved a related problem.
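A short NumPy sketch of this B-fixed step (names are mine; A_blocks[i][j] is assumed to hold the block A_ij):

    import numpy as np

    def procrustes_C(A_blocks, B):
        # The orthogonal C minimizing ||A - kron(B, C)||_F maximizes
        # tr(C^T M) with M = sum_ij b_ij * A_ij -- classic Procrustes.
        M = sum(B[i, j] * A_blocks[i][j]
                for i in range(B.shape[0]) for j in range(B.shape[1]))
        U, _, Vt = np.linalg.svd(M)
        return U @ Vt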
A Wireless Bandwidth Problem

Given $H_1, \ldots, H_N \in \mathbb{R}^{p \times q}$, find $C \in \mathbb{R}^{p \times r}$ and $W \in \mathbb{R}^{q \times r}$ with orthonormal columns so that
$$
\psi(C, W) = \sum_{k=1}^{N} \sigma_1(C^T H_k W)^2
$$
is maximized.

(Joint with J. Nsenga (CETIC), S. Ragnarsson)

A Wireless Bandwidth Problem Cont'd
$$
\psi(C, W) = \sum_{k=1}^{N} \| C^T H_k W \|_2^2
\;\le\; \sum_{k=1}^{N} \| C^T H_k W \|_F^2
= \sum_{k=1}^{N} \| (C \otimes W)^T \mathrm{vec}(H_k) \|_2^2
$$
$$
= \mathrm{tr}\big( (C \otimes W)^T S\, (C \otimes W) \big) = \tilde\psi(C, W),
$$
where
$$
S = \sum_{k=1}^{N} \mathrm{vec}(H_k)\, \mathrm{vec}(H_k)^T.
$$

A Wireless Bandwidth Problem Cont'd
Solution Approach. If
$$
S \approx S_1 \otimes S_2,
$$
then
$$
\tilde\psi(C, W) \approx \mathrm{tr}\big( (C \otimes W)^T (S_1 \otimes S_2)(C \otimes W) \big)
= \mathrm{tr}(C^T S_1 C) \cdot \mathrm{tr}(W^T S_2 W).
$$

The trace of $Q^T M Q$ with $Q \in \mathbb{R}^{n \times r}$ (orthonormal columns) is maximized if $\mathrm{ran}(Q)$ is the span of the r-dimensional dominant invariant subspace of M. An "easy computation."
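That computation, as a small NumPy sketch (assuming M symmetric):

    import numpy as np

    def dominant_subspace(M, r):
        # Orthonormal Q maximizing tr(Q^T M Q): the eigenvectors of the
        # r largest eigenvalues of symmetric M (eigh sorts ascending).
        _, Q = np.linalg.eigh(M)
        return Q[:, -r:]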
Connections to Computational Multilinear Algebra

A $2^d$-by-$2^d$ $Hx = \lambda x$ Problem

$$
H = \sum_{i,j}^{d} t_{ij}\, H_i^T H_j + \sum_{i,j,k,\ell}^{d} v_{ijk\ell}\, H_i^T H_j^T H_k H_\ell
$$

$$
H_i = I_{2^{i-1}} \otimes \begin{bmatrix} 0 & 1 \\ 0 & 0 \end{bmatrix} \otimes I_{2^{d-i}}
$$

$$
T = T(1{:}d, 1{:}d), \qquad \mathcal{V} = \mathcal{V}(1{:}d, 1{:}d, 1{:}d, 1{:}d)
$$

Matrix T is symmetric. Tensor $\mathcal{V} = (v_{ijk\ell})$ also has symmetries.

The H-Matrix
[spy plot of the 1024-by-1024 H-matrix; nz = 104703]
$$
\mathrm{nzeros} = \left( \frac{1}{64} d^4 - \frac{3}{32} d^3 + \frac{27}{64} d^2 - \frac{11}{32} d + 1 \right) 2^d - 1
$$

Some Fourth-Order Tensor Symmetries
The tensor $\mathcal{V}$ in our problem frequently has these symmetries:
$$
\mathcal{V}(i,j,k,\ell) \;=\; \mathcal{V}(j,i,k,\ell) \;=\; \mathcal{V}(i,j,\ell,k) \;=\; \mathcal{V}(k,\ell,i,j)
$$

Let's Flatten $\mathcal{V}$ ...
$$
\tilde{\mathcal{V}} = \begin{bmatrix}
\mathcal{V}(:,:,1,1) & \mathcal{V}(:,:,1,2) & \mathcal{V}(:,:,1,3) & \mathcal{V}(:,:,1,4) \\
\mathcal{V}(:,:,2,1) & \mathcal{V}(:,:,2,2) & \mathcal{V}(:,:,2,3) & \mathcal{V}(:,:,2,4) \\
\mathcal{V}(:,:,3,1) & \mathcal{V}(:,:,3,2) & \mathcal{V}(:,:,3,3) & \mathcal{V}(:,:,3,4) \\
\mathcal{V}(:,:,4,1) & \mathcal{V}(:,:,4,2) & \mathcal{V}(:,:,4,3) & \mathcal{V}(:,:,4,4)
\end{bmatrix}
$$
and see what happens to
$$
\mathcal{V}(i,j,k,\ell) = \mathcal{V}(j,i,k,\ell) = \mathcal{V}(i,j,\ell,k) = \mathcal{V}(k,\ell,i,j).
$$

Flattened Symmetries
Block symmetry:
$$
\mathcal{V}(i,j,k,\ell) = \mathcal{V}(i,j,\ell,k) \;\Rightarrow\; \tilde{\mathcal{V}}_{k,\ell} = \tilde{\mathcal{V}}_{\ell,k}
$$

Symmetric blocks:
$$
\mathcal{V}(i,j,k,\ell) = \mathcal{V}(j,i,k,\ell) \;\Rightarrow\; \tilde{\mathcal{V}}_{k,\ell} = \tilde{\mathcal{V}}_{k,\ell}^T
$$

Perfect shuffle symmetry:
$$
\mathcal{V}(i,j,k,\ell) = \mathcal{V}(k,\ell,i,j) \;\Rightarrow\; \Pi^T \tilde{\mathcal{V}} \Pi = \tilde{\mathcal{V}},
$$
where $\Pi$ is a perfect shuffle permutation.

A Sample $\tilde{\mathcal{V}}$
$$
\begin{bmatrix}
280 & 206 & 100 & 206 & 182 & 187 & 100 & 187 & 296 \\
206 & 328 & 188 & 182 & 138 & 148 & 187 & 244 & 143 \\
100 & 188 & 176 & 187 & 148 & 122 & 296 & 143 & 326 \\
206 & 182 & 187 & 328 & 138 & 244 & 188 & 148 & 143 \\
182 & 138 & 148 & 138 & 312 & 192 & 148 & 192 & 212 \\
187 & 148 & 122 & 244 & 192 & 272 & 143 & 212 & 200 \\
100 & 187 & 296 & 188 & 148 & 143 & 176 & 122 & 326 \\
187 & 244 & 143 & 148 & 192 & 212 & 122 & 272 & 200 \\
296 & 143 & 326 & 143 & 212 & 200 & 326 & 200 & 280
\end{bmatrix}
$$

The KP-SVD of $\tilde{\mathcal{V}}$ is Highly Structured
$$
\tilde{\mathcal{V}} = \sum_{i=1}^{r} \sigma_i\, B_i \otimes B_i, \qquad B_i \;\text{symmetric}
$$

If $\tilde{\mathcal{V}} \approx \sigma_1 B_1 \otimes B_1$, then
$$
\mathcal{V}(i,j,k,\ell) \approx \sigma_1 B_1(i,j) B_1(k,\ell)
$$
and...
$$
\sum_{i=1}^{d} \sum_{j=1}^{d} \sum_{k=1}^{d} \sum_{\ell=1}^{d} \mathcal{V}(i,j,k,\ell)\, H_i^T H_j^T H_k H_\ell
\;\approx\;
\sigma_1 \sum_{i=1}^{d} \sum_{j=1}^{d} \sum_{k=1}^{d} \sum_{\ell=1}^{d} B_1(i,j) B_1(k,\ell)\, H_i^T H_j^T H_k H_\ell
$$
$$
= \sigma_1 \left[ \sum_{k=1}^{d} \sum_{\ell=1}^{d} B_1(k,\ell)\, H_k H_\ell \right]^T \left[ \sum_{k=1}^{d} \sum_{\ell=1}^{d} B_1(k,\ell)\, H_k H_\ell \right],
$$
and H-manipulation reduces from $O(d^4)$ to $O(d^2)$.

Just-For-Fun: log(det(A))

The Logarithm of the Determinant
Suppose $A \in \mathbb{R}^{n \times n}$ is positive definite with eigenvalues $\lambda_1, \ldots, \lambda_n$.

The problem of computing
$$
\log(\det(A)) = \log(\lambda_1 \cdots \lambda_n) = \sum_{k=1}^{n} \log(\lambda_k)
$$
can arise in certain maximum likelihood estimation settings.

Solution Approaches
(i) If n is modest, then compute the Cholesky factorization $A = GG^T$ and use (a code sketch follows this list)
$$
\log(\det(A)) = \log(\det(GG^T)) = \log(\det(G)^2) = 2\log(g_{11} \cdots g_{nn}) = 2 \sum_{k=1}^{n} \log(g_{kk}).
$$

(ii) If A is large and sparse, then Monte Carlo. See Barry and Pace (1999) and also M. McCourt (2008).
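The sketch promised in (i), in NumPy (mine; numpy.linalg.slogdet does the same job, but the Cholesky route makes the no-overflow property explicit):

    import numpy as np

    def logdet_chol(A):
        # log(det(A)) for symmetric positive definite A via Cholesky A = G G^T:
        # 2*sum(log(g_kk)) never forms det(A) itself, so it cannot overflow.
        G = np.linalg.cholesky(A)
        return 2.0 * np.sum(np.log(np.diag(G)))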
Nearest KP Approach

Suppose $n = n_1 n_2$ and $B \otimes C$ is the nearest KP to A, with $B \in \mathbb{R}^{n_1 \times n_1}$ and $C \in \mathbb{R}^{n_2 \times n_2}$. It can be shown that B and C are symmetric positive definite, and
$$
\log(\det(A)) \approx \log(\det(B \otimes C)) = \log\!\big( \det(B)^{n_2} \det(C)^{n_1} \big) = n_2 \log(\det(B)) + n_1 \log(\det(C)).
$$

I.e., the log(det(A)) problem breaks down into a pair of (much) smaller log-det problems.
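Combining the earlier nearest_kron and logdet_chol sketches gives a tiny illustration of this splitting (my code; the sign flip resolves the scale ambiguity in the SVD factors, since (-B) ⊗ (-C) = B ⊗ C):

    import numpy as np

    def logdet_kron_approx(A, n1, n2):
        # Approximate log(det(A)) via the nearest Kronecker product B kron C:
        # log(det(B kron C)) = n2*log(det(B)) + n1*log(det(C)).
        B, C = nearest_kron(A, n1, n1, n2, n2)
        if np.trace(B) < 0:          # fix signs so B, C are positive definite
            B, C = -B, -C
        return n2 * logdet_chol(B) + n1 * logdet_chol(C)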
What If...

What if $A \approx B \otimes C$ isn't good enough? What if
$$
A \approx (B_1 \otimes C_1)(B_2 \otimes C_2)(B_3 \otimes C_3)
$$
is good enough, where $B_i \in \mathbb{R}^{m_i \times m_i}$ and $C_i \in \mathbb{R}^{(n/m_i) \times (n/m_i)}$ for $i = 1{:}3$?

Then
$$
\log(\det(A)) \approx \sum_{i=1}^{3} \big( (n/m_i) \log(\det(B_i)) + m_i \log(\det(C_i)) \big).
$$

Conclusion
The KP-SVD can serve as a bridge from small-n problems to large-n problems and, more generally, from numerical linear algebra to numerical multilinear algebra.