The Kronecker Product SVD
Charles Van Loan
October 19, 2009

The Kronecker Product
$B \otimes C$ is a block matrix whose $(i,j)$ block is $b_{ij}C$. E.g.,

$$
\begin{bmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \end{bmatrix} \otimes C
=
\begin{bmatrix} b_{11}C & b_{12}C \\ b_{21}C & b_{22}C \end{bmatrix}
$$
Replicated Block Structure

The KP-SVD
If
$$
A = \begin{bmatrix} A_{11} & \cdots & A_{1N} \\ \vdots & \ddots & \vdots \\ A_{M1} & \cdots & A_{MN} \end{bmatrix},
\qquad A_{ij} \in \mathbb{R}^{p \times q},
$$
then there exists a positive integer $r_A$ with $r_A \le MN$ so that

$$
A = \sum_{k=1}^{r_A} \sigma_k\, B_k \otimes C_k, \qquad r_A = \mathrm{rank}_{KP}(A).
$$

The KP-singular values: $\sigma_1 \ge \cdots \ge \sigma_{r_A} > 0$. The $B_k \in \mathbb{R}^{M \times N}$ and $C_k \in \mathbb{R}^{p \times q}$ satisfy $\langle B_i, B_j \rangle_F = \delta_{ij}$ and $\langle C_i, C_j \rangle_F = \delta_{ij}$ (they are reshaped singular vectors).
Let $r$ be a positive integer that satisfies $r \le r_A$. The problem
$$
\min \; \| A - X \|_F \quad \text{subject to} \quad \mathrm{rank}_{KP}(X) = r
$$
is solved by setting
$$
X^{(opt)} = \sum_{k=1}^{r} \sigma_k\, B_k \otimes C_k.
$$

Talk Outline
1. Survey of Essential KP Properties. Just enough to get through the talk.
2. Computing the KP-SVD. It's an SVD computation.
3. Nearest KP Preconditioners. Solving KP systems is fast.
4. Some Constrained Nearest KP Problems. Nearest (Markov) ⊗ (Markov).
5. Multilinear Connections. A low-rank approximation of a 4-dimensional tensor.
6. Off-The-Wall / Just-For-Fun. Computing log(det(A)) for large sparse positive definite A.

Essential KP Properties

Every $b_{ij}c_{k\ell}$ Shows Up
$$
\begin{bmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \end{bmatrix}
\otimes
\begin{bmatrix} c_{11} & c_{12} & c_{13} \\ c_{21} & c_{22} & c_{23} \\ c_{31} & c_{32} & c_{33} \end{bmatrix}
=
\begin{bmatrix}
b_{11}c_{11} & b_{11}c_{12} & b_{11}c_{13} & b_{12}c_{11} & b_{12}c_{12} & b_{12}c_{13} \\
b_{11}c_{21} & b_{11}c_{22} & b_{11}c_{23} & b_{12}c_{21} & b_{12}c_{22} & b_{12}c_{23} \\
b_{11}c_{31} & b_{11}c_{32} & b_{11}c_{33} & b_{12}c_{31} & b_{12}c_{32} & b_{12}c_{33} \\
b_{21}c_{11} & b_{21}c_{12} & b_{21}c_{13} & b_{22}c_{11} & b_{22}c_{12} & b_{22}c_{13} \\
b_{21}c_{21} & b_{21}c_{22} & b_{21}c_{23} & b_{22}c_{21} & b_{22}c_{22} & b_{22}c_{23} \\
b_{21}c_{31} & b_{21}c_{32} & b_{21}c_{33} & b_{22}c_{31} & b_{22}c_{32} & b_{22}c_{33}
\end{bmatrix}
$$

Hierarchical
$$
A = \begin{bmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \end{bmatrix}
\otimes
\begin{bmatrix} c_{11} & c_{12} & c_{13} & c_{14} \\ c_{21} & c_{22} & c_{23} & c_{24} \\ c_{31} & c_{32} & c_{33} & c_{34} \\ c_{41} & c_{42} & c_{43} & c_{44} \end{bmatrix}
\otimes
\begin{bmatrix} d_{11} & d_{12} & d_{13} \\ d_{21} & d_{22} & d_{23} \\ d_{31} & d_{32} & d_{33} \end{bmatrix}
$$

$A$ is a 2-by-2 block matrix whose entries are 4-by-4 block matrices whose entries are 3-by-3 matrices.

Algebra
$$
\begin{aligned}
(B \otimes C)^T &= B^T \otimes C^T \\
(B \otimes C)^{-1} &= B^{-1} \otimes C^{-1} \\
(B \otimes C)(D \otimes F) &= BD \otimes CF \\
B \otimes (C \otimes D) &= (B \otimes C) \otimes D
\end{aligned}
$$

No: $B \otimes C \ne C \otimes B$.

Yes: $B \otimes C = (\text{Perfect Shuffle})\,(C \otimes B)\,(\text{Perfect Shuffle})^T$

The vec Operation
Turns matrices into vectors by stacking columns:
$$
X = \begin{bmatrix} 1 & 10 \\ 2 & 20 \\ 3 & 30 \end{bmatrix}
\;\Rightarrow\;
\mathrm{vec}(X) = \begin{bmatrix} 1 \\ 2 \\ 3 \\ 10 \\ 20 \\ 30 \end{bmatrix}
$$
Important special case:
$$
\mathrm{vec}(\text{rank-1 matrix}) = \mathrm{vec}\!\left( \begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix} \begin{bmatrix} 1 & 10 \end{bmatrix} \right)
= \begin{bmatrix} 1 \\ 10 \end{bmatrix} \otimes \begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix}
$$

Reshaping
The matrix equation
$$
Y = CXB^T
$$
can be reshaped into a vector equation
$$
\mathrm{vec}(Y) = (B \otimes C)\,\mathrm{vec}(X).
$$
This implies fast linear equation solving and fast matrix-vector multiplication. (More later.)
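A quick way to convince yourself of this identity is a few lines of NumPy. This sketch (mine, not from the talk) assumes the column-major vec defined above:

    import numpy as np

    def vec(X):
        # Stack the columns of X (column-major), matching vec() on the slides.
        return X.reshape(-1, order="F")

    rng = np.random.default_rng(0)
    B = rng.standard_normal((4, 3))
    C = rng.standard_normal((5, 2))
    X = rng.standard_normal((2, 3))

    Y = C @ X @ B.T                                     # Y = C X B^T
    assert np.allclose(vec(Y), np.kron(B, C) @ vec(X))  # vec(Y) = (B kron C) vec(X)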
Inheriting Structure

If B and C are ...            then B ⊗ C is ...
--------------------------    ------------------------
nonsingular                   nonsingular
lower (upper) triangular      lower (upper) triangular
banded                        block banded
symmetric                     symmetric
positive definite             positive definite
stochastic                    stochastic
Toeplitz                      block Toeplitz
permutations                  a permutation
orthogonal                    orthogonal

Computing the KP-SVD

Warm-Up: The Nearest KP Problem
Given $A \in \mathbb{R}^{m \times n}$ with $m = m_1 m_2$ and $n = n_1 n_2$.

Find $B \in \mathbb{R}^{m_1 \times n_1}$ and $C \in \mathbb{R}^{m_2 \times n_2}$ so that
$$
\phi(B, C) = \| A - B \otimes C \|_F = \min.
$$

A bilinear least squares problem. Fix B (or C) and it becomes linear in C (or B).

Reshaping the Nearest KP Problem
$$
\phi(B, C) = \left\| \begin{bmatrix}
a_{11} & a_{12} & a_{13} & a_{14} \\
a_{21} & a_{22} & a_{23} & a_{24} \\
a_{31} & a_{32} & a_{33} & a_{34} \\
a_{41} & a_{42} & a_{43} & a_{44} \\
a_{51} & a_{52} & a_{53} & a_{54} \\
a_{61} & a_{62} & a_{63} & a_{64}
\end{bmatrix}
-
\begin{bmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \\ b_{31} & b_{32} \end{bmatrix}
\otimes
\begin{bmatrix} c_{11} & c_{12} \\ c_{21} & c_{22} \end{bmatrix}
\right\|_F
$$

$$
= \left\| \begin{bmatrix}
a_{11} & a_{21} & a_{12} & a_{22} \\
a_{31} & a_{41} & a_{32} & a_{42} \\
a_{51} & a_{61} & a_{52} & a_{62} \\
a_{13} & a_{23} & a_{14} & a_{24} \\
a_{33} & a_{43} & a_{34} & a_{44} \\
a_{53} & a_{63} & a_{54} & a_{64}
\end{bmatrix}
-
\begin{bmatrix} b_{11} \\ b_{21} \\ b_{31} \\ b_{12} \\ b_{22} \\ b_{32} \end{bmatrix}
\begin{bmatrix} c_{11} & c_{21} & c_{12} & c_{22} \end{bmatrix}
\right\|_F
$$

!!! Finding the nearest rank-1 matrix is an SVD problem !!!

SVD Primer
$$
A \in \mathbb{R}^{m \times n} \;\Rightarrow\; U^T A V = \Sigma = \mathrm{diag}(\sigma_1, \ldots, \sigma_n)
$$
If $U = [u_1 \,|\, u_2 \,|\, \cdots \,|\, u_m]$ and $V = [v_1 \,|\, v_2 \,|\, \cdots \,|\, v_n]$ then:

• The rank-1 matrix $\sigma_1 u_1 v_1^T$ solves
$$
\min_{\mathrm{rank}(\tilde A) = 1} \| A - \tilde A \|_F.
$$

• $v_1$ is the dominant eigenvector of $A^T A$:
$$
A^T A v_1 = \sigma_1^2 v_1, \qquad A v_1 = \sigma_1 u_1, \qquad \sigma_1 = u_1^T A v_1.
$$

• $u_1$ is the dominant eigenvector of $AA^T$:
$$
AA^T u_1 = \sigma_1^2 u_1, \qquad A^T u_1 = \sigma_1 v_1, \qquad \sigma_1 = v_1^T A^T u_1.
$$

Sol'n: SVD of Permuted A + Reshaping
$$
\phi(B, C) = \left\| \begin{bmatrix}
a_{11} & a_{12} & a_{13} & a_{14} \\
a_{21} & a_{22} & a_{23} & a_{24} \\
a_{31} & a_{32} & a_{33} & a_{34} \\
a_{41} & a_{42} & a_{43} & a_{44} \\
a_{51} & a_{52} & a_{53} & a_{54} \\
a_{61} & a_{62} & a_{63} & a_{64}
\end{bmatrix}
-
\begin{bmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \\ b_{31} & b_{32} \end{bmatrix}
\otimes
\begin{bmatrix} c_{11} & c_{12} \\ c_{21} & c_{22} \end{bmatrix}
\right\|_F
$$

$$
= \left\| \begin{bmatrix}
a_{11} & a_{21} & a_{12} & a_{22} \\
a_{31} & a_{41} & a_{32} & a_{42} \\
a_{51} & a_{61} & a_{52} & a_{62} \\
a_{13} & a_{23} & a_{14} & a_{24} \\
a_{33} & a_{43} & a_{34} & a_{44} \\
a_{53} & a_{63} & a_{54} & a_{64}
\end{bmatrix}
-
\begin{bmatrix} b_{11} \\ b_{21} \\ b_{31} \\ b_{12} \\ b_{22} \\ b_{32} \end{bmatrix}
\begin{bmatrix} c_{11} & c_{21} & c_{12} & c_{22} \end{bmatrix}
\right\|_F
$$
General Solution Procedure
Minimize
$$
\phi(B, C) = \| A - B \otimes C \|_F = \left\| \tilde A - \mathrm{vec}(B)\,\mathrm{vec}(C)^T \right\|_F
$$
where
$$
\tilde A = \begin{bmatrix}
\mathrm{vec}(A_{11})^T \\
\mathrm{vec}(A_{21})^T \\
\mathrm{vec}(A_{31})^T \\
\mathrm{vec}(A_{12})^T \\
\mathrm{vec}(A_{22})^T \\
\mathrm{vec}(A_{32})^T
\end{bmatrix}.
$$
Solution: Compute the SVD $U^T \tilde A V = \Sigma$ and set
$$
\mathrm{vec}(B^{(opt)}) = \sqrt{\sigma_1}\, U(:, 1), \qquad \mathrm{vec}(C^{(opt)}) = \sqrt{\sigma_1}\, V(:, 1).
$$
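This procedure is a few lines of NumPy. The sketch below (mine, not from the talk) builds the rearranged matrix with the block ordering shown above and reads B and C off the leading singular triplet:

    import numpy as np

    def rearrange(A, M, N, p, q):
        # Rearrangement: one row per block, ordered (1,1), (2,1), ..., (M,1),
        # (1,2), ... to match vec(B); each row is vec(A_ij)^T for the
        # p-by-q block A_ij.
        Atilde = np.empty((M * N, p * q))
        r = 0
        for j in range(N):
            for i in range(M):
                Atilde[r] = A[i*p:(i+1)*p, j*q:(j+1)*q].reshape(-1, order="F")
                r += 1
        return Atilde

    def nearest_kron(A, M, N, p, q):
        # Best B (M-by-N), C (p-by-q) minimizing ||A - kron(B, C)||_F:
        # the nearest rank-1 matrix to rearrange(A), reshaped back.
        U, s, Vt = np.linalg.svd(rearrange(A, M, N, p, q), full_matrices=False)
        B = np.sqrt(s[0]) * U[:, 0].reshape(M, N, order="F")
        C = np.sqrt(s[0]) * Vt[0].reshape(p, q, order="F")
        return B, C

    # Sanity check: an exact Kronecker product is recovered exactly.
    rng = np.random.default_rng(1)
    B0, C0 = rng.standard_normal((3, 2)), rng.standard_normal((2, 2))
    B, C = nearest_kron(np.kron(B0, C0), 3, 2, 2, 2)
    assert np.allclose(np.kron(B, C), np.kron(B0, C0))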
Lanczos SVD Algorithm

Need to compute the dominant eigenvector $v_1$ of $A^T A$ and the dominant eigenvector $u_1$ of $AA^T$. The power method approach...

    b = initial guess of v1;  c = initial guess of u1;  s = cᵀAb
    while ( ‖Ab − sc‖₂ ≈ ‖Av1 − σ1u1‖₂ is too big )
        c = Ab;   c = c/‖c‖₂
        b = Aᵀc;  b = b/‖b‖₂
        s = cᵀAb
    end
The Lanczos method is better than this because it uses more than just the most recent b and c vectors. It too lives off of matrix-vector products, i.e., is "sparse friendly."
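A Python rendering of the loop above, hedged: it is a bare sketch with a fixed iteration count standing in for the residual test, and it touches A only through matrix-vector products, which is what makes the approach sparse friendly. A serious code would use Lanczos bidiagonalization (e.g., PROPACK):

    import numpy as np

    def dominant_triplet(matvec, rmatvec, n, iters=200):
        # Power iteration for the leading singular triplet (sigma1, u1, v1).
        # Only products with A (matvec) and A^T (rmatvec) are required.
        b = np.random.default_rng(2).standard_normal(n)
        b /= np.linalg.norm(b)
        for _ in range(iters):
            c = matvec(b)
            c /= np.linalg.norm(c)
            b = rmatvec(c)
            b /= np.linalg.norm(b)
        return c @ matvec(b), c, b   # sigma1, u1, v1 (approximately)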
The Nearest KP-rank r Problem

Use block Lanczos. E.g., to minimize
$$
\| A - B_1 \otimes C_1 - B_2 \otimes C_2 - B_3 \otimes C_3 \|_F,
$$
use block Lanczos SVD with block width 3 and set
$$
\mathrm{vec}(B_i^{(opt)}) = \sqrt{\sigma_i}\, U(:, i), \qquad \mathrm{vec}(C_i^{(opt)}) = \sqrt{\sigma_i}\, V(:, i), \qquad i = 1{:}3.
$$

The Complete KP-SVD
Given:
$$
A = \begin{bmatrix} A_{11} & \cdots & A_{1N} \\ \vdots & \ddots & \vdots \\ A_{M1} & \cdots & A_{MN} \end{bmatrix},
\qquad A_{ij} \in \mathbb{R}^{p \times q}.
$$
Form $\tilde A$ ($MN$-by-$pq$) and apply the LAPACK SVD:
$$
\tilde A = \sum_{i=1}^{r_A} \sigma_i u_i v_i^T.
$$
Then:
$$
A = \sum_{i=1}^{r_A} \sigma_i \cdot \mathrm{reshape}(u_i, M, N) \otimes \mathrm{reshape}(v_i, p, q).
$$
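In code, the complete KP-SVD is one dense SVD plus reshaping. A sketch (mine), reusing the rearrange helper from the earlier nearest-KP sketch:

    import numpy as np

    def kp_svd(A, M, N, p, q, r):
        # Leading r terms of the KP-SVD, A ~ sum_k sigma_k * kron(B_k, C_k),
        # via the SVD of the rearranged matrix.
        U, s, Vt = np.linalg.svd(rearrange(A, M, N, p, q), full_matrices=False)
        return [(s[k],
                 U[:, k].reshape(M, N, order="F"),
                 Vt[k].reshape(p, q, order="F")) for k in range(r)]

    # X_opt = sum(sigma * np.kron(Bk, Ck) for sigma, Bk, Ck in kp_svd(...))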
The Theorems Follow From This

$$
A \iff \tilde A
$$
$$
A = \sum_{i=1}^{r_A} \sigma_i\, B_i \otimes C_i \iff \tilde A = \sum_{i=1}^{r_A} \sigma_i u_i v_i^T
$$

A Related Problem
Problem. Find X and Y to minimize
$$
\| A - (X \otimes Y - Y \otimes X) \|_F.
$$

Solution. Find vectors x and y so that
$$
\| \tilde A - (x y^T - y x^T) \|_F
$$
is minimized and reshape x and y to get $X^{(opt)}$ and $Y^{(opt)}$.

The Schur decomposition of $\tilde A - \tilde A^T$ is involved.

Another Related Problem
Problem. Find X to minimize
$$
\| A - X \otimes X \|_F.
$$

Solution. Find a vector x so that
$$
\| \tilde A - x x^T \|_F
$$
is minimized and reshape to get $X^{(opt)}$.

The Schur decomposition of $\tilde A + \tilde A^T$ is involved.

A Much More Difficult Problem
$$
\min_{B, C, D} \; \| A - B \otimes C \otimes D \|_F
$$

Computational multilinear algebra is filled with problems like this.

Nearest KP Preconditioners

Main Idea
(i) Suppose A is an N-by-N block matrix with p-by-p blocks.

(ii) We need to solve $Ax = b$. Ordinarily this costs $O(N^3 p^3)$.

(iii) A system of the form
$$
(B_1 \otimes C_1 + B_2 \otimes C_2)\, z = r
$$
can be solved in $O(N^3 + p^3)$ time. Hint: it reshapes to $C_1 Z B_1^T + C_2 Z B_2^T = R$. (A one-term code sketch follows this list.)

(iv) If
$$
B_1 \otimes C_1 + B_2 \otimes C_2 \approx A,
$$
we have a potential preconditioner.
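To make (iii) concrete, here is a one-term sketch in NumPy (mine, not the talk's code). A single KP system reshapes to a matrix equation costing two small solves; the two-term system is the generalized Sylvester equation in the hint, solvable at the same cost with a QZ decomposition:

    import numpy as np

    def solve_kron(B, C, r):
        # (B kron C) z = r  <=>  C Z B^T = R, with z = vec(Z), r = vec(R).
        # Two small solves: O(N^3 + p^3) instead of O(N^3 p^3).
        N, p = B.shape[0], C.shape[0]
        R = r.reshape(p, N, order="F")
        Z = np.linalg.solve(C, np.linalg.solve(B, R.T).T)   # Z = C^{-1} R B^{-T}
        return Z.reshape(-1, order="F")

    rng = np.random.default_rng(3)
    B, C = rng.standard_normal((6, 6)), rng.standard_normal((4, 4))
    r = rng.standard_normal(24)
    assert np.allclose(np.kron(B, C) @ solve_kron(B, C, r), r)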
A Block Toeplitz De-Blurring Problem

(Nagy, O'Leary, Kamm (1998))

Need to solve a large block Toeplitz system $Tx = b$.

Preconditioner:
$$
T \approx T_1 \otimes T_2
$$

Can solve the nearest KP problem with the constraint that the factor matrices $T_1$ and $T_2$ are Toeplitz.

A Poisson-Related Problem
Poisson's equation on a rectangle with a regular (M+1)-by-(N+1) grid discretizes to
$$
Au = (I_M \otimes T_N + T_M \otimes I_N)\, u = f,
$$
where the T's are 1-2-1 tridiagonals. Can be solved very fast.
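A NumPy sketch of the fast solve (mine; a production code would use fast sine transforms rather than dense eigenvector matrices). Diagonalizing both tridiagonals reduces the system to a pointwise division:

    import numpy as np

    def second_diff(n):
        # The (-1, 2, -1) second-difference tridiagonal from the discrete
        # Laplacian (symmetric positive definite).
        return 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)

    def poisson_solve(TM, TN, F):
        # Solve (I_M kron T_N + T_M kron I_N) vec(U) = vec(F), i.e. the
        # matrix equation T_N U + U T_M = F, for symmetric T_M, T_N.
        lamN, QN = np.linalg.eigh(TN)
        lamM, QM = np.linalg.eigh(TM)
        Ft = QN.T @ F @ QM
        return QN @ (Ft / (lamN[:, None] + lamM[None, :])) @ QM.T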
A new method for the Navier-Stokes problem being developed by Diamessis and Escobar-Vargas leads to a linear system whose highly structured A-matrix has KP-rank $r_A = 16$.

Looking for a KP-preconditioner M of the form
$$
M = B_1 \otimes C_1 + B_2 \otimes C_2.
$$

Some Constrained Nearest KP Problems
Joint with Stefan Ragnarsson

NOT Inheriting Structure
In the problem
$$
\min_{B, C} \; \| A - B \otimes C \|_F,
$$
sometimes B and C fail to inherit A's special attributes:

If A is stochastic (orthogonal), then B and C are not quite stochastic (orthogonal).

KP Approximation of Stochastic Matrices
If $A \in \mathbb{R}^{n \times n}$, $B \in \mathbb{R}^{n_1 \times n_1}$, and $C \in \mathbb{R}^{n_2 \times n_2}$ are stochastic and
$$
A = B \otimes C,
$$
then each A-entry has the form $b_{ij} c_{pq}$. The states are clustered into groups $G_1, \ldots, G_{n_1}$, each of size $n_2$, and
$$
b_{ij} = \mathrm{prob}(G_j \to G_i), \qquad c_{pq} = \mathrm{prob}(\text{state } q \to \text{state } p \text{ within any group}).
$$

References:
"Aggregation of Stochastic Automata Networks with Replicas" (A. Benoit, L. Brenner, P. Fernandes, B. Plateau)

"Analyzing Markov Chains Using Kronecker Products" (T. Dayar)

A Bilinear Optimization Strategy
Given an initial guess C...

Repeat Until Converged:
$$
\min_{B \;\text{stochastic}} \; \| A - B \otimes C \|_F \qquad (C \text{ fixed})
$$
$$
\min_{C \;\text{stochastic}} \; \| A - B \otimes C \|_F \qquad (B \text{ fixed})
$$
end

These are linear, constrained least squares problems.

Reshaping
The problem
$$
\min_{C \;\text{stochastic}} \; \| A - B \otimes C \|_F \qquad (B \text{ fixed})
$$
is equivalent to
$$
\min \; \| Mx - f \|_2 \qquad \text{subject to} \quad x \ge 0, \;\; Ex = e,
$$
where $x = \mathrm{vec}(C)$, $f = \mathrm{vec}(\tilde A)$, $M = I \otimes \mathrm{vec}(B)$, $e = \mathrm{ones}(m, 1)$ with $m = n_2$, and $E = I_m \otimes e^T$. The linear constraint forces C (a.k.a. x) to have unit column sums.
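A hedged SciPy sketch of this constrained step (the function name and the SLSQP solver choice are mine; SLSQP is simple but not the fastest option at scale):

    import numpy as np
    from scipy.optimize import minimize

    def stochastic_ls(M, f, n2):
        # min ||M x - f||_2 subject to x >= 0 and Ex = e (unit column sums),
        # with x = vec(C) for an n2-by-n2 matrix C.
        E = np.kron(np.eye(n2), np.ones((1, n2)))     # E = I kron e^T
        x0 = np.full(n2 * n2, 1.0 / n2)               # uniform stochastic start
        res = minimize(lambda x: 0.5 * np.sum((M @ x - f) ** 2), x0,
                       jac=lambda x: M.T @ (M @ x - f),
                       bounds=[(0.0, None)] * (n2 * n2),
                       constraints=[{"type": "eq", "fun": lambda x: E @ x - 1.0}],
                       method="SLSQP")
        return res.x.reshape(n2, n2, order="F")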
Example

If
$$
A = \begin{bmatrix}
0.2444 & 0.1950 & 0.2129 & 0.1850 & 0.1202 & 0.1682 \\
0.2367 & 0.2712 & 0.2526 & 0.2573 & 0.1857 & 0.2249 \\
0.1811 & 0.2348 & 0.2236 & 0.1415 & 0.2900 & 0.1481 \\
0.1198 & 0.0949 & 0.1105 & 0.1147 & 0.0822 & 0.1802 \\
0.1422 & 0.1091 & 0.0938 & 0.1709 & 0.1405 & 0.1570 \\
0.0757 & 0.0949 & 0.1065 & 0.1306 & 0.1813 & 0.1217
\end{bmatrix},
$$
the matrices B and C obtained by the unconstrained SVD minimization of $\| A - B \otimes C \|_F$ are approximately stochastic:
$$
B = \begin{bmatrix} 0.3301 & 0.2449 & 0.3246 \\ 0.3925 & 0.3542 & 0.3657 \\ 0.2611 & 0.3993 & 0.2963 \end{bmatrix},
\qquad
C = \begin{bmatrix} 0.6842 & 0.5890 \\ 0.3158 & 0.4320 \end{bmatrix}.
$$

Example (Cont'd)

Using the B and C above as the initial guess for the successive nonnegative least squares iteration, we get the "exactly" stochastic matrices
$$
B_{LS} = \begin{bmatrix} 0.3359 & 0.2449 & 0.3289 \\ 0.3984 & 0.3552 & 0.3704 \\ 0.2658 & 0.3998 & 0.3008 \end{bmatrix},
\qquad
C_{LS} = \begin{bmatrix} 0.6823 & 0.5776 \\ 0.3177 & 0.4224 \end{bmatrix}.
$$

Work per iteration is roughly quadratic in the dimension of A. (MathWorks Optimization Toolbox and PROPACK (R.M. Larsen).)

A Note on Ordering
This problem assumes that we know how to group the states:
$$
\min_{B,\, C \;\text{stochastic}} \; \| A - B \otimes C \|_F
$$

This one doesn't:
$$
\min_{\substack{B,\, C \;\text{stochastic} \\ P \;\text{permutation}}} \; \| P A P^T - B \otimes C \|_F
$$

The Inverse Times Table Problem
Suppose we have the stationary vector $x_A$ for A, i.e.,
$$
A x_A = x_A, \qquad x_A > 0, \qquad \mathrm{sum}(x_A) = 1.
$$
Then
$$
P A P^T = B \otimes C, \quad B x_B = x_B, \quad C x_C = x_C \;\Rightarrow\; P x_A = x_B \otimes x_C.
$$
If we know $x_A$, can we figure out P so that $P x_A$ is the Kronecker product of two smaller vectors?

Inverse TT Cont'd
Suppose
$$
x_A^T = [\; 2 \;\; 3 \;\; 4 \;\; 6 \;\; 7 \;\; 9 \;\; 10 \;\; 12 \;\; 16 \;\; 18 \;\; 21 \;\; 24 \;\; 27 \;\; 30 \;\; 32 \;\; 36 \;\; 56 \;\; 63 \;\; 80 \;\; 90 \;]
$$
and we seek a permutation $P \in \mathbb{R}^{20 \times 20}$ so that
$$
P x_A = \begin{bmatrix} b_1 \\ b_2 \\ b_3 \\ b_4 \end{bmatrix} \otimes \begin{bmatrix} c_1 \\ c_2 \\ c_3 \\ c_4 \\ c_5 \end{bmatrix}.
$$
What are $x_B$ and $x_C$?

Inverse TT Cont'd
$$
x_A^T = [\; 2 \;\; 3 \;\; 4 \;\; 6 \;\; 7 \;\; 9 \;\; 10 \;\; 12 \;\; 16 \;\; 18 \;\; 21 \;\; 24 \;\; 27 \;\; 30 \;\; 32 \;\; 36 \;\; 56 \;\; 63 \;\; 80 \;\; 90 \;]
$$
$$
\mathrm{reshape}(P x_A, 5, 4) =
\begin{bmatrix} c_1 \\ c_2 \\ c_3 \\ c_4 \\ c_5 \end{bmatrix}
\begin{bmatrix} b_1 & b_2 & b_3 & b_4 \end{bmatrix}
=
\begin{bmatrix}
24 & 9 & 3 & 27 \\
56 & 21 & 7 & 63 \\
80 & 30 & 10 & 90 \\
16 & 6 & 2 & 18 \\
32 & 12 & 4 & 36
\end{bmatrix},
$$
so $x_C = [\,3 \;\; 7 \;\; 10 \;\; 2 \;\; 4\,]^T$ and $x_B = [\,8 \;\; 3 \;\; 1 \;\; 9\,]^T$.

Quick Aside: Nearest Orthogonal KP
$$
A = \begin{bmatrix}
-.447 & -.032 & -.528 & .384 & .031 & .406 & .308 & .006 & .330 \\
-.497 & -.243 & .464 & .367 & .183 & -.308 & .320 & .187 & -.274 \\
-.205 & .654 & .150 & .105 & -.494 & -.134 & .138 & -.442 & -.132 \\
-.381 & -.021 & -.406 & -.006 & -.004 & -.019 & -.562 & -.022 & -.609 \\
-.404 & -.167 & .327 & -.021 & -.003 & .001 & -.562 & -.290 & .548 \\
-.107 & .530 & .131 & -.003 & .024 & .011 & -.218 & .777 & .191 \\
-.299 & -.022 & -.342 & -.559 & .000 & -.608 & .236 & .048 & .225 \\
-.298 & -.157 & .254 & -.581 & -.274 & .568 & .208 & .105 & -.175 \\
-.104 & .419 & .097 & -.233 & .802 & .165 & .059 & -.257 & -.087
\end{bmatrix}
$$

$$
\approx (\text{3-by-3 orthogonal } B) \otimes (\text{3-by-3 orthogonal } C)
$$

Nearest Orthogonal KP (Cont'd)
The unconstrained KP-SVD minimization gives
$$
B_0 = \begin{bmatrix} 0.7042 & -0.5335 & -0.2713 \\ 0.5563 & 0.0030 & 0.4618 \\ 0.4412 & 0.8460 & -0.1679 \end{bmatrix},
\qquad
C_0 = \begin{bmatrix} -0.7433 & -0.0069 & -0.4633 \\ -0.7671 & -0.3743 & 0.3931 \\ -0.2822 & 1.0394 & 0.1272 \end{bmatrix},
$$
but
$$
\| B_0^T B_0 - I_3 \|_2 \approx \| C_0^T C_0 - I_3 \|_2 \approx .643.
$$

After 2 iterations of alternating bilinear clean-up:
$$
B_{LS} = \begin{bmatrix} 0.7025 & -0.5305 & -0.4745 \\ 0.5607 & 0.0019 & 0.8280 \\ 0.4383 & 0.8477 & -0.2988 \end{bmatrix},
\qquad
C_{LS} = \begin{bmatrix} -0.6701 & -0.0123 & -0.7422 \\ -0.6962 & -0.3365 & 0.6341 \\ -0.2576 & 0.9416 & 0.2169 \end{bmatrix},
$$
giving
$$
\| B_{LS}^T B_{LS} - I_3 \|_2 \approx \| C_{LS}^T C_{LS} - I_3 \|_2 \approx 10^{-4}.
$$

Nearest Orthogonal KP (Cont'd)
The problem
$$
\min_{C \;\text{orthogonal}} \; \| A - B \otimes C \|_F \qquad (B \text{ fixed})
$$
is equivalent to an orthogonal Procrustes problem with a simple SVD solution:
$$
U^T \left( \sum_i \sum_j b_{ij} A_{ij} \right) V = \Sigma, \qquad C_{opt} = U V^T.
$$

Bojanczyk and Lutoborski (2003) solved a related problem.
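A short NumPy sketch of this B-fixed step (names are mine; A_blocks[i][j] is assumed to hold the block A_ij):

    import numpy as np

    def procrustes_C(A_blocks, B):
        # The orthogonal C minimizing ||A - kron(B, C)||_F maximizes
        # tr(C^T M) with M = sum_ij b_ij * A_ij -- classic Procrustes.
        M = sum(B[i, j] * A_blocks[i][j]
                for i in range(B.shape[0]) for j in range(B.shape[1]))
        U, _, Vt = np.linalg.svd(M)
        return U @ Vt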
A Wireless Bandwidth Problem

Given $H_1, \ldots, H_N \in \mathbb{R}^{p \times q}$, find $C \in \mathbb{R}^{p \times r}$ and $W \in \mathbb{R}^{q \times r}$ with orthonormal columns so that
$$
\psi(C, W) = \sum_{k=1}^{N} \sigma_1(C^T H_k W)^2
$$
is maximized.

(Joint with J. Nsenga (CETIC), S. Ragnarsson)

A Wireless Bandwidth Problem Cont'd
$$
\psi(C, W) = \sum_{k=1}^{N} \| C^T H_k W \|_2^2
\;\le\; \sum_{k=1}^{N} \| C^T H_k W \|_F^2
= \sum_{k=1}^{N} \| (C \otimes W)^T \mathrm{vec}(H_k) \|_2^2
$$
$$
= \mathrm{tr}\big( (C \otimes W)^T S\, (C \otimes W) \big) = \tilde\psi(C, W),
$$
where
$$
S = \sum_{k=1}^{N} \mathrm{vec}(H_k)\, \mathrm{vec}(H_k)^T.
$$

A Wireless Bandwidth Problem Cont'd
Solution Approach. If
$$
S \approx S_1 \otimes S_2,
$$
then
$$
\tilde\psi(C, W) \approx \mathrm{tr}\big( (C \otimes W)^T (S_1 \otimes S_2)(C \otimes W) \big)
= \mathrm{tr}(C^T S_1 C) \cdot \mathrm{tr}(W^T S_2 W).
$$

The trace of $Q^T M Q$ with $Q \in \mathbb{R}^{n \times r}$ (orthonormal columns) is maximized if $\mathrm{ran}(Q)$ is the span of the r-dimensional dominant invariant subspace of M. An "easy computation."
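That computation, as a small NumPy sketch (assuming M symmetric):

    import numpy as np

    def dominant_subspace(M, r):
        # Orthonormal Q maximizing tr(Q^T M Q): the eigenvectors of the
        # r largest eigenvalues of symmetric M (eigh sorts ascending).
        _, Q = np.linalg.eigh(M)
        return Q[:, -r:]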
Connections to Computational Multilinear Algebra

A $2^d$-by-$2^d$ $Hx = \lambda x$ Problem

$$
H = \sum_{i,j}^{d} t_{ij}\, H_i^T H_j + \sum_{i,j,k,\ell}^{d} v_{ijk\ell}\, H_i^T H_j^T H_k H_\ell
$$

$$
H_i = I_{2^{i-1}} \otimes \begin{bmatrix} 0 & 1 \\ 0 & 0 \end{bmatrix} \otimes I_{2^{d-i}}
$$

$$
T = T(1{:}d, 1{:}d), \qquad \mathcal{V} = \mathcal{V}(1{:}d, 1{:}d, 1{:}d, 1{:}d)
$$

Matrix T is symmetric. Tensor $\mathcal{V} = (v_{ijk\ell})$ also has symmetries.

The H-Matrix
[spy plot of the 1024-by-1024 H-matrix; nz = 104703]
$$
\mathrm{nzeros} = \left( \frac{1}{64} d^4 - \frac{3}{32} d^3 + \frac{27}{64} d^2 - \frac{11}{32} d + 1 \right) 2^d - 1
$$

Some Fourth-Order Tensor Symmetries
The tensor $\mathcal{V}$ in our problem frequently has these symmetries:
$$
\mathcal{V}(i,j,k,\ell) \;=\; \mathcal{V}(j,i,k,\ell) \;=\; \mathcal{V}(i,j,\ell,k) \;=\; \mathcal{V}(k,\ell,i,j)
$$

Let's Flatten $\mathcal{V}$ ...
$$
\tilde{\mathcal{V}} = \begin{bmatrix}
\mathcal{V}(:,:,1,1) & \mathcal{V}(:,:,1,2) & \mathcal{V}(:,:,1,3) & \mathcal{V}(:,:,1,4) \\
\mathcal{V}(:,:,2,1) & \mathcal{V}(:,:,2,2) & \mathcal{V}(:,:,2,3) & \mathcal{V}(:,:,2,4) \\
\mathcal{V}(:,:,3,1) & \mathcal{V}(:,:,3,2) & \mathcal{V}(:,:,3,3) & \mathcal{V}(:,:,3,4) \\
\mathcal{V}(:,:,4,1) & \mathcal{V}(:,:,4,2) & \mathcal{V}(:,:,4,3) & \mathcal{V}(:,:,4,4)
\end{bmatrix}
$$
and see what happens to
$$
\mathcal{V}(i,j,k,\ell) = \mathcal{V}(j,i,k,\ell) = \mathcal{V}(i,j,\ell,k) = \mathcal{V}(k,\ell,i,j).
$$

Flattened Symmetries
Block symmetry:
$$
\mathcal{V}(i,j,k,\ell) = \mathcal{V}(i,j,\ell,k) \;\Rightarrow\; \tilde{\mathcal{V}}_{k,\ell} = \tilde{\mathcal{V}}_{\ell,k}
$$

Symmetric blocks:
$$
\mathcal{V}(i,j,k,\ell) = \mathcal{V}(j,i,k,\ell) \;\Rightarrow\; \tilde{\mathcal{V}}_{k,\ell} = \tilde{\mathcal{V}}_{k,\ell}^T
$$

Perfect shuffle symmetry:
$$
\mathcal{V}(i,j,k,\ell) = \mathcal{V}(k,\ell,i,j) \;\Rightarrow\; \Pi^T \tilde{\mathcal{V}} \Pi = \tilde{\mathcal{V}},
$$
where $\Pi$ is a perfect shuffle permutation.

A Sample $\tilde{\mathcal{V}}$
$$
\begin{bmatrix}
280 & 206 & 100 & 206 & 182 & 187 & 100 & 187 & 296 \\
206 & 328 & 188 & 182 & 138 & 148 & 187 & 244 & 143 \\
100 & 188 & 176 & 187 & 148 & 122 & 296 & 143 & 326 \\
206 & 182 & 187 & 328 & 138 & 244 & 188 & 148 & 143 \\
182 & 138 & 148 & 138 & 312 & 192 & 148 & 192 & 212 \\
187 & 148 & 122 & 244 & 192 & 272 & 143 & 212 & 200 \\
100 & 187 & 296 & 188 & 148 & 143 & 176 & 122 & 326 \\
187 & 244 & 143 & 148 & 192 & 212 & 122 & 272 & 200 \\
296 & 143 & 326 & 143 & 212 & 200 & 326 & 200 & 280
\end{bmatrix}
$$

The KP-SVD of $\tilde{\mathcal{V}}$ is Highly Structured
$$
\tilde{\mathcal{V}} = \sum_{i=1}^{r} \sigma_i\, B_i \otimes B_i, \qquad B_i \;\text{symmetric}
$$

If $\tilde{\mathcal{V}} \approx \sigma_1 B_1 \otimes B_1$, then
$$
\mathcal{V}(i,j,k,\ell) \approx \sigma_1 B_1(i,j) B_1(k,\ell)
$$
and...
$$
\sum_{i=1}^{d} \sum_{j=1}^{d} \sum_{k=1}^{d} \sum_{\ell=1}^{d} \mathcal{V}(i,j,k,\ell)\, H_i^T H_j^T H_k H_\ell
\;\approx\;
\sigma_1 \sum_{i=1}^{d} \sum_{j=1}^{d} \sum_{k=1}^{d} \sum_{\ell=1}^{d} B_1(i,j) B_1(k,\ell)\, H_i^T H_j^T H_k H_\ell
$$
$$
= \sigma_1 \left[ \sum_{k=1}^{d} \sum_{\ell=1}^{d} B_1(k,\ell)\, H_k H_\ell \right]^T \left[ \sum_{k=1}^{d} \sum_{\ell=1}^{d} B_1(k,\ell)\, H_k H_\ell \right],
$$
and H-manipulation reduces from $O(d^4)$ to $O(d^2)$.

Just-For-Fun: log(det(A))

The Logarithm of the Determinant
Suppose $A \in \mathbb{R}^{n \times n}$ is positive definite with eigenvalues $\lambda_1, \ldots, \lambda_n$.

The problem of computing
$$
\log(\det(A)) = \log(\lambda_1 \cdots \lambda_n) = \sum_{k=1}^{n} \log(\lambda_k)
$$
can arise in certain maximum likelihood estimation settings.

Solution Approaches
(i) If n is modest, then compute the Cholesky factorization $A = GG^T$ and use (a code sketch follows this list)
$$
\log(\det(A)) = \log(\det(GG^T)) = \log(\det(G)^2) = 2\log(g_{11} \cdots g_{nn}) = 2 \sum_{k=1}^{n} \log(g_{kk}).
$$

(ii) If A is large and sparse, then Monte Carlo. See Barry and Pace (1999) and also M. McCourt (2008).
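The sketch promised in (i), in NumPy (mine; numpy.linalg.slogdet does the same job, but the Cholesky route makes the no-overflow property explicit):

    import numpy as np

    def logdet_chol(A):
        # log(det(A)) for symmetric positive definite A via Cholesky A = G G^T:
        # 2*sum(log(g_kk)) never forms det(A) itself, so it cannot overflow.
        G = np.linalg.cholesky(A)
        return 2.0 * np.sum(np.log(np.diag(G)))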
Nearest KP Approach

Suppose $n = n_1 n_2$ and $B \otimes C$ is the nearest KP to A, with $B \in \mathbb{R}^{n_1 \times n_1}$ and $C \in \mathbb{R}^{n_2 \times n_2}$. It can be shown that B and C are symmetric positive definite, and
$$
\log(\det(A)) \approx \log(\det(B \otimes C)) = \log\!\big( \det(B)^{n_2} \det(C)^{n_1} \big) = n_2 \log(\det(B)) + n_1 \log(\det(C)).
$$

I.e., the log(det(A)) problem breaks down into a pair of (much) smaller log-det problems.
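Combining the earlier nearest_kron and logdet_chol sketches gives a tiny illustration of this splitting (my code; the sign flip resolves the scale ambiguity in the SVD factors, since (-B) ⊗ (-C) = B ⊗ C):

    import numpy as np

    def logdet_kron_approx(A, n1, n2):
        # Approximate log(det(A)) via the nearest Kronecker product B kron C:
        # log(det(B kron C)) = n2*log(det(B)) + n1*log(det(C)).
        B, C = nearest_kron(A, n1, n1, n2, n2)
        if np.trace(B) < 0:          # fix signs so B, C are positive definite
            B, C = -B, -C
        return n2 * logdet_chol(B) + n1 * logdet_chol(C)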
What If...

What if $A \approx B \otimes C$ isn't good enough? What if
$$
A \approx (B_1 \otimes C_1)(B_2 \otimes C_2)(B_3 \otimes C_3)
$$
is good enough, where $B_i \in \mathbb{R}^{m_i \times m_i}$ and $C_i \in \mathbb{R}^{(n/m_i) \times (n/m_i)}$ for $i = 1{:}3$?

Then
$$
\log(\det(A)) \approx \sum_{i=1}^{3} \big( (n/m_i) \log(\det(B_i)) + m_i \log(\det(C_i)) \big).
$$

Conclusion
The KP-SVD can serve as a bridge from small-n problems to large-n problems and, more generally, from numerical linear algebra to numerical multilinear algebra.