<<

and Rank

Lek-Heng Lim

Workshop on Tensor Decompositions and Applications CIRM, Luminy, France August 29–September 2, 2005

Thanks: NSF DMS 01-01364 Collaborators

Pierre Comon Laboratoire I3S Universit´ede Nice Sophia Antipolis

Vin de Silva Department of Mathematics Stanford University

2 Multiplication

Let f : U → V and g : V → W be linear maps; U, V, W vector spaces over R of dimensions n, m, l.

With choice of bases on U, V, W , g, f have matrix representations l×m m×n A = [aij] ∈ R ,B = [bjk] ∈ R .

The matrix representation of h = g ◦ f (ie. h(x) := g(f(x))) is l×n then C = [cik] ∈ R where n c := X a b . ik j=1 ij jk

Similarly for bilinear g : V1 × V2 → R and linear f1 : U1 → V1, f2 : d ×d d ×s U2 → V2 with matrix representations A ∈ R 1 2, B1 ∈ R 1 1, d ×s B2 ∈ R 2 2.

The composite map h, where h(x, y) := g(f1(x), f2(y)), has ma- trix representation

> s1×s2 C = B2 AB1 ∈ R . 3 Multilinear Matrix Multiplication

Do the same for g : V1 × · · · × Vk → R and linear maps f1 : U1 → V1, . . . , fk : Uk → Vk; dim(Vi) = si, dim(Ui) = di.

With choice of bases on Vi’s and Ui’s, g is represented by A = d1×···×dk and by = [ 1 ] d1×s1 = aj1···j ∈ R f1, . . . , fk M1 mj i ∈ R ,...,Mk J kK 1 1 [mk ] ∈ dk×sk. jkik R

If we compose g by f1, . . . , fk to get h : U1 × · · · × Uk → R defined by

h(x1,..., xk) = g(f(x1), . . . , f(xk)), s1×···×sk then h is represented by ci1···i ∈ R where J kK c := Pd1 ··· Pdk a m1 ··· mk . (1) i1···ik j1=1 jk=1 j1···jk j1i1 jkik

The covariant multilinear matrix multiplication will be written

s1×···×sk A(M1,...,Mk) := ci1···i ∈ R . J kK 4 Contravariant Version

The contravariant multilinear matrix multiplication of aj1···jk ∈ d1×···×dk by matrices L = [`1 ] ∈ r1×d1,...,L =J [`k ]K ∈ R 1 i1j1 R k ikjk r ×d R k k is defined by

r1×···×rk (L1,...,Lk)A = bi1···i ∈ R , J kK b := Pd1 ··· Pdk `1 ··· `k a . (2) i1···ik j1=1 jk=1 i1j1 ikjk j1···jk ∗ This comes from the composition of a multilinear map g : V1 × ∗ · · · × Vk → R by linear maps f1 : V1 → U1, . . . , fk : Vk → Uk.

Simple relation if we disregard covariance/contravariance:

> > (L1,...,Lk)A = A(L1 ,...,Lk ) > > A(M1,...,Mk) = (M1 ,...,Mk )A.

> † Works over C too (replace Li by Li ). 5 Properties

d ×···×d r ×d • Let A, B ∈ R 1 k and λ, µ ∈ R. Let L1 ∈ R 1 1,..., Lk ∈ r ×d R k k. Then

(L1,...,Lk)(λA + µB) = λ(L1,...,Lk)A + µ(L1,...,Lk)B.

d ×···×d r ×d r ×d • Let A ∈ R 1 k. Let L1 ∈ R 1 1,..., Lk ∈ R k k, and s ×r s ×r M1 ∈ R 1 1,..., Mk ∈ R k k. Then

(M1,..., Mk)(L1,..., Lk)A = (M1L1,..., MkLk)A s ×d where MiLi ∈ R i i is simply the matrix-matrix product of Mi and Li. d ×···×d r ×d • Let A ∈ R 1 k and λ, µ ∈ R. Let L1 ∈ R 1 1,..., Lj, Mj ∈ rj×dj r ×d R ,...,Lk ∈ R k k. Then

A(L1, . . . , λLj + µMj,...,Lk) = λ(L1,..., Lj,...,Lk)A + µ(L1,..., Mj,...,Lk)A.

6 Aside: Relation with Kronecker Product

d ×···×d d ···d Forgetful map R 1 k → R 1 k, A 7→ vec(A) (‘forgets’ the multilinear structure), then

vec((L1,...,Lk)A) = L1 ⊗ · · · ⊗ Lk vec(A). d ···d ×d ···d where L1 ⊗ · · · ⊗ Lk ∈ R 1 k 1 k is the Kronecker product of L1,...,Lk.

7 Matrix Techniques

m×n×l m×n×l Start with R and C . l = 2 is well understood, may m×n 2 m×n 2 be regarded as pairs of matrices (A, B) ∈ (C ) or (R ) , m×n or equivalently, as a matrix pencil λA + µB ∈ C[λ, µ] or m×n R[λ, µ] . Kronecker-Weierstrass Theory. There exist S ∈ GL(m),T ∈ GL(n) such that (SAT, SBT ) can be decomposed into block pairs of the following forms 1  0  0 1 1 0  ..   ..  (p+1)×p 0 . , 1 . ∈ R ,  .   .   .. 1  .. 0 0 1 1 0  0 1  1 0 0 1 , ∈ q×(q+1),  ......   ......  R 1 0 0 1 1  0 −a0  . . 1 1 .. . , ∈ r×r.  ...   ..  R . 0 −ar−2 1 1 −ar−1 Likewise for C. Similar but simpler results obtained by Jos ten Berge for generic pairs. 8 Larger Sizes and Higher Orders

Want to obtain results as general as possible — for of arbitrary size and order over both R and C. For larger values of k or d1, . . . , dk, techniques relying on multilinear matrix multipli- cations become increasingly less effective.

Inherent limitation:

d1×···×d k dim(R k) = d1 ····· dk = O(d ) while 2 2 2 dim(GL(d1) × · · · × GL(dk)) = d1 + ··· + dk = O(kd ) and 2 dim(O(d1)×· · ·×O(dk)) = d1(d1−1)/2+···+dk(dk−1)/2 = O(kd ). d ×···×d The action of GL(d1)×· · ·×GL(dk) on R 1 k has uncountably many orbits,

{(L1,...,Lk)A | (L1,...,Lk) ∈ GL(d1) × · · · × GL(dk)}, as soon as di > 2, k > 4. 9 Multilinear Functional and its Gradient

d ×···×d Multilinear functional associated with A ∈ R 1 k, ie. d1 d fA : R × · · · × R k → R, (3) Pd1 Pdk 1 k (x1,..., x ) 7→ ··· a x ··· x , k j1=1 jk=1 j1···jk j1 jk can be written as

fA(x1,..., xk) = A(x1,..., xk) (4) where the rhs is the right multilinear multiplication by xi = (xi , . . . , xi )>, regarded as a d × 1 matrix. 1 di i

Gradient of fA may be written as

∇fA = (∇x1fA,..., ∇xkfA) where   ∂fA ∂fA ∇ if (x1,..., x ) =  ,...,  = A(x1,..., xi−1,I , x +1,..., x ). x A k ∂xi ∂xi di i k 1 di denotes identity matrix. Idi di × di 10 Hyperdeterminant

(d +1)×···×(d +1) Work in C 1 k for the time being (di ≥ 1). Consider (d1+1)×···×(d +1) S := {A ∈ C k | ∇fA(x1,..., xk) = 0 for some non-zero (x1,..., xk)}. Theorem (Gelfand, Kapranov, Zelevinsky, 1992). S is a hypersurface if and only if X dj ≤ di i6=j for all j = 1, . . . , k. Let ∆ be the equation of the hypersurface, ie. a multivariate polynomial in the entries of A such that (d +1)×···×(d +1) S = {A ∈ C 1 k | ∆(A) = 0}. Then ∆ may be chosen to have coefficients.

m×n For C , the condition becomes m ≤ n and n ≤ m — that’s why matrix is only defined for square matrices.

Since ∆ has integer coefficients, ∆(A) is real-valued for A ∈ (d +1)×···×(d +1) R 1 k . 11 Geometric View

(d +1)×···×(d +1) d +1 Let X = {x1 ⊗ · · · ⊗ xk ∈ C 1 k | xi ∈ C i } be the (smooth) manifold of decomposable tensors (X oftened called the Segre variety).

(d +1)×···×(d +1) Let A ∈ C 1 k . Then the condition ∇fA(x1,..., xk) = 0 for some non-zero (x1,..., xk) is equivalent to saying that the hyperplane orthogonal to A, ie.

(d1+1)×···×(d +1) HA := {B ∈ C k | hA, Bi = 0} contains a tangent to X at the point x1 ⊗ · · · ⊗ xk. This may also be taken as an alternative definition of the hyperdeterminant ∆(A).

Projective duality: X∗ = S.

12 ¢(A) = 0

rank(A) = 1 ¢(B) ≠ 0

rank(B) = 2 rank(B) = 2 ¢(A) = 0

¢(B) = 0

rank(A) = 1 Minor Inaccuracy

(d +1)×···×(d +1) Should really be working in projective spaces P (C 1 k ) = (d +1)···(d +1)−1 P 1 k . This is the set of equivalence classes (d +1)×···×(d +1) × [A] := {λA ∈ C 1 k | λ ∈ C }.

(d +1)×···×(d +1) Thing to note is that the for any A ∈ C 1 k and × λ ∈ C ,

rank⊗(λA) = rank⊗(A). (d +1)×···×(d +1) So outer-product rank is well-defined in P (C 1 k ), (d +1)×···×(d +1) ie. given [A] ∈ P (C 1 k ) define

rank⊗([A]) = rank⊗(A) for any A ∈ [A].

13 Examples

A. Cayley, “On the theory of linear transformation,” Cambridge Math. J., 4 (1845), pp. 193–209.

2×2×2 Hyperdeterminant of A = [[aijk]] ∈ R is " " # " #! 1 a a a a ∆(A) = det 000 010 + 100 110 4 a001 a011 a101 a111 " # " #!#2 a a a a − det 000 010 − 100 110 a001 a011 a101 a111 " # " # a a a a − 4 det 000 010 det 100 110 . a001 a011 a101 a111

A result that parallels the matrix case is the following: the system

14 of bilinear equations

a000x0y0 + a010x0y1 + a100x1y0 + a110x1y1 = 0, a001x0y0 + a011x0y1 + a101x1y0 + a111x1y1 = 0, a000x0z0 + a001x0z1 + a100x1z0 + a101x1z1 = 0, a010x0z0 + a011x0z1 + a110x1z0 + a111x1z1 = 0, a000y0z0 + a001y0z1 + a010y1z0 + a011y1z1 = 0, a100y0z0 + a101y0z1 + a110y1z0 + a111y1z1 = 0, has a non-trivial solution iff ∆(A) = 0. Examples

2×2×3 Hyperdeterminant of A = [[aijk]] ∈ R is     a000 a001 a002 a100 a101 a102     ∆(A) = det a100 a101 a102 det a010 a011 a012 a010 a011 a012 a110 a111 a112     a000 a001 a002 a000 a001 a002     − det a100 a101 a102 det a010 a011 a012 a110 a111 a112 a110 a111 a112 Again, the following is true:

a000x0y0 + a010x0y1 + a100x1y0 + a110x1y1 = 0,

a001x0y0 + a011x0y1 + a101x1y0 + a111x1y1 = 0,

a002x0y0 + a012x0y1 + a102x1y0 + a112x1y1 = 0,

a000x0z0 + a001x0z1 + a002x0z2 + a100x1z0 + a101x1z1 + a102x1z2 = 0,

a010x0z0 + a011x0z1 + a012x0z2 + a110x1z0 + a111x1z1 + a112x1z2 = 0,

a000y0z0 + a001y0z1 + a002y0z2 + a010y1z0 + a011y1z1 + a012y1z2 = 0,

a100y0z0 + a101y0z1 + a102y0z2 + a110y1z0 + a111y1z1 + a112y1z2 = 0, has a non-trivial solution iff ∆(A) = 0. 15 For more examples, see:

I.M. Gelfand, M.M. Kapranov, and A.V. Zelevinsky, Discrimi- nants, resultants, and multidimensional determinants, Birkh¨auser Publishing, Boston, MA, 1994. Another Explanation for Degeneracy

Degeneracy here means: a sequence of tensors Bn of rank ≤ r converging to a tensor A of rank r+1, ie. A can be approximated arbitrarily well by tensors of lower-rank. In particular, A has no best rank-r approximations.

Question: Why do degeneracy occur in PARAFAC?

One Explanation: Alwin’s talk

Alternative explanation: Degeneracy happens when a sequence of r-secants to X converges to a (r + 1)-secant. Furthermore, this (r + 1)-secant is always tangent to X (since X is a smooth manifold). This doesn’t happen for order-2 tensors because the m×n geometry of X = {A ∈ C | rank(A) ≤ 1} prevents this.

16 rank(B ) = 2 rank(A) = 1

rank(C) = 3 Relation between the two explanations

rank(B) = 2 ¢(A) = 0

¢(B) = 0

rank(A) = 1

17 ¢(A) = 0

rank(A) = 1 ¢(B) ≠ 0

rank(B) = 2 Typical Rank(s) for Real Tensors

Conjecture. For any d1, . . . , dk ≥ 1 satisfying X dj ≤ di i6=j for all j, the outer-product rank is constant on the sets {A ∈ (d +1)×···×(d +1) (d +1)×···×(d +1) R 1 k | ∆(A) < 0} and {A ∈ R 1 k | ∆(A) > 0} (but may take different values).

Corollary. If the conjecture is true, then there exist at most two (d +1)×···×(d +1) typical ranks for R 1 k .

18 R

¢(A) > 0

¢(A) = 0

¢(A) < 0 Generic rank for Complex Tensors

Theorem. For any d1, . . . , dk ≥ 1, a (unique) generic rank always (d +1)×···×(d +1) exist for C 1 k .

C

¢(A) ≠ 0

¢(A) = 0

¢(A) ≠ 0

19 Extending hyperdeterminant

d ×···×d Let A ∈ R 1 k, di ≥ r = rank⊗(A). Let ∆ denote the hyper- r×···×r in R .

r×d Easy to show: There exist Qi ∈ R i with orthonormal columns, i = 1, . . . , k, such that

r×···×r (Q1,...,Qk)A ∈ R .

Lemma. If A can be approximated arbitrarily closely by tensors d ×···×d of rank ≤ r in R 1 k, then (Q1,...,Qk)A can be approximated r ×···×r arbitrarily closely by tensors of rank ≤ r in R 1 k.

Define r-hyperdeterminant of A by

∆k,r(A) := ∆((Q1,...,Qk)A).

Whether ∆k,r(A) = 0 or not is independent of choice of Q1,...,Qk. 20 Conditioning

J.W. Demmel, “On condition numbers and the distance to the nearest ill-posed problem,” Numer. Math., 51 (1987), no. 3, pp. 251–289.

J.W. Demmel, “The geometry of ill-conditioning,” J. Complex- ity, 3 (1987), no. 2, pp. 201–229.

J.W. Demmel, “The probability that a numerical analysis prob- lem is difficult,” Math. Comp., 50 (1988), no. 182, pp. 449–480.

An ill-posed problem is one that lacks either existence or unique- ness or stability (of a solution).

Condition number in a nutshell: The condition number of a well-posed problem measures the distance of that problem to

21 the manifold of ill-posed problems. The larger the condition number, the closer the problem is to an ill-posed one.

Example. The manifold of ill-posed n × n linear system is S = n×n {A ∈ R | det(A) = 0}. For a non-singular B, the (normalized) condition number 1 kB−1k is the distance of B to S. Note that the value of det(B) does not enter into the picture — det(B) can be arbitrarily small for well-conditioned B. Using

d ×···×d The set {A ∈ R 1 k | ∆k,r(A) = 0} characterizes the ill-posed d ×···×d best rank-r approximation problems in R 1 k.

The (normalized) condition number of the problem of finding the best rank-r approximation to B is given by the reciprocal of d ×···×d the distance of B to {A ∈ R 1 k | ∆k,r(A) = 0}.

Note again that the exact value of ∆k,r(A) is unimportant (only whether it is 0).

22 Example: Order-3, Rank-2

l m Easy Fact. Let l, m, n ≥ 2. Let x1, y1 ∈ R , x2, y2 ∈ R , x3, y3 ∈ n R and define l×m×n A := x1 ⊗ x2 ⊗ y3 + x1 ⊗ y2 ⊗ x3 + y1 ⊗ x2 ⊗ x3 ∈ R .

Then rank⊗(A) = 3 if and only if xi, yi are linearly independent, i = 1, 2, 3.

Theorem (de Silva and L., 2004). Let l, m, n ≥ 2. Let A ∈ l×m×n R with rank⊗(A) ≥ 3. The following are equivalent:

l×m×n 1. A is the limit of a sequence Bn ∈ R with rank⊗(Bn) ≤ 2,

2. ∆3,2(A) = 0,

3. A = x1 ⊗ x2 ⊗ y3 + x1 ⊗ y2 ⊗ x3 + y1 ⊗ x2 ⊗ x3.

23