International Journal of Neural Systems, Vol. 13, No. 2 (2003) 1–16 © World Scientific Publishing Company

SINGULAR VALUE DECOMPOSITION LEARNING ON DOUBLE STIEFEL MANIFOLD

SIMONE FIORI Faculty of Engineering, Perugia University, Loc. Pentima bassa, 21, I-05100 Terni (Italy) [email protected]

Received 25 October 2002 Revised 5 March 2003 Accepted 5 March 2003

The aim of this paper is to present a unifying view of four SVD-neural-computation techniques found in the scientific literature and to present some theoretical results on their behavior. The considered SVD neural algorithms are shown to arise as Riemannian-gradient flows on double Stiefel manifold and their geometric and dynamical properties are investigated with the help of differential geometry.

Keywords: Singular value decomposition; orthogonal group; Stiefel manifold; differential geometry; Lyapunov stability.

1. Introduction

The computation of the singular value decomposition (SVD) of a non-square matrix, also referred to as the Autonne–Eckart–Young decomposition,^{18,30} plays a central role in several signal/data automatic processing tasks. Originally developed in numerical algebra to provide quantitative information about the structure of linear systems of equations, it has found widespread applications, e.g. in signal processing,^{7,8,17,31} pattern recognition and classification,^{26} automatic control,^{24,30} digital circuit design, time-series prediction,^{28} image processing^{6,21,25} and connectionism.^{2}

Recently, some efforts have been devoted to SVD computation by neural networks in the neural community;^{5,27,33} the related learning theories emerge as interesting extensions of the well-known neural principal component/subspace analysis techniques,^{26} long investigated during the last 15 years. Also, new light has recently been shed on adaptive second-order (as well as higher-order) statistical decomposition theories by researchers interested in unsupervised learning by non-gradient techniques: for instance, in Ref. 1 a new technique was introduced to enhance the learning capabilities of linear and MLP-type neural networks by the Riemannian gradient; in Refs. 4 and 13 a theoretical derivation/analysis of new principal/minor subspace rules has been carried out; also, in Refs. 9 and 12 a large class of learning rules for MLP-type neural networks, based on first/second-order non-gradient dynamics and Lie-group flows, has been introduced and discussed by the present Author as a theoretical framework for explaining many learning paradigms that have appeared in the scientific literature, while Refs. 10 and 11 were devoted to a particular algorithm of this class, based on the rational kinematics of rigid bodies and its applications to real- and complex-valued signal processing.

The aim of this paper is to present some theoretical notes on parallel SVD computation by unsupervised non-gradient neural learning, with special reference to learning theories involving weight-flows on the double Stiefel manifold. Parallel techniques are considered in opposition to sequential ones, which employ the deflation method, implemented by laterally connected neural architectures, to discard previously computed vectors from the original data.^{5,8}
In particular, we recall from the scientific literature four neural SVD learning theories that appeared independently; then, as a novel contribution to this field, we present:

• a unifying view of the mentioned theories, showing the main relationships among them;
• a stability analysis based on the Lyapunov criterion, aimed at ensuring the non-divergence of the differential equations governing the learning phases of the SVD neural networks trained via the considered methods;
• a computer-based analysis of the learning differential equations, carried out in order to assess their numerical properties.

Throughout the paper we use the following notation. The symbol I_{m,n} denotes the pseudo-identity matrix of size m × n and I_m = I_{m,m}. The symbol X' denotes the transpose of the matrix X, while X^* denotes Hermitian transposition; tr(X) denotes the trace of the square matrix X, i.e. the sum of its diagonal entries; the trace operator enjoys the properties tr(X') = tr(X) and tr(ABC) = tr(CAB) = tr(BCA). We also define the two matrix operators {X, Y} := X'Y − Y'X and [X, Y] := X'Y + Y'X. The following matrix set (termed Stiefel manifold) is also useful for our expository purposes: St(m, n, K) := {X ∈ K^{m×n} | X^*X = I_n}, with m − 1, n − 1 ∈ N; the field K may be either R or C; when m = n the manifold coincides with the orthogonal group O(m, K) := {X ∈ K^{m×m} | X^*X = I_m}. We refer to the product O(m, K) × O(n, K) as the double orthogonal group and to the product St(m, p, K) × St(n, p, K) as the double Stiefel manifold (some definitions and notes on these geometrical entities are available in Appendix A.1). Also, the Frobenius norm of a matrix X is defined as ||X||_F = sqrt(tr(X^*X)).

2. Four Parallel SVD Learning Algorithms: A Unifying View

Denoting by Z ∈ C^{m×n} the matrix whose SVD is to be computed and by r ≤ min{m, n} the rank of Z, the singular value decomposition writes Z = UDV^*, where U ∈ R^{m×m} and V ∈ R^{n×n} are orthogonal matrices and D is a pseudo-diagonal matrix with all-zero entries except for the first r diagonal entries, termed singular values. It is easily checked that the columns of U coincide with the eigenvectors of ZZ^*, while V contains the eigenvectors of Z^*Z with the same eigenvalues.

Here we consider four parallel SVD learning algorithms, which allow the SVD vectors to be computed simultaneously. The considered neural algorithms have been developed by Weingessel and Hornik^{32} and by Helmke and Moore.^{18} These algorithms are utilized to train, in an unsupervised way, a three-layer neural network with the classical 'butterfly' topology (see e.g. Refs. 8, 32 and 33): the first layer has connection matrix A, the second one has connection matrix B, and the middle (hidden) layer provides the network's output. Properly learnt, the network is able to perform the mentioned signal/data processing tasks, such as noise filtering.^{8}

The aim of this section is to show analytically that the algorithms proposed by Weingessel–Hornik and Helmke–Moore are equivalent to some extent. Also, it is shown that, when proper initial conditions are chosen, the associated learning trajectories lie on the double Stiefel manifold.

2.1. The WH2, WH3 and WH4 neural SVD-subspace dynamical systems

In Ref. 32 some new learning equations have been introduced by Weingessel and Hornik in order to compute the SVD-subspace of a given matrix. Here we investigate three of them, expressed as continuous-time differential equations. The derivations presented below make use of matrix differential calculus: a source reference for this is Ref. 23.

Let us denote by A(t) ∈ R^{m×p} the network-connection matrix-stream that should learn p left singular vectors and by B(t) ∈ R^{n×p} the estimator of p right singular vectors of the SVD of the matrix Z ∈ R^{m×n}, with p ≤ r ≤ min{m, n}, where r denotes again the rank of the matrix Z. The algorithm WH2^{32} reads:

    \dot{A} = ZB - AB'Z'A ,    A(0) = A_0 ,
    \dot{B} = Z'A - BA'ZB ,    B(0) = B_0 .        (1)

It has been derived by extending Brockett's work on isospectral flow systems^{3} from single to double orthogonal group; the initial state A_0, B_0 of the dynamical equations may be freely chosen. Here we consider the particular choice A_0 ∈ St(m, p, R) and B_0 ∈ St(n, p, R), as for instance A_0 = I_{m,p} and B_0 = I_{n,p}.
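The paper states the algorithms as ODEs; purely as an illustration (not part of the original text), the following NumPy sketch applies a plain Euler discretization to the WH2 flow (1). The helper name wh2_step is ours; the problem sizes, initial states, stepsize and iteration count are borrowed from the experiment reported in Sec. 4.1, while the random test matrix is an assumption.

    import numpy as np

    def wh2_step(A, B, Z, eta):
        """One Euler step of the WH2 flow (1): dA = ZB - AB'Z'A, dB = Z'A - BA'ZB."""
        dA = Z @ B - A @ B.T @ Z.T @ A
        dB = Z.T @ A - B @ A.T @ Z @ B
        return A + eta * dA, B + eta * dB

    # Illustrative usage with the Sec. 4.1 setting (m = 8, n = 6, p = 3, eta = 0.005)
    m, n, p = 8, 6, 3
    rng = np.random.default_rng(0)
    Z = rng.standard_normal((m, n))
    A = np.eye(m, p)          # A_0 = I_{m,p}
    B = np.eye(n, p)          # B_0 = I_{n,p}
    for _ in range(8000):
        A, B = wh2_step(A, B, Z, eta=0.005)
    # Deviation of the learnt A from orthonormality (the paper reports values of order 1e-15)
    print(np.linalg.norm(A.T @ A - np.eye(p)))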

Theorem 1
If the initial states of the WH2 system belong to the Stiefel manifold, then the whole dynamics is double-Stiefel.

Proof
We wish to prove that if A_0 ∈ St(m, p, R) and B_0 ∈ St(n, p, R), then A(t) ∈ St(m, p, R) and B(t) ∈ St(n, p, R) for all t ≥ 0. To show this for the matrix A, it is sufficient to prove that the trajectory emanating from any point such that A'A = I_p has differential d(A'A) = 0. Since d(A'A) = (dA)'A + A'(dA) = (\dot{A}'A + A'\dot{A})dt = [\dot{A}, A]dt, this may be proven by computing the bracket [\dot{A}, A]:

    [\dot{A}, A] = B'Z'A - A'ZBA'A + A'ZB - A'AB'Z'A
                 = (I_p - A'A)B'Z'A + A'ZB(I_p - A'A)
                 = [I_p - A'A, B'Z'A] = [0, B'Z'A] = 0 ,

from which d(A'A) = 0. In a similar way it can be shown that B_0 ∈ St(n, p, R) implies [dB, B] = 0, which ensures B(t) ∈ St(n, p, R) for t ≥ 0. ∎

The stationary points of the WH2 algorithm, when the state matrices keep within the double Stiefel manifold, may be easily characterized. In fact, we can state the following result:

Theorem 2
The steady states of the WH2 learning system can be written as A = U_pK and B = V_pK, where K is arbitrary in O(p, R) and U_p and V_p denote the sub-matrices whose columns are p left and p right singular vectors of the matrix Z, respectively.

Proof
From the WH2 learning equations we find that the steady states satisfy:

    ZB = AB'Z'A  and  Z'A = BA'ZB .        (2)

At equilibrium, the product A'ZB must be symmetric. To prove this, it is sufficient to use the first of conditions (2):

    S := A'ZB = A'(AB'Z'A) = (A'A)B'Z'A = B'Z'A = S' .

Now, as A belongs to the Stiefel manifold St(m, p, R) at any time and thus has rank p, the equilibrium solution may be parameterized as A = U_pK_a; the same holds for B = V_pK_b, where K_a and K_b are matrices in O(p, R). This ensures that A and B span the SVD-subspace of Z. On the basis of this parameterization, the product A'ZB writes K_a'(U_p'ZV_p)K_b, where, by definition, U_p'ZV_p = D_1, the diagonal matrix of the p singular values. On the other hand, S = K_a'D_1K_b must be symmetric, and this may hold only if K_a = K_b = K. ∎

This shows that the WH2 algorithm does not actually compute the true SVD, but an SVD-subspace of dimension p.

The WH4 learning system introduced in Ref. 32 reads:

    \dot{A} = ZB - (1/2)A(A'ZB + B'Z'A) ,    A(0) = A_0 ,
    \dot{B} = Z'A - (1/2)B(A'ZB + B'Z'A) ,    B(0) = B_0 ,        (3)

which readily rewrites as:

    \dot{A} = ZB - (1/2)A[A, ZB] ,    A(0) = A_0 ,
    \dot{B} = Z'A - (1/2)B[A, ZB] ,    B(0) = B_0 .

Theorem 3
Under the hypotheses A_0 ∈ St(m, p, R) and B_0 ∈ St(n, p, R), the learning equations WH4 keep A(t) and B(t) within the Stiefel manifold.

Proof
In order to demonstrate the claim, let us compute [\dot{A}, A] and [\dot{B}, B]:

    2[\dot{A}, A] = [I_p - A'A, [A, ZB]] ,
    2[\dot{B}, B] = [I_p - B'B, [A, ZB]] .

If A'A = B'B = I_p, then it follows from the above expressions that [dA, A] = 0 and [dB, B] = 0, thus A(t) ∈ St(m, p, R) and B(t) ∈ St(n, p, R) for any t. This proves the claim. ∎
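As a quick numerical illustration of Theorems 1 and 3 (a sketch under assumed random data; the helper name bracket is ours), one can check that [\dot{A}, A] = \dot{A}'A + A'\dot{A} vanishes along the WH2 and WH4 vector fields whenever A'A = I_p:

    import numpy as np

    def bracket(X, Y):
        # [X, Y] := X'Y + Y'X  (the symmetric bracket used in the paper)
        return X.T @ Y + Y.T @ X

    rng = np.random.default_rng(1)
    m, n, p = 8, 6, 3
    Z = rng.standard_normal((m, n))
    A, _ = np.linalg.qr(rng.standard_normal((m, p)))   # random point with A'A = I_p
    B, _ = np.linalg.qr(rng.standard_normal((n, p)))   # random point with B'B = I_p

    dA_wh2 = Z @ B - A @ B.T @ Z.T @ A                 # WH2 field (1)
    S = A.T @ Z @ B + B.T @ Z.T @ A
    dA_wh4 = Z @ B - 0.5 * A @ S                       # WH4 field (3)

    print(np.linalg.norm(bracket(dA_wh2, A)))   # ~1e-15: d(A'A)/dt = 0 for WH2
    print(np.linalg.norm(bracket(dA_wh4, A)))   # ~1e-15: d(A'A)/dt = 0 for WH4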

The WH3 learning system was derived as an extension of the well-known Oja's subspace rule.^{26} The algorithm WH3 reads:

    \dot{A} = ZB - A(A'ZB + B'Z'A) ,    A(0) = A_0 ,
    \dot{B} = Z'A - B(A'ZB + B'Z'A) ,    B(0) = B_0 .        (4)

The dynamical properties of the system WH3 follow as a trivial corollary of Theorem 3:

Corollary 1
Under the hypotheses A_0/\sqrt{2} ∈ St(m, p, R) and B_0/\sqrt{2} ∈ St(n, p, R), the learning equations WH3 keep A(t)/\sqrt{2} and B(t)/\sqrt{2} within the Stiefel manifold. Moreover, WH3 is diffeomorphic to WH4.

Proof
By defining the auxiliary state-matrices A_x := \sqrt{2}A and B_x := \sqrt{2}B, the system (4) turns out to be identical to (3). ∎

The structure of the stationary points of the WH3-4 algorithms is similar to the structure of the equilibria of the WH2 system. This is proven in the following result:

Theorem 4
The steady states of the WH3 and WH4 learning systems write A = U_pK and B = V_pK, where K is arbitrary in O(p, R) and U_p and V_p denote the sub-matrices whose columns are p left and p right singular vectors of the matrix Z, respectively.

Proof
The WH3 and WH4 learning equations may be given a unified expression in the following way:

    \dot{A} = ZB - (1/ν)A(A'ZB + B'Z'A) ,    A(0) = A_0 ,
    \dot{B} = Z'A - (1/ν)B(A'ZB + B'Z'A) ,    B(0) = B_0 ,
    A'A = B'B = (ν/2)I_p ,        (5)

where ν = 1 for the WH3 and ν = 2 for the WH4. The steady states satisfy:

    νZB = AA'ZB + AB'Z'A   and   νZ'A = BA'ZB + BB'Z'A .        (6)

At equilibrium, the product S := A'ZB must be symmetric. To prove this, it suffices to use the first of conditions (6):

    S = A'(ZB) = (1/ν)(A'A)(A'ZB + B'Z'A) = (1/2)(S + S') .

This shows that 2S = S + S' and thus that S = S'. The conclusion now follows from the same argument as in Theorem 2. ∎

2.2. The HM neural SVD dynamical system

The HM dynamics arises from the maximization of a specific metric-criterion Φ_W : O(m, C) × O(n, C) → R defined as:

    Φ_W(A, B) := 2 Re tr(W A^*ZB) ,        (7)

where W ∈ R^{n×m} is a weighting matrix and Z ∈ C^{m×n} is the matrix whose (complex-valued) SVD is looked for, under the hypothesis that m ≥ n. The dynamical system, derived as a Riemannian gradient flow (see Appendix A.2) on O(m, C) × O(n, C), reads:

    \dot{A} = A(W^*B^*Z^*A - A^*ZBW) ,    A(0) = A_0 ,
    \dot{B} = B(WA^*ZB - B^*Z^*AW^*) ,    B(0) = B_0 .        (8)

By construction it holds that A(t) ∈ O(m, C) as well as B(t) ∈ O(n, C).

In the particular case where W = -I_{n,m} and the involved quantities are real-valued, the system (8) recasts into:

    \dot{A} = ZBI_{n,m} - AI_{m,n}B'Z'A ,
    \dot{B} = Z'AI_{m,n} - B(AI_{m,n})'ZB .        (9)

Such a simplified system is equivalent to WH2 when p = n. In order to prove this statement, it is first worth noting from the second equation of the system (9) that the last m - n columns of A do not influence the dynamics of B; thus, it is worth defining the reduced-size matrix A_n := AI_{m,n} and noting that \dot{A}_n = \dot{A}I_{m,n} = ZBI_{n,m}I_{m,n} - (AI_{m,n})B'Z'(AI_{m,n}); thanks to the hypothesis m ≥ n, it is directly verified that I_{n,m}I_{m,n} = I_n, therefore the system (9) recasts into:

    \dot{A}_n = ZB - A_nB'Z'A_n ,
    \dot{B} = Z'A_n - BA_n'ZB ,

whereby the equivalence with the algorithm WH2 when p = n.
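The equivalence just established can be checked numerically. The sketch below (not from the paper; the random data, the QR-based sampling of orthogonal matrices and the helper names hm_field and wh2_field are illustrative assumptions) evaluates the real-valued right-hand side of (8) with W = -I_{n,m} and compares its reduced part with the WH2 field:

    import numpy as np

    def hm_field(A, B, Z, W):
        """Real-valued right-hand side of (8): dA = A(W'H' - HW), dB = B(WH - H'W'), H = A'ZB."""
        H = A.T @ Z @ B
        return A @ (W.T @ H.T - H @ W), B @ (W @ H - H.T @ W.T)

    def wh2_field(A, B, Z):
        """Right-hand side of the WH2 flow (1)."""
        return Z @ B - A @ B.T @ Z.T @ A, Z.T @ A - B @ A.T @ Z @ B

    rng = np.random.default_rng(3)
    m, n = 8, 3
    Z = rng.standard_normal((m, n))
    A = np.linalg.qr(rng.standard_normal((m, m)))[0]   # A in O(m, R)
    B = np.linalg.qr(rng.standard_normal((n, n)))[0]   # B in O(n, R)

    W = -np.eye(n, m)                                  # W = -I_{n,m}
    dA, dB = hm_field(A, B, Z, W)
    An = A[:, :n]                                      # reduced matrix A_n = A I_{m,n}
    dAn_wh2, dB_wh2 = wh2_field(An, B, Z)
    print(np.linalg.norm(dA[:, :n] - dAn_wh2))         # ~1e-15
    print(np.linalg.norm(dB - dB_wh2))                 # ~1e-15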

The above analysis shows that the Weingessel–Hornik SVD learning equations may be regarded as special cases of the Helmke–Moore system; in particular, this is an indirect proof that choosing the elements of the weighting kernel W as a pseudo-identity makes the SVD algorithm an SVD-subspace rule. A consequence of these findings is that the properties of the above-mentioned learning rules may be given a unified investigation, which is the subject of the following section.

As a useful side-note, it is worth mentioning the opportunity of modifying the HM system (8) when the ratio m/n is much larger than 1: in this case, from a numerical point of view, it is convenient to compute the thin SVD of the matrix Z instead of the regular SVD.^{15} Under the hypothesis that Z ∈ C^{m×n} and that n coincides with the rank of Z, the thin SVD of Z is defined as the triple (U_n, D_n, V) such that U_n ∈ St(m, n, C), D_n ∈ R^{n×n} diagonal, V ∈ O(n, C) and Z = U_nD_nV^*. In this case, the HM system (8) easily recasts into a more compact form as:

    H = A^*ZB ∈ C^{n×n} ,
    \dot{A} = A(W^*H^* - HW) ,    A(0) = A_0 ∈ St(m, n, C) ,
    \dot{B} = B(WH - H^*W^*) ,    B(0) = B_0 ∈ O(n, C) ,        (10)

with W ∈ R^{n×n} diagonal. In this case, the neural-network state-matrix A evolves on the Stiefel manifold of dimension m × n, thus its numerical representation is more advantageous under the considered hypothesis m ≫ n.

3. Theoretical Considerations

This section is dedicated to the statement and proof of some theoretical results about the behavior of the Weingessel–Hornik and Helmke–Moore learning systems.

3.1. Derivation of WH equations

As mentioned, the WH2 equations arise as a special case of the HM equations, therefore the derivation of the WH2 equations is implicitly considered in Sec. 3.2.

Weingessel and Hornik derived the WH3 learning rule from Oja's principal subspace equation,^{26} which, for a (m + n) × p network with connection matrix M, writes:

    \dot{M} = (I_{m+n} - MM')CM ,        (11)

where C is a (m + n) × (m + n) covariance matrix. By relating the (m × n) covariance Z, for which an SVD subspace is sought, with C and by effecting a proper block-decomposition of the state-matrix M into A and B, the WH3 learning rule is easily obtained from Oja's subspace rule, as shown in the following result.

Theorem 5
(Ref. 33.) Let us define C = \begin{bmatrix} 0_m & Z \\ Z' & 0_n \end{bmatrix} and M = \begin{bmatrix} A \\ B \end{bmatrix}. Then Eq. (11) is equivalent to the system (4).

Proof
By replacing the expressions for C and M into Eq. (11) we obtain:

    \begin{bmatrix} \dot{A} \\ \dot{B} \end{bmatrix}
      = \left( \begin{bmatrix} I_m & 0_{m,n} \\ 0_{n,m} & I_n \end{bmatrix}
        - \begin{bmatrix} A \\ B \end{bmatrix} [A' \; B'] \right)
        \begin{bmatrix} 0_m & Z \\ Z' & 0_n \end{bmatrix}
        \begin{bmatrix} A \\ B \end{bmatrix}
      = \begin{bmatrix} I_m - AA' & -AB' \\ -BA' & I_n - BB' \end{bmatrix}
        \begin{bmatrix} ZB \\ Z'A \end{bmatrix}
      = \begin{bmatrix} (I_m - AA')ZB - AB'Z'A \\ -BA'ZB + (I_n - BB')Z'A \end{bmatrix} .        (12)

By separating the two differential equations and by properly regrouping the terms on the right-hand sides, the WH3 learning system reported in this paper is readily obtained. ∎
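Theorem 5 lends itself to a direct numerical check. The following sketch (random data assumed; the helper names oja_field and wh3_field are ours) builds the block matrices C and M and verifies that Oja's rule (11) applied to the stacked state reproduces the WH3 field (4) exactly:

    import numpy as np

    def oja_field(M, C):
        """Oja's subspace rule (11): dM = (I - MM')CM."""
        k = M.shape[0]
        return (np.eye(k) - M @ M.T) @ C @ M

    def wh3_field(A, B, Z):
        """WH3 rule (4): dA = ZB - A(A'ZB + B'Z'A), dB = Z'A - B(A'ZB + B'Z'A)."""
        S = A.T @ Z @ B + B.T @ Z.T @ A
        return Z @ B - A @ S, Z.T @ A - B @ S

    rng = np.random.default_rng(4)
    m, n, p = 8, 6, 3
    Z = rng.standard_normal((m, n))
    A = rng.standard_normal((m, p)); B = rng.standard_normal((n, p))

    C = np.block([[np.zeros((m, m)), Z], [Z.T, np.zeros((n, n))]])
    M = np.vstack([A, B])
    dM = oja_field(M, C)
    dA, dB = wh3_field(A, B, Z)
    print(np.linalg.norm(dM[:m] - dA), np.linalg.norm(dM[m:] - dB))   # both ~1e-15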

It is interesting to observe that the WH3 system inherits the known and noticeable properties of Oja's subspace equations, such as the Riccati structure of the projector MM'. Namely, by defining:

    P := \begin{bmatrix} AA' & AB' \\ BA' & BB' \end{bmatrix} ,        (13)

it is easy to show that P satisfies the differential equation:

    \dot{P} = CP + PC - 2PCP ,        (14)

which is a special kind of Riccati differential equation.^{29}

Another interesting observation is that Oja's criterion, which leads to the associated subspace rule, induces a criterion on the pair (A, B) that is a special case of the HM criterion. To show this implication, it is worth recalling the following:

Lemma 1
(Ref. 26.) Oja's subspace rule (11) arises from the optimization of the criterion tr[M'CM] under the constraint M ∈ St(m + n, p, R).

Having recalled this basic fact, we can state the mentioned equivalence result:

Theorem 6
Oja's criterion for the block-pair (A, B) is identical to the HM criterion for the real-valued case when W = -I_{n,m}.

Proof
By invoking again the block-partition of Theorem 5, we have:

    tr(M'CM) = tr\!\left( [A' \; B'] \begin{bmatrix} 0_m & Z \\ Z' & 0_n \end{bmatrix} \begin{bmatrix} A \\ B \end{bmatrix} \right)
             = tr(A'ZB + B'Z'A) .        (15)

It follows that tr(M'CM) = 2 tr(A'ZB), which proves the claim. ∎

3.2. Derivation of HM equations on Stiefel manifold with Killing metric

The rationale of the Helmke–Moore criterion Φ derives from the basic observation that the aim of the SVD is to diagonalize A'ZB, which in a signal-processing perspective means minimizing the covariance values among the signals of which Z is the cross-covariance matrix. Having fixed an arbitrary diagonal matrix H_0, this result may be achieved by minimizing, under proper constraints, the "non-diagonality" measure ||A'ZB - H_0||²_F. However, the following identity holds:^{18}

    ||A'ZB - H_0||²_F = tr(ZZ') + ||H_0||²_F - Φ_{H_0}(A, B) ,

thus minimizing the non-diagonality measure is equivalent to maximizing the function (7). It is understood that the optimization process should be performed over O(m, R) × O(n, R). In the real-valued case under consideration here, also the well-known (weighted) Rayleigh quotient (RQ) may be invoked, which is defined as:

    R_W(A, B) := tr(A'ZBW) / (||A||_F ||B||_F) .        (16)

As long as A and B belong to the orthogonal groups, which implies ||A||²_F = m and ||B||²_F = n, the identity 2\sqrt{mn} R_W(A, B) = Φ_W(A, B) holds.

In any case, the quantity tr(A'ZBW) is a starting point for developing a suitable SVD learning theory, generating HM-type and WH2-type differential equation systems.

In order to derive the Riemannian-gradient flows HM or WH2, both Ref. 32 and Ref. 18 used the technique proposed by Brockett,^{3} which involves the first-order expansion of the dynamics of A(t) and B(t) in series of skew-symmetric matrices. Here we aim at re-deriving the HM equations (for the real-valued orthogonal group) in a different way, following the more straightforward Riemannian-gradient approach suggested by Amari,^{1,4} based on the geometry of the Stiefel manifolds.
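As a small numerical illustration of the relation between the Rayleigh quotient (16) and the criterion (7) on the orthogonal group (random data assumed, not taken from the paper):

    import numpy as np

    rng = np.random.default_rng(5)
    m, n = 8, 3
    Z = rng.standard_normal((m, n))
    A = np.linalg.qr(rng.standard_normal((m, m)))[0]    # A in O(m, R)
    B = np.linalg.qr(rng.standard_normal((n, n)))[0]    # B in O(n, R)
    W = rng.standard_normal((n, m))                     # weighting matrix

    Phi = 2.0 * np.trace(W @ A.T @ Z @ B)               # criterion (7), real-valued case
    RQ = np.trace(A.T @ Z @ B @ W) / (np.linalg.norm(A) * np.linalg.norm(B))   # (16)
    print(Phi - 2.0 * np.sqrt(m * n) * RQ)              # ~1e-15: 2*sqrt(mn)*R_W = Phi_W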

Theorem 7
Let us define H := A'ZB. The gradient-based maximization of the objective function Φ_r(H) := tr(WH), where A ∈ O(m, R), B ∈ O(n, R), W ∈ R^{n×m} and Z ∈ R^{m×n}, with the spaces endowed with the Killing metric, gives rise to the following dynamical system:

    \dot{A} = -ZBW + AW'B'Z'A ,
    \dot{B} = -Z'AW' + BWA'ZB .        (17)

Proof
The learning criterion function is Φ_r = tr(WA'ZB). A perturbation (dA, dB) of the neural network state (A, B) causes a change dΦ_r. In particular, up to first order:

    Φ_r(H + dH) = tr(W(A + dA)'Z(B + dB)) = Φ_r(H) + tr(W dA'ZB) + tr(WA'Z dB) ,

therefore, by exploiting the properties of the elements of the orthogonal groups and of the trace operator, we have:

    dΦ_r(H) = tr(W dA'(AA')ZB) + tr(WA'Z(BB')dB)
            = tr(W dA'AH) + tr(WHB'dB)
            = -tr(HWA'dA) + tr(WHB'dB) .        (18)

It is now useful to introduce the differentials dX := A'dA and dY := B'dB, which form a basis of the tangent space to O(m, R) at A and to O(n, R) at B. The tangent spaces are linear spaces and, moreover, they are the sets of proper-size skew-symmetric matrices; in fact, we have dX' = -dX and dY' = -dY. In view of optimization, these properties must be preserved. A way to preserve the structure of the tangent space is to note that:

    dΦ_r(H) = -tr(HW dX) + tr(WH dY)
            = tr(HW dX') - tr(WH dY')
            = tr(dX'HW) - tr(dY'WH)
            = tr(W'H' dX) - tr(H'W' dY) .        (19)

By summing Eqs. (18) and (19) hand-by-hand we ultimately obtain:

    2dΦ_r(H) = tr({W, H'} dX) + tr({W', H} dY) .        (20)

As our aim is to find directions dA and dB that point toward the maximum of the function Φ_r, we now search for the steepest-ascent directions ∆X and ∆Y which, by definition, are the variations that maximize the change ∆Φ_r under finite-step-length constraints, namely ||∆X||² = ε_x² > 0 and ||∆Y||² = ε_y² > 0. To this aim, we need to specify the norm ||·||; in this case we use the standard Euclidean metric on the tangent space, that is, the Killing metric (see Appendix A.3), with which the constraints rewrite as tr(∆X'∆X) = ε_x² and tr(∆Y'∆Y) = ε_y². In order to enforce the mentioned constraints, the standard Lagrange-multipliers method may be employed, which consists in the definition of the Lagrangean function:

    L := tr({W, H'}∆X) + tr({W', H}∆Y) + λ_x(tr(∆X'∆X) - ε_x²) + λ_y(tr(∆Y'∆Y) - ε_y²) ,

whose free extremes may be looked for. They are found by:

    ∂L/∂∆X = {W, H'}' + 2λ_x∆X = 0 ,
    ∂L/∂∆Y = {W', H}' + 2λ_y∆Y = 0 .

Since {W, H'} and {W', H} are skew-symmetric, the steepest-ascent variations express as dX ∝ {W, H'} and dY ∝ {W', H}. Coming back to the original variables in the orthogonal groups we have \dot{A} = A{W, H'} and \dot{B} = B{W', H}, which proves the claim. ∎

The learning system shown coincides with the HM system in the real-valued case and also explains the WH2 learning theory.
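The tangency property underlying the derivation can be verified numerically. The sketch below (illustrative assumptions: random data and the variable names used) checks that the factor Ω_A = {W, H'} multiplying A in (17) is skew-symmetric, so that d(A'A)/dt = 0 along the flow:

    import numpy as np

    rng = np.random.default_rng(6)
    m, n = 8, 3
    Z = rng.standard_normal((m, n))
    W = rng.standard_normal((n, m))
    A = np.linalg.qr(rng.standard_normal((m, m)))[0]
    B = np.linalg.qr(rng.standard_normal((n, n)))[0]

    H = A.T @ Z @ B
    Omega_A = W.T @ H.T - H @ W          # {W, H'} = W'H' - HW
    Omega_B = W @ H - H.T @ W.T          # {W', H} = WH - H'W'
    dA, dB = A @ Omega_A, B @ Omega_B    # flow (17)

    print(np.linalg.norm(Omega_A + Omega_A.T))   # ~1e-15: A'dA is skew-symmetric
    print(np.linalg.norm(dA.T @ A + A.T @ dA))   # ~1e-15: d(A'A)/dt = 0, so A stays in O(m, R)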

3.3. Stability analysis of HM equation on orthogonal group via Lyapunov function

One of the main theoretical advantages of the learning systems on Stiefel manifolds is their inherent stability,^{9} due to the compactness of these sets. In the present case, the convergence of the HM system may be proven by showing that Φ_W(A, B) is a Lyapunov-type function for the HM system.

More formally, let us denote by Φ_min the minimal value of the function in O(m, R) × O(n, R); note that it exists since Φ_W is a continuous function defined on a compact manifold.

Theorem 8
Let us define the (lifted-criterion) time-function:

    Ψ(t) := tr(A'(t)ZB(t)W) - Φ_min .        (21)

It is a Lyapunov function for the system (17).

Proof
By construction Ψ(t) ≥ 0. Also, by letting H(t) := A'(t)ZB(t) and by using Eqs. (17) it is found:

    \dot{Ψ} = tr(\dot{A}'ZBW + A'Z\dot{B}W)
            = -tr(W'H'HW + WHH'W' - 2HWHW) .        (22)

Some mathematical work shows that the following identities hold true:

    -||{W, H'}||²_F = tr(-2W'H'HW + 2HWHW) ,
    -||{W', H}||²_F = tr(-2WHH'W' + 2HWHW) .

In virtue of these results, we may rewrite the right-hand side of expression (22) as follows:

    \dot{Ψ}(t) = -(1/2)( ||{W, H'(t)}||²_F + ||{W', H(t)}||²_F ) ≤ 0 .        (23)

Such inequality proves the claim. ∎

The structure of the steady-state solutions of the HM system has been studied in detail by Helmke and Moore and has been reported in Ref. 18. The convergence property of the HM system towards such steady states now follows immediately.

Corollary 2
The HM learning system (17) converges asymptotically.

Proof
The existence of a Lyapunov function for the dynamical system under analysis, proven by Theorem 8, guarantees the asymptotic stability of its equilibria.^{16} ∎

4. Numerical Experiments

Some numerical experiments are described and commented in the following sections. They help in assessing the qualitative behavior of the discussed learning equations.

It is worth noting that the discrete-time implementations used to numerically solve the ODEs associated to the learning systems introduce some deviations with respect to the theoretical findings: whereas the continuous-time versions of the learning algorithms leave the double Stiefel manifold and the orthogonal group invariant, this is not necessarily true for their discrete-time counterparts. These aspects deserve a separate treatment and are addressed in the last section.

4.1. Numerical experiments on SVD-subspace extraction by WH2-3-4 algorithms

We performed some experiments with the Weingessel–Hornik algorithms, in order to numerically evaluate their behavior.

Let us denote again by Z the m × n matrix whose SVD-subspace of dimension p is looked for, and let us denote by (U, D, V) the matrices of left-singular vectors, singular values and right-singular vectors. The extraction of a p-dimensional SVD subspace implies that the columns of the matrices A and B in the WH algorithms should span, after convergence, the same subspaces spanned by the first p columns of U and V, respectively, under the hypothesis that the singular values are decreasingly ordered. Let us denote by U_p and V_p the sub-matrices of U and V containing their first p columns: a proper measure of the SVD-subspace extraction ability of the network is the SVD-subspace disparity error pair, defined in Ref. 32 as:

    ε(A) := ||U_pU_p'A - A||_2 / ||A||_2 ,
    ε(B) := ||V_pV_p'B - B||_2 / ||B||_2 .        (24)

As a numeric problem we considered the case m = 8, n = 6 and p = 3. The numerical results have been obtained by randomly picking a matrix Z with normal Gaussian entries and by computing the disparity errors ε(A) and ε(B) at each iteration, for a total of 8,000 iterations. This experiment is repeated over 100 independent trials: the average learning curves are presented. Also, if we denote by ε(A_⋆) and ε(B_⋆) the final values of the errors at the end of the iterations (which measure the learnt-network performance), the statistical distributions of these quantities over the 100 trials are estimated.
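For reference, a minimal sketch of the disparity-error pair (24) in NumPy is given below (the helper name disparity_errors is ours; ||·||_2 is read as the matrix 2-norm, and NumPy's SVD returns the singular values in decreasing order, as assumed in the text):

    import numpy as np

    def disparity_errors(A, B, Z, p):
        """SVD-subspace disparity errors (24)."""
        U, _, Vt = np.linalg.svd(Z)
        Up, Vp = U[:, :p], Vt.T[:, :p]
        eA = np.linalg.norm(Up @ Up.T @ A - A, 2) / np.linalg.norm(A, 2)
        eB = np.linalg.norm(Vp @ Vp.T @ B - B, 2) / np.linalg.norm(B, 2)
        return eA, eB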

Fig. 1. Values of average disparity errors and estimation of disparity errors distribution after learning for the WH2 algorithm.

Fig. 2. Values of average disparity errors and estimation of disparity errors after learning for the WH3 algorithm.

For the WH2 algorithm the initial values of the connection matrices were A_0 = I_{m,p} and B_0 = I_{n,p} and the value of the learning stepsize was η = 0.005. The results of the numerical experiments are illustrated in Fig. 1. The average behavior of the algorithm is very good. No instabilities were observed and in the largest part of the trials the value of the error at the end of learning is very low: 97% of the trials gave an error of less than -50 dB both for the U_p-subspace and the V_p-subspace. We also checked the orthonormality of the solutions A_⋆ and B_⋆ and found that they are orthonormal, with respect to the measures ||A_⋆'A_⋆ - I_p||_F and ||B_⋆'B_⋆ - I_p||_F, up to order 10^{-15}.

For the WH3 algorithm the initial values of the connection matrices were A_0 = I_{m,p}/\sqrt{2} and B_0 = I_{n,p}/\sqrt{2} and the value of the learning stepsize was η = 0.005. The results of the numerical experiments are shown in Fig. 2. Again, the average behavior of the algorithm is very good and no instabilities were observed. It is interesting to inspect the numerical results for a single-case trial. We consider the random matrix (only four decimal digits):

    Z =
      1.4283 0.8074 1.1283 1.0507 0.6424 2.1022  − −
      1.6218 0.5095 0.1493 0.3916 0.8243 0.5004  − −
      1.4791 0.1503 0.3535 0.7533 1.6819 1.1130  − −
      0.9191 0.3345 0.3409 2.8772 0.5035 1.3955  − − − −
      0.4135 1.4989 0.0959 0.1144 0.1325 0.6744  − − − −
      0.4591 0.4383 2.0674 1.0988 0.4373 1.1486  − − −
      1.4654 0.9846 0.1393 2.9265 0.3399 0.8313  − − − −
      0.5261 0.2351 0.5403 0.1352 1.1526 0.4636  − − −

For this covariance matrix the algorithm has computed the following left singular vectors (only four decimal digits):

    A_⋆ =
      0.4385 0.4101 0.1641
      0.0668 0.3479 0.1297  −
      0.2208 0.2714 0.2575  −
      0.2818 0.0320 0.3900  − −
      0.0350 0.1621 0.0019  − −
      0.0955 0.0248 0.4195  − − −
      0.3997 0.3052 0.1590  −
      0.0695 0.1261 0.1906  − −

and right singular vectors (only four decimal digits):

    B_⋆ =
      0.0196 0.5579 0.2052  − −
      0.1994 0.1046 0.1263  − −
      0.2354 0.0735 0.3370
      0.5947 0.0994 0.3394  − −
      0.0388 0.2870 0.1789  − −
      0.2219 0.2832 0.4256

In this experiment the subspace disparity errors were about -270 dB.

For the WH4 algorithm the initial values of the connection matrices were A_0 = I_{m,p} and B_0 = I_{n,p} and the value of the learning stepsize was η = 0.006. The results of the numerical experiments are illustrated in Fig. 3. The results are as expected. We again checked the orthonormality of the solutions A_⋆ and B_⋆ and found that they are orthonormal up to order 10^{-15}. We also checked the "symmetrization" property of the matrix product A'ZB and found that, at the end of learning, it was (only four decimal digits):

    A_⋆'ZB_⋆ =
      3.1045 0.3523 0.0450  −
      0.3523 3.2290 0.0278  − −
      0.0450 0.0278 3.0668  − −

This result comes from a single trial: as expected from the theory (see Theorem 4), the form A'ZB is symmetric but not diagonal, confirming the analytical finding that the pair (A, B) is just a basis of the SVD subspace, not a proper singular-value decomposition.

Fig. 3. Values of average disparity errors and estimation of disparity errors after learning for the WH4 algorithm.

4.2. Numerical experiments on SVD extraction by HM algorithm

We performed some experiments with the Helmke–Moore algorithm, in order to evaluate its behavior numerically.

Again Z denotes the m × n matrix whose SVD is sought for and (U, D, V) denote the matrices of left-singular vectors, singular values and right-singular vectors. As in the theoretical sections, we consider m ≥ n. As indicators of the behavior of the algorithm, we consider the following measures.

First, if A_n, B_n, U_n and V_n denote the sub-matrices formed by the first n columns of the SVD and network matrices, it is known that the columns of A_n should tend to the columns of U_n, while the columns of B_n should tend to the columns of V_n, ordered in the same way but with a possible sign switch for every column; therefore, a proper measure of (A, B) convergence is:

    ε(A_p) := || |U_p| - |A_p| ||_F ,
    ε(B_p) := || |V_p| - |B_p| ||_F ,        (25)

where |X| stands for component-wise absolute-value extraction.

Second, it is interesting to inspect the value of the criterion function Φ(A, B) = 2 tr(WA'ZB) during learning and to compare its asymptotic value with the optimum Φ_⋆ = 2 tr(WU'ZV).

Third, we know that the HM learning principle is designed to diagonalize the product matrix A'ZB, therefore an interesting error measure is the norm of the off-diagonal part of that matrix; namely, we may define a corresponding index as:

    δ(A, B) := ||offdiag(WA'ZB)||_F ,        (26)

with clear meaning of the symbols.

Fourth, it is also extremely interesting to measure the deviation from orthonormality of the connection matrices. This may be achieved with the help of the indices:

    n(A) := ||A'A - I_m||_F ,
    n(B) := ||B'B - I_n||_F .        (27)

As a numeric problem, we considered the case m = 8, n = 3. The numerical results have been obtained by randomly generating a matrix Z and by computing the above-described indices at each iteration, for a total of 8,000 iterations.
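A compact sketch of the indices (25)-(27) in NumPy is given below (illustrative only; the helper name hm_indices is ours, and the weighting matrix W is assumed to be of size n × m as in Sec. 3):

    import numpy as np

    def hm_indices(A, B, Z, W, p):
        """Performance indices (25)-(27) for the HM algorithm."""
        U, _, Vt = np.linalg.svd(Z)
        V = Vt.T
        eA = np.linalg.norm(np.abs(U[:, :p]) - np.abs(A[:, :p]))       # (25)
        eB = np.linalg.norm(np.abs(V[:, :p]) - np.abs(B[:, :p]))
        M = W @ A.T @ Z @ B                                            # (26): off-diagonal norm
        delta = np.linalg.norm(M - np.diag(np.diag(M)))
        nA = np.linalg.norm(A.T @ A - np.eye(A.shape[1]))              # (27)
        nB = np.linalg.norm(B.T @ B - np.eye(B.shape[1]))
        return eA, eB, delta, nA, nB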

Fig. 4. Values of performance indices for the HM algorithm averaged over 100 independent trials. (Panels: estimation errors; learning criterion; deviation from normality; diagonalization. Solid: index of matrix A; dashed: index of matrix B.)

This experiment was repeated over 100 independent trials in order to show average learning curves. It is important to note that, in order to compare the values of the function Φ over different trials, both the initial states and the singular values of Z must be kept constant. So we first generated randomly a diagonal matrix D with n non-null entries on the diagonal, kept constant over the whole trial set, and then for each trial a pair of orthogonal matrices (U, V) of proper size was randomly generated; Z was then computed as UDV'.

The results of the experiments are illustrated in Fig. 4. They pertain to the following parameter values: η = 0.001, A_0 = I_m and B_0 = -I_n; also, the top-left diagonal part of the weighting matrix W is diag(3, 2, 1). As is readily seen, the numerical results are very good.

As a single-trial result, it is interesting to inspect the singular-value estimation ability of the algorithm: in one experiment the singular values were σ_1 = 5.2506, σ_2 = 3.5342 and σ_3 = 1.6557, while the diagonal part of A'ZB was diag(5.2941, 3.5446, 1.6587); this shows that the estimation ability of the HM algorithm is quite good.

The last experiment concerns the numerical analysis of stability: in this case, the matrix Z suddenly changes in the middle of the iteration (except for the singular values, which are kept constant). The results of 100 independent trials are illustrated in Fig. 5. They pertain to the same set of parameter values as the preceding experiment. When the matrix Z changes, the performance indices present a peak but they rapidly return to satisfactory asymptotic values.
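The data-generation procedure described above can be sketched as follows (the helper name make_test_matrix is ours, and QR factorization of Gaussian matrices is an assumed way of drawing the random orthogonal factors, since the paper does not specify the sampler; the prescribed values mirror the reported single trial):

    import numpy as np

    def make_test_matrix(m, n, singular_values, rng):
        """Generate Z = U D V' with prescribed singular values and random orthogonal factors."""
        U = np.linalg.qr(rng.standard_normal((m, m)))[0]
        V = np.linalg.qr(rng.standard_normal((n, n)))[0]
        D = np.zeros((m, n))
        D[:n, :n] = np.diag(singular_values)
        return U @ D @ V.T

    rng = np.random.default_rng(7)
    Z = make_test_matrix(8, 3, [5.2506, 3.5342, 1.6557], rng)
    print(np.linalg.svd(Z, compute_uv=False))   # recovers the prescribed singular values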

Fig. 5. Values of performance indices for the HM algorithm averaged over 100 independent trials when the right and left singular vectors change during iteration. (Panels: learning criterion Φ(A, B); diagonalization index δ(A, B). Solid: index of matrix A; dashed: index of matrix B.)

4.3. Discussion on learning equations implementation

In this section we try to put the numerical simulation results into the right picture by briefly expanding on the longer discussion of this topic that has already appeared in recent contributions.^{12,13}

In general, there are several possibilities for implementing a learning algorithm. In this paper, the results are obtained for the continuous-time case, but the simulations are performed for the off-line discrete-time version, related to using a simple Euler approximation for solving the associated ODEs. This opens room for at least one question: how does the used integration algorithm affect the relevance of the obtained numerical results?

We believe a simple yet convincing answer comes from a consideration about the integration time: we classify a learning process into short-integration-time learning, which requires few adaptation steps to get a satisfactory connection pattern, and long-integration-time learning, which involves the solution of the differential system over a long time-interval to obtain a satisfactory result or to tackle a non-stationary signal processing problem, for instance. The quality of the solution of the continuous-time learning equations may be heavily affected by the selected integration scheme only in long-integration-time learning processes: in this case it is not normally possible to select small learning stepsizes, because this would cause an excessive computational burden, thus the learning equations are sampled with relatively large step-sizes; in this case, however, the Euler method may fail in finding accurate state-space trajectories or in fulfilling the constraints (such as orthonormality, in the present case), and more complicated integration techniques should be used, such as the ones relying on the Lie–Euler method or on second-order methods such as the Lie–Runge–Kutta technique.^{12}

In the present case, we clearly dealt with short-integration-time learning processes that do not need such complicated integration schemes, as the small values of the chosen learning stepsizes ensure good convergence in a reasonably small number of iterations. This claim is confirmed, e.g., by the values of the orthogonality measure reported for every simulation, which show that the degree of adherence to the invariant is excellent (up to order 10^{-15}). This consideration suggests that the results of the simulations for the discrete-time case are a legitimate means of illustrating the theoretical results obtained for the ODEs.
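To illustrate the point just made, one can instrument a plain Euler iteration and track the orthonormality measure n(A) = ||A'A - I_p||_F along the run. The sketch below (our own illustration, using the WH2 rule with the Sec. 4.1 parameters and an assumed random matrix) simply records the deviation from the invariant; the paper reports deviations of order 10^{-15} at convergence for this setting.

    import numpy as np

    rng = np.random.default_rng(8)
    m, n, p, eta = 8, 6, 3, 0.005
    Z = rng.standard_normal((m, n))
    A, B = np.eye(m, p), np.eye(n, p)

    drift = []
    for _ in range(8000):
        dA = Z @ B - A @ B.T @ Z.T @ A          # WH2 field, plain Euler step
        dB = Z.T @ A - B @ A.T @ Z @ B
        A, B = A + eta * dA, B + eta * dB
        drift.append(np.linalg.norm(A.T @ A - np.eye(p)))
    # Final deviation from orthonormality of the discrete-time trajectory
    print(drift[-1])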

5. Conclusions

The aim of this paper was to present a unifying view of closely-related parallel SVD-computation algorithms by neural networks learning on the double Stiefel manifold. After showing a new derivation of the HM theory, based on the Riemannian gradient on the double orthogonal group endowed with the Killing metric, some important properties of the algorithms have been investigated. Particularly, a suitable Lyapunov stability criterion has been constructed to prove asymptotic convergence.

Some numerical experiments also helped in assessing the qualitative behavior of the discussed learning equations.

Acknowledgments

I am definitely indebted to the anonymous Reviewer, whose careful comments and detailed suggestions greatly helped to improve the quality and the clarity of the manuscript.

A. Appendix

The aim of the present appendix is to recall some definitions from differential geometry and to justify some notational choices within the paper.

A.1. Double Stiefel-manifold and orthogonal-group

Let us recall the definition of group and argument on the group-structure of the Cartesian product of two orthogonal groups.

Definition 1
(Ref. 19.) A group is a structure (G, ⋆) formed by a set G and an operation ⋆ that associates to every pair (a, b) ∈ G × G a unique element a ⋆ b. The operation satisfies three axioms: (1 - Associativity) (a ⋆ b) ⋆ c = a ⋆ (b ⋆ c); (2 - Existence of a neutral element) there exists some e ∈ G such that e ⋆ a = a for all a ∈ G; (3 - Existence of the inverse) for all a ∈ G there exists an element a^{-1} ∈ G such that a^{-1} ⋆ a = e.

Within the paper the matrix set O(n, K) has been invoked frequently. It is easy to show that it is a group. In fact, by identifying the operation ⋆ with the standard matrix product, it is easily verified that if a, b, c ∈ O(n, K) then (ab)c = a(bc), e = I_n, and there exists an element a^{-1} such that a^{-1}a = I_n.

Let us now consider the Cartesian product G = O(n, K) × O(m, K) and let us show that it is actually a group (of dimension m + n). It is worth considering the block-diagonal representation a = diag(A_1, A_2), with A_1^*A_1 = I_n and A_2^*A_2 = I_m, and again the identification of ⋆ with the matrix product. Then the associativity property holds, and it is readily verified that a^*a = I_{n+m} and a^{-1} = diag(A_1^{-1}, A_2^{-1}). This proves the claim.

Let us also recall the definition of smooth manifold and argument on the manifold-structure of the Cartesian product of two of such geometric entities.

Definition 2
(Ref. 19.) Let M be a topological space. A chart of M is a triple (U, φ, n) consisting of an open subset U of M, a homeomorphism φ of U onto R^n and the dimension n of the chart.

Two charts (U, φ, n) and (V, ψ, m) of M are termed C^∞ compatible charts if either U ∩ V is empty or φ(U ∩ V) and ψ(U ∩ V) are open sets and ψ∘φ^{-1} as well as φ∘ψ^{-1} are C^∞ maps.

A C^∞ atlas of M is a set A = {(U_i, φ_i, n_i) | i ∈ I} of C^∞ compatible charts such that M is completely covered by the union of the U_i's. An atlas A is maximal if every chart of M which is C^∞ compatible with every chart of A also belongs to A.

A smooth manifold M is a topological Hausdorff space endowed with a countable basis and equipped with a maximal C^∞ atlas. If all the coordinate charts of M have the same dimension n then the manifold is said to have dimension n.

On the basis of the above-recalled definitions, it is possible to show that the Cartesian product of two smooth manifolds is a manifold itself. The argument follows from the observation that if M and N are two smooth manifolds, then any two charts (U, φ, n) and (V, ψ, m) of M and N define a chart (U × V, φ × ψ, n + m) of M × N. Therefore, M × N has the structure of a smooth manifold of dimension m + n.^{19} The compact Stiefel manifold is a smooth manifold, thus the product space St(m, n, K) × St(p, q, K) is a manifold itself.

A.2. Riemannian gradient flow

As mentioned, the HM system may be derived as a gradient flow on a Riemannian manifold. In order to clarify the relationship with the presented theory, it is useful to recall the definition of Riemannian gradient flow.

Definition 3
(Ref. 19.) Let M be a smooth manifold. A Riemannian metric on M is a family of non-degenerate inner products, which are functions of the point on the manifold, defined on the tangent space at each point of the manifold, such that it depends smoothly on the point. When a Riemannian metric is specified, M is termed a Riemannian manifold.

Let Φ : M → R be a smooth function defined on a Riemannian manifold M. The gradient vector field grad Φ of the function with respect to the selected metric is uniquely characterized by two conditions, referred to as tangency and compatibility.

On the basis of these definitions, it is possible to recall the concept of Riemannian gradient flow on a Riemannian manifold M as \dot{x}(t) = grad Φ(x(t)), with x(t) ∈ M.

With reference to the HM algorithm, it is worth noting that it has been studied over the double orthogonal group. However, the orthogonal group is one of the classical Lie groups, which possess the noticeable property of also being manifolds. Therefore, by the arguments of the above appendix, the double orthogonal group is also a manifold. This gives the connection between the considered criterion function and the gradient-flow structure of the HM equations.

A.3. Killing form and metric

In order to clarify the concepts underlying the invoked Killing metric, it is useful to first recall some notation from Lie algebra theory.

Definition 4
(Refs. 14, 19, 20.) Let us denote with so(n, R) the set of real-valued skew-symmetric matrices of dimension n.

Given a Lie algebra g and two elements of the algebra X and Y, the adjoint operator associated to X as a function of Y is defined as ad_X(Y) = XY - YX. By definition, the adjoint operator is skew-symmetric, thus it belongs to a Lie algebra so. The Killing form is an inner product on a finite-dimensional Lie algebra defined by K(X, Y) = tr[ad_X ad_Y].

It is known that an inner product defines a metric, thus the Killing form defines the Killing metric K(X, X) on a Lie algebra. In the paper we used the standard Euclidean metric on the Lie algebra so(n, R), namely tr[X'X], because it is a linear space. This expression has the same structure as a Killing form because the elements of so are skew-symmetric matrices, thus we may identify the Euclidean metric on so with the Killing metric on the same space (up to an inessential constant).

References

1. S.-I. Amari 1998, "Natural gradient works efficiently in learning," Neural Computation 10, 251–276.
2. H. Bourlard and Y. Kamp 1988, "Auto-association by multilayer perceptrons and singular value decomposition," Biological Cybernetics 59, 291–294.
3. R. W. Brockett 1991, "Dynamical systems that sort lists, diagonalize matrices and solve linear programming problems," Linear Algebra and Its Applications 146, 79–91.
4. T.-P. Chen, S.-I. Amari and Q. Lin 1998, "A unified algorithm for principal and minor component extraction," Neural Networks 11, 385–390.
5. A. Cichocki and R. Unbehauen 1992, "Neural networks for computing eigenvalues and eigenvectors," Biological Cybernetics 68, 155–164.
6. S. Costa and S. Fiori 2001, "Image compression using principal component neural networks," Image and Vision Computing Journal (special issue on "Artificial Neural Networks for Image Analysis and Computer Vision") 19(9–10), 649–668.
7. E. F. Deprette (ed.) 1988, SVD and Signal Processing (Amsterdam, Elsevier Science).
8. K. I. Diamantaras and S.-Y. Kung 1994, "Cross-correlation neural network models," IEEE Trans. on Signal Processing 42(11), 3218–3223.
9. S. Fiori 2001, "A theory for learning by weight flow on Stiefel-Grassman manifold," Neural Computation 13(7), 1625–1647.
10. S. Fiori 2002, "A theory for learning based on rigid bodies dynamics," IEEE Trans. on Neural Networks 13(3), 521–531.
11. S. Fiori 2002, "Complex-weighted one-unit 'Rigid-Bodies' learning rule for independent component analysis," Neural Processing Letters 15(3), 275–282.
12. S. Fiori 2002, "Unsupervised neural learning on Lie group," International Journal of Neural Systems 12(3 & 4), 219–246.
13. S. Fiori, "A minor subspace algorithm based on neural Stiefel dynamics," International Journal of Neural Systems 12(5), 339–350.
14. W. Fulton and J. Harris 1991, Representation Theory (New York: Springer-Verlag).
15. G. H. Golub and C. F. van Loan 1996, Matrix Computations (The Johns Hopkins University Press, third edition).
16. W. Hahn 1963, Theory and Application of Lyapunov's Direct Method (Englewood Cliffs, New Jersey: Prentice-Hall).

17. S. Haykin 1991, Adaptive Filter Theory (Prentice-Hall).
18. U. Helmke and J. B. Moore 1992, "Singular value decomposition via gradient and self-equivalent flows," Linear Algebra and its Applications 169, 223–248.
19. U. Helmke and J. B. Moore 1993, Optimization and Dynamical Systems (Springer-Verlag, Berlin).
20. N. Jacobson 1979, Lie Algebras (New York: Dover).
21. A. K. Jain 1989, Fundamentals of Digital Image Processing (Englewood Cliffs, NJ: Prentice-Hall).
22. W.-S. Lu, H.-P. Wang and A. Antoniou 1990, "Design of two-dimensional FIR digital filters by using the singular value decomposition," IEEE Trans. on Circuits and Systems CAS-37, 35–46.
23. J. R. Magnus and H. Neudecker 1988, Matrix Differential Calculus With Applications in Statistics and Econometrics (Wiley Series in Probability and Mathematical Statistics, John Wiley & Sons).
24. B. C. Moore 1981, "Principal component analysis in linear systems: Controllability, observability and model reduction," IEEE Trans. on Automatic Control AC-26(1), 17–31.
25. O. Nestares and R. Navarro 2001, "Probabilistic estimation of optical flow in multiple band-pass directional channels," Image and Vision Computing Journal 19(6), 339–351.
26. E. Oja, "Neural networks, principal components and subspaces," International Journal of Neural Systems 1, 61–68.
27. T. D. Sanger 1994, "Two iterative algorithms for computing the singular value decomposition from input/output samples," in J. D. Cowan, G. Tesauro and J. Alspector (eds), Advances in Neural Information Processing Systems 6 (Morgan Kaufmann Publishers Inc.), 144–151.
28. M. Salmeron, J. Ortega, C. G. Puntonet and A. Prieto 2001, "Improved RAN sequential prediction using orthogonal techniques," Neurocomputing 41(1–4), 153–172.
29. T. Sasagawa 1982, "On the finite escape phenomena for matrix Riccati equations," IEEE Trans. on Automatic Control AC-27(4), 977–979.
30. S. T. Smith 1991, "Dynamical systems that perform the SVD," Systems and Control Letters 15, 319–327.
31. R. Vaccaro (ed.) 1991, SVD and Signal Processing II: Algorithms, Analysis and Applications (Amsterdam, Elsevier Science).
32. A. Weingessel and K. Hornik 1997, "SVD algorithms: APEX-like versus subspace methods," Neural Processing Letters 5, 177–184.
33. A. Weingessel 1999, An Analysis of Learning Algorithms in PCA and SVD Neural Networks (Ph.D. Dissertation, Technical University of Wien, Austria).