International Journal of Neural Systems, Vol. 13, No. 2 (2003) 1–16 © World Scientific Publishing Company

SINGULAR VALUE DECOMPOSITION LEARNING ON DOUBLE STIEFEL MANIFOLD

SIMONE FIORI Faculty of Engineering, Perugia University, Loc. Pentima bassa, 21, I-05100 Terni (Italy) [email protected]

Received 25 October 2002 Revised 5 March 2003 Accepted 5 March 2003

The aim of this paper is to present a unifying view of four SVD-neural-computation techniques found in the scientific literature and to present some theoretical results on their behavior. The considered SVD neural algorithms are shown to arise as Riemannian-gradient flows on double Stiefel manifold and their geometric and dynamical properties are investigated with the help of differential geometry.

Keywords: Singular value decomposition; orthogonal group; Stiefel manifold; differential geometry; Lyapunov stability.

1. Introduction

The computation of the singular value decomposition (SVD) of a non-square matrix, also referred to as the Autonne–Eckart–Young decomposition,^{18,30} plays a central role in several signal/data automatic processing tasks. Originally developed in numerical algebra to provide quantitative information about the structure of linear systems of equations, it has found widespread applications, e.g. in signal processing,^{7,8,17,31} pattern recognition and classification,^{26} automatic control,^{24,30} digital circuit design, time-series prediction,^{28} image processing^{6,21,25} and connectionism.^{2}

Recently, some efforts have been devoted to SVD computation by neural networks in the neural community;^{5,27,33} the related learning theories emerge as interesting extensions of the well-known neural principal component/subspace analysis techniques,^{26} long investigated during the last 15 years. Also, new light has recently been shed on adaptive second-order (as well as higher-order) statistical decomposition theories by researchers interested in unsupervised learning by non-gradient techniques: for instance, in Ref. 1 a new technique was introduced to enhance the learning capabilities of linear and MLP-type neural networks by the Riemannian gradient; in Refs. 4 and 13 a theoretical derivation/analysis of new principal/minor subspace rules has been carried out; also, in Refs. 9 and 12 a large class of learning rules for MLP-type neural networks, based on first/second-order non-gradient dynamics and Lie-group flows, has been introduced and discussed by the present Author as a theoretical framework for explaining many learning paradigms that have appeared in the scientific literature, while Refs. 10 and 11 were devoted to a particular algorithm of this class, based on the rational kinematics of rigid bodies and its applications to real- and complex-valued signal processing.

The aim of this paper is to present some theoretical notes on parallel SVD computation by unsupervised non-gradient neural learning, with special reference to learning theories involving weight-flows on the double Stiefel manifold. Parallel techniques are considered in opposition to sequential ones, which employ the deflation method, implemented by laterally connected neural architectures, to discard previously computed vectors from the original data.^{5,8}
In particular, we recall from the scientific literature four neural SVD learning theories that appeared independently; then, as a novel contribution to this field, we present:

• a unifying view of the mentioned theories, showing the main relationships among them;
• a stability analysis based on the Lyapunov criterion, aimed at ensuring the non-divergence of the differential equations governing the learning phases of the SVD neural networks trained via the considered methods;
• a computer-based analysis of the learning differential equations, carried out in order to assess their numerical properties.

Throughout the paper we use the following notation. The symbol I_{m,n} denotes the pseudo-identity matrix of size m × n and I_m = I_{m,m}. The symbol X' denotes the transpose of the matrix X, while X^* denotes Hermitian transposition; tr(X) denotes the trace of the square matrix X, i.e. the sum of its diagonal entries; the trace operator enjoys the properties tr(X') = tr(X) and tr(ABC) = tr(CAB) = tr(BCA). We also define the two matrix operators {X, Y} := X'Y − Y'X and [X, Y] := X'Y + Y'X. The following matrix set (termed Stiefel manifold) is also useful for our expository purposes: St(m, n, K) := {X ∈ K^{m×n} | X^*X = I_n}, with m − 1, n − 1 ∈ N; the field K may be either R or C; when m = n the manifold coincides with the orthogonal group O(m, K) := {X ∈ K^{m×m} | X^*X = I_m}. We refer to the product O(m, K) × O(n, K) as the double orthogonal group and to the product St(m, p, K) × St(n, p, K) as the double Stiefel manifold (some definitions and notes on these geometrical entities are available in Appendix A.1). Also, the Frobenius norm of a matrix X is defined as ||X||_F = sqrt(tr(X^*X)).

2. Four Parallel SVD Learning Algorithms: A Unifying View

Denoting by Z ∈ C^{m×n} the matrix whose SVD is to be computed and by r ≤ min{m, n} the rank of Z, the singular value decomposition writes Z = UDV^*, where U ∈ R^{m×m} and V ∈ R^{n×n} are orthogonal matrices and D is a pseudo-diagonal matrix with all-zero entries except for the first r diagonal entries, termed singular values. It is easily checked that the columns of U coincide with the eigenvectors of ZZ^*, while V contains the eigenvectors of Z^*Z with the same eigenvalues.

Here we consider four parallel SVD learning algorithms, which allow the SVD vectors to be computed simultaneously. The considered neural algorithms have been developed by Weingessel and Hornik^{32} and by Helmke and Moore.^{18} These algorithms are utilized to train, in an unsupervised way, a three-layer neural network with the classical 'butterfly' topology (see e.g. Refs. 8, 32 and 33): the first layer has connection matrix A, the second one has connection matrix B, and the middle (hidden) layer provides the network's output. Properly learnt, the network is able to perform the mentioned signal/data processing tasks, such as noise filtering.^{8}

The aim of this section is to show analytically that the algorithms proposed by Weingessel–Hornik and Helmke–Moore are equivalent to some extent. Also, it is shown that, when proper initial conditions are chosen, the associated learning trajectories lie on the double Stiefel manifold.

2.1. The WH2, WH3 and WH4 neural SVD-subspace dynamical systems

In Ref. 32 some new learning equations have been introduced by Weingessel and Hornik in order to compute the SVD-subspace of a given matrix. Here we investigate three of them, expressed as continuous-time differential equations. The derivations presented below make use of matrix differential calculus: a source reference for this is Ref. 23.

Let us denote by A(t) ∈ R^{m×p} the network-connection matrix-stream that should learn p left singular vectors and by B(t) ∈ R^{n×p} the estimator of p right singular vectors of the SVD of the matrix Z ∈ R^{m×n}, with p ≤ r ≤ min{m, n}, where r denotes again the rank of the matrix Z. The algorithm WH2^{32} reads:

    \dot{A} = ZB - AB'Z'A ,    A(0) = A_0 ,
    \dot{B} = Z'A - BA'ZB ,    B(0) = B_0 .        (1)

It has been derived by extending Brockett's work on isospectral flow systems^{3} from single to double orthogonal group; the initial state A_0, B_0 of the dynamical equations may be freely chosen. Here we consider the particular choice A_0 ∈ St(m, p, R) and B_0 ∈ St(n, p, R), as for instance A_0 = I_{m,p} and B_0 = I_{n,p}.
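The paper states the algorithms as ODEs; purely as an illustration (not part of the original text), the following NumPy sketch applies a plain Euler discretization to the WH2 flow (1). The helper name wh2_step is ours; the problem sizes, initial states, stepsize and iteration count are borrowed from the experiment reported in Sec. 4.1, while the random test matrix is an assumption.

    import numpy as np

    def wh2_step(A, B, Z, eta):
        """One Euler step of the WH2 flow (1): dA = ZB - AB'Z'A, dB = Z'A - BA'ZB."""
        dA = Z @ B - A @ B.T @ Z.T @ A
        dB = Z.T @ A - B @ A.T @ Z @ B
        return A + eta * dA, B + eta * dB

    # Illustrative usage with the Sec. 4.1 setting (m = 8, n = 6, p = 3, eta = 0.005)
    m, n, p = 8, 6, 3
    rng = np.random.default_rng(0)
    Z = rng.standard_normal((m, n))
    A = np.eye(m, p)          # A_0 = I_{m,p}
    B = np.eye(n, p)          # B_0 = I_{n,p}
    for _ in range(8000):
        A, B = wh2_step(A, B, Z, eta=0.005)
    # Deviation of the learnt A from orthonormality (the paper reports values of order 1e-15)
    print(np.linalg.norm(A.T @ A - np.eye(p)))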

Theorem 1
If the initial states of the WH2 system belong to the Stiefel manifold, then the whole dynamics is double-Stiefel.

Proof
We wish to prove that if A_0 ∈ St(m, p, R) and B_0 ∈ St(n, p, R), then A(t) ∈ St(m, p, R) and B(t) ∈ St(n, p, R) for all t ≥ 0. To show this for the matrix A, it is sufficient to prove that the trajectory emanating from any point such that A'A = I_p has differential d(A'A) = 0. Since d(A'A) = (dA)'A + A'(dA) = (\dot{A}'A + A'\dot{A})dt = [\dot{A}, A]dt, this may be proven by computing the bracket [\dot{A}, A]:

    [\dot{A}, A] = B'Z'A - A'ZBA'A + A'ZB - A'AB'Z'A
                 = (I_p - A'A)B'Z'A + A'ZB(I_p - A'A)
                 = [I_p - A'A, B'Z'A] = [0, B'Z'A] = 0 ,

from which d(A'A) = 0. In a similar way it can be shown that B_0 ∈ St(n, p, R) implies [dB, B] = 0, which ensures B(t) ∈ St(n, p, R) for t ≥ 0. ∎

The stationary points of the WH2 algorithm, when the state matrices keep within the double Stiefel manifold, may be easily characterized. In fact, we can state the following result:

Theorem 2
The steady states of the WH2 learning system can be written as A = U_pK and B = V_pK, where K is arbitrary in O(p, R) and U_p and V_p denote the sub-matrices whose columns are p left and p right singular vectors of the matrix Z, respectively.

Proof
From the WH2 learning equations we find that the steady states satisfy:

    ZB = AB'Z'A  and  Z'A = BA'ZB .        (2)

At equilibrium, the product A'ZB must be symmetric. To prove this, it is sufficient to use the first of conditions (2):

    S := A'ZB = A'(AB'Z'A) = (A'A)B'Z'A = B'Z'A = S' .

Now, as A belongs to the Stiefel manifold St(m, p, R) at any time and thus has rank p, the equilibrium solution may be parameterized as A = U_pK_a; the same holds for B = V_pK_b, where K_a and K_b are matrices in O(p, R). This ensures that A and B span the SVD-subspace of Z. On the basis of this parameterization, the product A'ZB writes K_a'(U_p'ZV_p)K_b, where, by definition, U_p'ZV_p = D_1, the diagonal matrix of the p singular values. On the other hand, S = K_a'D_1K_b must be symmetric, and this may hold only if K_a = K_b = K. ∎

This shows that the WH2 algorithm does not actually compute the true SVD, but an SVD-subspace of dimension p.

The WH4 learning system introduced in Ref. 32 reads:

    \dot{A} = ZB - (1/2)A(A'ZB + B'Z'A) ,    A(0) = A_0 ,
    \dot{B} = Z'A - (1/2)B(A'ZB + B'Z'A) ,    B(0) = B_0 ,        (3)

which readily rewrites as:

    \dot{A} = ZB - (1/2)A[A, ZB] ,    A(0) = A_0 ,
    \dot{B} = Z'A - (1/2)B[A, ZB] ,    B(0) = B_0 .

Theorem 3
Under the hypotheses A_0 ∈ St(m, p, R) and B_0 ∈ St(n, p, R), the learning equations WH4 keep A(t) and B(t) within the Stiefel manifold.

Proof
In order to demonstrate the claim, let us compute [\dot{A}, A] and [\dot{B}, B]:

    2[\dot{A}, A] = [I_p - A'A, [A, ZB]] ,
    2[\dot{B}, B] = [I_p - B'B, [A, ZB]] .

If A'A = B'B = I_p, then it follows from the above expressions that [dA, A] = 0 and [dB, B] = 0, thus A(t) ∈ St(m, p, R) and B(t) ∈ St(n, p, R) for any t. This proves the claim. ∎
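As a quick numerical illustration of Theorems 1 and 3 (a sketch under assumed random data; the helper name bracket is ours), one can check that [\dot{A}, A] = \dot{A}'A + A'\dot{A} vanishes along the WH2 and WH4 vector fields whenever A'A = I_p:

    import numpy as np

    def bracket(X, Y):
        # [X, Y] := X'Y + Y'X  (the symmetric bracket used in the paper)
        return X.T @ Y + Y.T @ X

    rng = np.random.default_rng(1)
    m, n, p = 8, 6, 3
    Z = rng.standard_normal((m, n))
    A, _ = np.linalg.qr(rng.standard_normal((m, p)))   # random point with A'A = I_p
    B, _ = np.linalg.qr(rng.standard_normal((n, p)))   # random point with B'B = I_p

    dA_wh2 = Z @ B - A @ B.T @ Z.T @ A                 # WH2 field (1)
    S = A.T @ Z @ B + B.T @ Z.T @ A
    dA_wh4 = Z @ B - 0.5 * A @ S                       # WH4 field (3)

    print(np.linalg.norm(bracket(dA_wh2, A)))   # ~1e-15: d(A'A)/dt = 0 for WH2
    print(np.linalg.norm(bracket(dA_wh4, A)))   # ~1e-15: d(A'A)/dt = 0 for WH4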

The WH3 learning system was derived as an extension of the well-known Oja's subspace rule.^{26} The algorithm WH3 reads:

    \dot{A} = ZB - A(A'ZB + B'Z'A) ,    A(0) = A_0 ,
    \dot{B} = Z'A - B(A'ZB + B'Z'A) ,    B(0) = B_0 .        (4)

The dynamical properties of the system WH3 follow as a trivial corollary of Theorem 3:

Corollary 1
Under the hypotheses A_0/\sqrt{2} ∈ St(m, p, R) and B_0/\sqrt{2} ∈ St(n, p, R), the learning equations WH3 keep A(t)/\sqrt{2} and B(t)/\sqrt{2} within the Stiefel manifold. Moreover, WH3 is diffeomorphic to WH4.

Proof
By defining the auxiliary state-matrices A_x := \sqrt{2}A and B_x := \sqrt{2}B, the system (4) turns out to be identical to (3). ∎

The structure of the stationary points of the WH3-4 algorithms is similar to the structure of the equilibria of the WH2 system. This is proven in the following result:

Theorem 4
The steady states of the WH3 and WH4 learning systems write A = U_pK and B = V_pK, where K is arbitrary in O(p, R) and U_p and V_p denote the sub-matrices whose columns are p left and p right singular vectors of the matrix Z, respectively.

Proof
The WH3 and WH4 learning equations may be given a unified expression in the following way:

    \dot{A} = ZB - (1/ν)A(A'ZB + B'Z'A) ,    A(0) = A_0 ,
    \dot{B} = Z'A - (1/ν)B(A'ZB + B'Z'A) ,    B(0) = B_0 ,
    A'A = B'B = (ν/2)I_p ,        (5)

where ν = 1 for the WH3 and ν = 2 for the WH4. The steady states satisfy:

    νZB = AA'ZB + AB'Z'A   and   νZ'A = BA'ZB + BB'Z'A .        (6)

At equilibrium, the product S := A'ZB must be symmetric. To prove this, it suffices to use the first of conditions (6):

    S = A'(ZB) = (1/ν)(A'A)(A'ZB + B'Z'A) = (1/2)(S + S') .

This shows that 2S = S + S' and thus that S = S'. The conclusion now follows from the same argument as in Theorem 2. ∎

2.2. The HM neural SVD dynamical system

The HM dynamics arises from the maximization of a specific metric-criterion Φ_W : O(m, C) × O(n, C) → R defined as:

    Φ_W(A, B) := 2 Re tr(W A^*ZB) ,        (7)

where W ∈ R^{n×m} is a weighting matrix and Z ∈ C^{m×n} is the matrix whose (complex-valued) SVD is looked for, under the hypothesis that m ≥ n. The dynamical system, derived as a Riemannian gradient flow (see Appendix A.2) on O(m, C) × O(n, C), reads:

    \dot{A} = A(W^*B^*Z^*A - A^*ZBW) ,    A(0) = A_0 ,
    \dot{B} = B(WA^*ZB - B^*Z^*AW^*) ,    B(0) = B_0 .        (8)

By construction it holds that A(t) ∈ O(m, C) as well as B(t) ∈ O(n, C).

In the particular case where W = -I_{n,m} and the involved quantities are real-valued, the system (8) recasts into:

    \dot{A} = ZBI_{n,m} - AI_{m,n}B'Z'A ,
    \dot{B} = Z'AI_{m,n} - B(AI_{m,n})'ZB .        (9)

Such a simplified system is equivalent to WH2 when p = n. In order to prove this statement, it is first worth noting from the second equation of the system (9) that the last m - n columns of A do not influence the dynamics of B; thus, it is worth defining the reduced-size matrix A_n := AI_{m,n} and noting that \dot{A}_n = \dot{A}I_{m,n} = ZBI_{n,m}I_{m,n} - (AI_{m,n})B'Z'(AI_{m,n}); thanks to the hypothesis m ≥ n, it is directly verified that I_{n,m}I_{m,n} = I_n, therefore the system (9) recasts into:

    \dot{A}_n = ZB - A_nB'Z'A_n ,
    \dot{B} = Z'A_n - BA_n'ZB ,

whereby the equivalence with the algorithm WH2 when p = n.
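The equivalence just established can be checked numerically. The sketch below (not from the paper; the random data, the QR-based sampling of orthogonal matrices and the helper names hm_field and wh2_field are illustrative assumptions) evaluates the real-valued right-hand side of (8) with W = -I_{n,m} and compares its reduced part with the WH2 field:

    import numpy as np

    def hm_field(A, B, Z, W):
        """Real-valued right-hand side of (8): dA = A(W'H' - HW), dB = B(WH - H'W'), H = A'ZB."""
        H = A.T @ Z @ B
        return A @ (W.T @ H.T - H @ W), B @ (W @ H - H.T @ W.T)

    def wh2_field(A, B, Z):
        """Right-hand side of the WH2 flow (1)."""
        return Z @ B - A @ B.T @ Z.T @ A, Z.T @ A - B @ A.T @ Z @ B

    rng = np.random.default_rng(3)
    m, n = 8, 3
    Z = rng.standard_normal((m, n))
    A = np.linalg.qr(rng.standard_normal((m, m)))[0]   # A in O(m, R)
    B = np.linalg.qr(rng.standard_normal((n, n)))[0]   # B in O(n, R)

    W = -np.eye(n, m)                                  # W = -I_{n,m}
    dA, dB = hm_field(A, B, Z, W)
    An = A[:, :n]                                      # reduced matrix A_n = A I_{m,n}
    dAn_wh2, dB_wh2 = wh2_field(An, B, Z)
    print(np.linalg.norm(dA[:, :n] - dAn_wh2))         # ~1e-15
    print(np.linalg.norm(dB - dB_wh2))                 # ~1e-15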

The above analysis shows that the Weingessel–Hornik SVD learning equations may be regarded as special cases of the Helmke–Moore system; in particular, this is an indirect proof that choosing the elements of the weighting kernel W as a pseudo-identity makes the SVD algorithm an SVD-subspace rule. A consequence of these findings is that the properties of the above-mentioned learning rules may be given a unified investigation, which is the subject of the following section.

As a useful side-note, it is worth mentioning the opportunity of modifying the HM system (8) when the ratio m/n is much larger than 1: in this case, from a numerical point of view, it is convenient to compute the thin SVD of the matrix Z instead of the regular SVD.^{15} Under the hypothesis that Z ∈ C^{m×n} and that n coincides with the rank of Z, the thin SVD of Z is defined as the triple (U_n, D_n, V) such that U_n ∈ St(m, n, C), D_n ∈ R^{n×n} diagonal, V ∈ O(n, C) and Z = U_nD_nV^*. In this case, the HM system (8) easily recasts into a more compact form as:

    H = A^*ZB ∈ C^{n×n} ,
    \dot{A} = A(W^*H^* - HW) ,    A(0) = A_0 ∈ St(m, n, C) ,
    \dot{B} = B(WH - H^*W^*) ,    B(0) = B_0 ∈ O(n, C) ,        (10)

with W ∈ R^{n×n} diagonal. In this case, the neural-network state-matrix A evolves on the Stiefel manifold of dimension m × n, thus its numerical representation is more advantageous under the considered hypothesis m ≫ n.

3. Theoretical Considerations

This section is dedicated to the statement and proof of some theoretical results about the behavior of the Weingessel–Hornik and Helmke–Moore learning systems.

3.1. Derivation of WH equations

As mentioned, the WH2 equations arise as a special case of the HM equations, therefore the derivation of the WH2 equations is implicitly considered in Sec. 3.2.

Weingessel and Hornik derived the WH3 learning rule from Oja's principal subspace equation,^{26} which, for a (m + n) × p network with connection matrix M, writes:

    \dot{M} = (I_{m+n} - MM')CM ,        (11)

where C is a (m + n) × (m + n) covariance matrix. By relating the (m × n) covariance Z, for which an SVD subspace is sought, with C and by effecting a proper block-decomposition of the state-matrix M into A and B, the WH3 learning rule is easily obtained from Oja's subspace rule, as shown in the following result.

Theorem 5
(Ref. 33.) Let us define C = \begin{bmatrix} 0_m & Z \\ Z' & 0_n \end{bmatrix} and M = \begin{bmatrix} A \\ B \end{bmatrix}. Then Eq. (11) is equivalent to the system (4).

Proof
By replacing the expressions for C and M into Eq. (11) we obtain:

    \begin{bmatrix} \dot{A} \\ \dot{B} \end{bmatrix}
      = \left( \begin{bmatrix} I_m & 0_{m,n} \\ 0_{n,m} & I_n \end{bmatrix}
        - \begin{bmatrix} A \\ B \end{bmatrix} [A' \; B'] \right)
        \begin{bmatrix} 0_m & Z \\ Z' & 0_n \end{bmatrix}
        \begin{bmatrix} A \\ B \end{bmatrix}
      = \begin{bmatrix} I_m - AA' & -AB' \\ -BA' & I_n - BB' \end{bmatrix}
        \begin{bmatrix} ZB \\ Z'A \end{bmatrix}
      = \begin{bmatrix} (I_m - AA')ZB - AB'Z'A \\ -BA'ZB + (I_n - BB')Z'A \end{bmatrix} .        (12)

By separating the two differential equations and by properly regrouping the terms on the right-hand sides, the WH3 learning system reported in this paper is readily obtained. ∎
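Theorem 5 lends itself to a direct numerical check. The following sketch (random data assumed; the helper names oja_field and wh3_field are ours) builds the block matrices C and M and verifies that Oja's rule (11) applied to the stacked state reproduces the WH3 field (4) exactly:

    import numpy as np

    def oja_field(M, C):
        """Oja's subspace rule (11): dM = (I - MM')CM."""
        k = M.shape[0]
        return (np.eye(k) - M @ M.T) @ C @ M

    def wh3_field(A, B, Z):
        """WH3 rule (4): dA = ZB - A(A'ZB + B'Z'A), dB = Z'A - B(A'ZB + B'Z'A)."""
        S = A.T @ Z @ B + B.T @ Z.T @ A
        return Z @ B - A @ S, Z.T @ A - B @ S

    rng = np.random.default_rng(4)
    m, n, p = 8, 6, 3
    Z = rng.standard_normal((m, n))
    A = rng.standard_normal((m, p)); B = rng.standard_normal((n, p))

    C = np.block([[np.zeros((m, m)), Z], [Z.T, np.zeros((n, n))]])
    M = np.vstack([A, B])
    dM = oja_field(M, C)
    dA, dB = wh3_field(A, B, Z)
    print(np.linalg.norm(dM[:m] - dA), np.linalg.norm(dM[m:] - dB))   # both ~1e-15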

It is interesting to observe that the WH3 system inherits the known and noticeable properties of Oja's subspace equations, such as the Riccati structure of the projector MM'. Namely, by defining:

    P := \begin{bmatrix} AA' & AB' \\ BA' & BB' \end{bmatrix} ,        (13)

it is easy to show that P satisfies the differential equation:

    \dot{P} = CP + PC - 2PCP ,        (14)

which is a special kind of Riccati differential equation.^{29}

Another interesting observation is that Oja's criterion, which leads to the associated subspace rule, induces a criterion on the pair (A, B) that is a special case of the HM criterion. To show this implication, it is worth recalling the following:

Lemma 1
(Ref. 26.) Oja's subspace rule (11) arises from the optimization of the criterion tr[M'CM] under the constraint M ∈ St(m + n, p, R).

Having recalled this basic fact, we can state the mentioned equivalence result:

Theorem 6
Oja's criterion for the block-pair (A, B) is identical to the HM criterion for the real-valued case when W = -I_{n,m}.

Proof
By invoking again the block-partition of Theorem 5, we have:

    tr(M'CM) = tr\!\left( [A' \; B'] \begin{bmatrix} 0_m & Z \\ Z' & 0_n \end{bmatrix} \begin{bmatrix} A \\ B \end{bmatrix} \right)
             = tr(A'ZB + B'Z'A) .        (15)

It follows that tr(M'CM) = 2 tr(A'ZB), which proves the claim. ∎

3.2. Derivation of HM equations on Stiefel manifold with Killing metric

The rationale of the Helmke–Moore criterion Φ derives from the basic observation that the aim of the SVD is to diagonalize A'ZB, which in a signal-processing perspective means minimizing the covariance values among the signals of which Z is the cross-covariance matrix. Having fixed an arbitrary diagonal matrix H_0, this result may be achieved by minimizing, under proper constraints, the "non-diagonality" measure ||A'ZB - H_0||²_F. However, the following identity holds:^{18}

    ||A'ZB - H_0||²_F = tr(ZZ') + ||H_0||²_F - Φ_{H_0}(A, B) ,

thus minimizing the non-diagonality measure is equivalent to maximizing the function (7). It is understood that the optimization process should be performed over O(m, R) × O(n, R). In the real-valued case under consideration here, also the well-known (weighted) Rayleigh quotient (RQ) may be invoked, which is defined as:

    R_W(A, B) := tr(A'ZBW) / (||A||_F ||B||_F) .        (16)

As long as A and B belong to the orthogonal groups, which implies ||A||²_F = m and ||B||²_F = n, the identity 2\sqrt{mn} R_W(A, B) = Φ_W(A, B) holds.

In any case, the quantity tr(A'ZBW) is a starting point for developing a suitable SVD learning theory, generating HM-type and WH2-type differential equation systems.

In order to derive the Riemannian-gradient flows HM or WH2, both Ref. 32 and Ref. 18 used the technique proposed by Brockett,^{3} which involves the first-order expansion of the dynamics of A(t) and B(t) in series of skew-symmetric matrices. Here we aim at re-deriving the HM equations (for the real-valued orthogonal group) in a different way, following the more straightforward Riemannian-gradient approach suggested by Amari,^{1,4} based on the geometry of the Stiefel manifolds.
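As a small numerical illustration of the relation between the Rayleigh quotient (16) and the criterion (7) on the orthogonal group (random data assumed, not taken from the paper):

    import numpy as np

    rng = np.random.default_rng(5)
    m, n = 8, 3
    Z = rng.standard_normal((m, n))
    A = np.linalg.qr(rng.standard_normal((m, m)))[0]    # A in O(m, R)
    B = np.linalg.qr(rng.standard_normal((n, n)))[0]    # B in O(n, R)
    W = rng.standard_normal((n, m))                     # weighting matrix

    Phi = 2.0 * np.trace(W @ A.T @ Z @ B)               # criterion (7), real-valued case
    RQ = np.trace(A.T @ Z @ B @ W) / (np.linalg.norm(A) * np.linalg.norm(B))   # (16)
    print(Phi - 2.0 * np.sqrt(m * n) * RQ)              # ~1e-15: 2*sqrt(mn)*R_W = Phi_W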

Theorem 7
Let us define H := A'ZB. The gradient-based maximization of the objective function Φ_r(H) := tr(WH), where A ∈ O(m, R), B ∈ O(n, R), W ∈ R^{n×m} and Z ∈ R^{m×n}, with the spaces endowed with the Killing metric, gives rise to the following dynamical system:

    \dot{A} = -ZBW + AW'B'Z'A ,
    \dot{B} = -Z'AW' + BWA'ZB .        (17)

Proof
The learning criterion function is Φ_r = tr(WA'ZB). A perturbation (dA, dB) of the neural network state (A, B) causes a change dΦ_r. In particular, up to first order:

    Φ_r(H + dH) = tr(W(A + dA)'Z(B + dB)) = Φ_r(H) + tr(W dA'ZB) + tr(WA'Z dB) ,

therefore, by exploiting the properties of the elements of the orthogonal groups and of the trace operator, we have:

    dΦ_r(H) = tr(W dA'(AA')ZB) + tr(WA'Z(BB')dB)
            = tr(W dA'AH) + tr(WHB'dB)
            = -tr(HWA'dA) + tr(WHB'dB) .        (18)

It is now useful to introduce the differentials dX := A'dA and dY := B'dB, which form a basis of the tangent space to O(m, R) at A and to O(n, R) at B. The tangent spaces are linear spaces and, moreover, they are the sets of proper-size skew-symmetric matrices; in fact, we have dX' = -dX and dY' = -dY. In view of optimization, these properties must be preserved. A way to preserve the structure of the tangent space is to note that:

    dΦ_r(H) = -tr(HW dX) + tr(WH dY)
            = tr(HW dX') - tr(WH dY')
            = tr(dX'HW) - tr(dY'WH)
            = tr(W'H' dX) - tr(H'W' dY) .        (19)

By summing Eqs. (18) and (19) hand-by-hand we ultimately obtain:

    2dΦ_r(H) = tr({W, H'} dX) + tr({W', H} dY) .        (20)

As our aim is to find directions dA and dB that point toward the maximum of the function Φ_r, we now search for the steepest-ascent directions ∆X and ∆Y which, by definition, are the variations that maximize the change ∆Φ_r under finite-step-length constraints, namely ||∆X||² = ε_x² > 0 and ||∆Y||² = ε_y² > 0. To this aim, we need to specify the norm ||·||; in this case we use the standard Euclidean metric on the tangent space, that is, the Killing metric (see Appendix A.3), with which the constraints rewrite as tr(∆X'∆X) = ε_x² and tr(∆Y'∆Y) = ε_y². In order to enforce the mentioned constraints, the standard Lagrange-multipliers method may be employed, which consists in the definition of the Lagrangean function:

    L := tr({W, H'}∆X) + tr({W', H}∆Y) + λ_x(tr(∆X'∆X) - ε_x²) + λ_y(tr(∆Y'∆Y) - ε_y²) ,

whose free extremes may be looked for. They are found by:

    ∂L/∂∆X = {W, H'}' + 2λ_x∆X = 0 ,
    ∂L/∂∆Y = {W', H}' + 2λ_y∆Y = 0 .

Since {W, H'} and {W', H} are skew-symmetric, the steepest-ascent variations express as dX ∝ {W, H'} and dY ∝ {W', H}. Coming back to the original variables in the orthogonal groups we have \dot{A} = A{W, H'} and \dot{B} = B{W', H}, which proves the claim. ∎

The learning system shown coincides with the HM system in the real-valued case and also explains the WH2 learning theory.
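The tangency property underlying the derivation can be verified numerically. The sketch below (illustrative assumptions: random data and the variable names used) checks that the factor Ω_A = {W, H'} multiplying A in (17) is skew-symmetric, so that d(A'A)/dt = 0 along the flow:

    import numpy as np

    rng = np.random.default_rng(6)
    m, n = 8, 3
    Z = rng.standard_normal((m, n))
    W = rng.standard_normal((n, m))
    A = np.linalg.qr(rng.standard_normal((m, m)))[0]
    B = np.linalg.qr(rng.standard_normal((n, n)))[0]

    H = A.T @ Z @ B
    Omega_A = W.T @ H.T - H @ W          # {W, H'} = W'H' - HW
    Omega_B = W @ H - H.T @ W.T          # {W', H} = WH - H'W'
    dA, dB = A @ Omega_A, B @ Omega_B    # flow (17)

    print(np.linalg.norm(Omega_A + Omega_A.T))   # ~1e-15: A'dA is skew-symmetric
    print(np.linalg.norm(dA.T @ A + A.T @ dA))   # ~1e-15: d(A'A)/dt = 0, so A stays in O(m, R)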

3.3. Stability analysis of HM equation on orthogonal group via Lyapunov function

One of the main theoretical advantages of the learning systems on Stiefel manifolds is their inherent stability,^{9} due to the compactness of these sets. In the present case, the convergence of the HM system may be proven by showing that Φ_W(A, B) is a Lyapunov-type function for the HM system.

More formally, let us denote by Φ_min the minimal value of the function in O(m, R) × O(n, R); note that it exists since Φ_W is a continuous function defined on a compact manifold.

Theorem 8
Let us define the (lifted-criterion) time-function:

    Ψ(t) := tr(A'(t)ZB(t)W) - Φ_min .        (21)

It is a Lyapunov function for the system (17).

Proof
By construction Ψ(t) ≥ 0. Also, by letting H(t) := A'(t)ZB(t) and by using Eqs. (17) it is found:

    \dot{Ψ} = tr(\dot{A}'ZBW + A'Z\dot{B}W)
            = -tr(W'H'HW + WHH'W' - 2HWHW) .        (22)

Some mathematical work shows that the following identities hold true:

    -||{W, H'}||²_F = tr(-2W'H'HW + 2HWHW) ,
    -||{W', H}||²_F = tr(-2WHH'W' + 2HWHW) .

In virtue of these results, we may rewrite the right-hand side of expression (22) as follows:

    \dot{Ψ}(t) = -(1/2)( ||{W, H'(t)}||²_F + ||{W', H(t)}||²_F ) ≤ 0 .        (23)

Such inequality proves the claim. ∎

The structure of the steady-state solutions of the HM system has been studied in detail by Helmke and Moore and has been reported in Ref. 18. The convergence property of the HM system towards such steady states now follows immediately.

Corollary 2
The HM learning system (17) converges asymptotically.

Proof
The existence of a Lyapunov function for the dynamical system under analysis, proven by Theorem 8, guarantees the asymptotic stability of its equilibria.^{16} ∎

4. Numerical Experiments

Some numerical experiments are described and commented in the following sections. They help in assessing the qualitative behavior of the discussed learning equations.

It is worth noting that the discrete-time implementations used to numerically solve the ODEs associated to the learning systems introduce some deviations with respect to the theoretical findings: whereas the continuous-time versions of the learning algorithms leave the double Stiefel manifold and the orthogonal group invariant, this is not necessarily true for their discrete-time counterparts. These aspects deserve a separate treatment and are addressed in the last section.

4.1. Numerical experiments on SVD-subspace extraction by WH2-3-4 algorithms

We performed some experiments with the Weingessel–Hornik algorithms, in order to numerically evaluate their behavior.

Let us denote again by Z the m × n matrix whose SVD-subspace of dimension p is looked for, and let us denote by (U, D, V) the matrices of left-singular vectors, singular values and right-singular vectors. The extraction of a p-dimensional SVD subspace implies that the columns of the matrices A and B in the WH algorithms should span, after convergence, the same subspaces spanned by the first p columns of U and V, respectively, under the hypothesis that the singular values are decreasingly ordered. Let us denote by U_p and V_p the sub-matrices of U and V containing their first p columns: a proper measure of the SVD-subspace extraction ability of the network is the SVD-subspace disparity error pair, defined in Ref. 32 as:

    ε(A) := ||U_pU_p'A - A||_2 / ||A||_2 ,
    ε(B) := ||V_pV_p'B - B||_2 / ||B||_2 .        (24)

As a numeric problem we considered the case m = 8, n = 6 and p = 3. The numerical results have been obtained by randomly picking a matrix Z with normal Gaussian entries and by computing the disparity errors ε(A) and ε(B) at each iteration, for a total of 8,000 iterations. This experiment is repeated over 100 independent trials: the average learning curves are presented. Also, if we denote by ε(A_⋆) and ε(B_⋆) the final values of the errors at the end of the iterations (which measure the learnt-network performance), the statistical distributions of these quantities over the 100 trials are estimated.
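For reference, a minimal sketch of the disparity-error pair (24) in NumPy is given below (the helper name disparity_errors is ours; ||·||_2 is read as the matrix 2-norm, and NumPy's SVD returns the singular values in decreasing order, as assumed in the text):

    import numpy as np

    def disparity_errors(A, B, Z, p):
        """SVD-subspace disparity errors (24)."""
        U, _, Vt = np.linalg.svd(Z)
        Up, Vp = U[:, :p], Vt.T[:, :p]
        eA = np.linalg.norm(Up @ Up.T @ A - A, 2) / np.linalg.norm(A, 2)
        eB = np.linalg.norm(Vp @ Vp.T @ B - B, 2) / np.linalg.norm(B, 2)
        return eA, eB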

Fig. 1. Values of average disparity errors and estimation of disparity errors distribution after learning for the WH2 algorithm.

Fig. 2. Values of average disparity errors and estimation of disparity errors after learning for the WH3 algorithm.

For the WH2 algorithm the initial values of the connection matrices were A_0 = I_{m,p} and B_0 = I_{n,p} and the value of the learning stepsize was η = 0.005. The results of the numerical experiments are illustrated in Fig. 1. The average behavior of the algorithm is very good. No instabilities were observed and in the largest part of the trials the value of the error at the end of learning is very low: 97% of the trials gave an error of less than -50 dB both for the U_p-subspace and the V_p-subspace. We also checked the orthonormality of the solutions A_⋆ and B_⋆ and found that they are orthonormal, with respect to the measures ||A_⋆'A_⋆ - I_p||_F and ||B_⋆'B_⋆ - I_p||_F, up to order 10^{-15}.

For the WH3 algorithm the initial values of the connection matrices were A_0 = I_{m,p}/\sqrt{2} and B_0 = I_{n,p}/\sqrt{2} and the value of the learning stepsize was η = 0.005. The results of the numerical experiments are shown in Fig. 2. Again, the average behavior of the algorithm is very good and no instabilities were observed. It is interesting to inspect the numerical results for a single-case trial. We consider the random matrix (only four decimal digits):

    Z =
      1.4283 0.8074 1.1283 1.0507 0.6424 2.1022  − −
      1.6218 0.5095 0.1493 0.3916 0.8243 0.5004  − −
      1.4791 0.1503 0.3535 0.7533 1.6819 1.1130  − −
      0.9191 0.3345 0.3409 2.8772 0.5035 1.3955  − − − −
      0.4135 1.4989 0.0959 0.1144 0.1325 0.6744  − − − −
      0.4591 0.4383 2.0674 1.0988 0.4373 1.1486  − − −
      1.4654 0.9846 0.1393 2.9265 0.3399 0.8313  − − − −
      0.5261 0.2351 0.5403 0.1352 1.1526 0.4636  − − −

For this covariance matrix the algorithm has computed the following left singular vectors (only four decimal digits):

    A_⋆ =
      0.4385 0.4101 0.1641
      0.0668 0.3479 0.1297  −
      0.2208 0.2714 0.2575  −
      0.2818 0.0320 0.3900  − −
      0.0350 0.1621 0.0019  − −
      0.0955 0.0248 0.4195  − − −
      0.3997 0.3052 0.1590  −
      0.0695 0.1261 0.1906  − −

and right singular vectors (only four decimal digits):

    B_⋆ =
      0.0196 0.5579 0.2052  − −
      0.1994 0.1046 0.1263  − −
      0.2354 0.0735 0.3370
      0.5947 0.0994 0.3394  − −
      0.0388 0.2870 0.1789  − −
      0.2219 0.2832 0.4256

In this experiment the subspace disparity errors were about -270 dB.

For the WH4 algorithm the initial values of the connection matrices were A_0 = I_{m,p} and B_0 = I_{n,p} and the value of the learning stepsize was η = 0.006. The results of the numerical experiments are illustrated in Fig. 3. The results are as expected. We again checked the orthonormality of the solutions A_⋆ and B_⋆ and found that they are orthonormal up to order 10^{-15}. We also checked the "symmetrization" property of the matrix product A'ZB and found that, at the end of learning, it was (only four decimal digits):

    A_⋆'ZB_⋆ =
      3.1045 0.3523 0.0450  −
      0.3523 3.2290 0.0278  − −
      0.0450 0.0278 3.0668  − −

This result comes from a single trial: as expected from the theory (see Theorem 4), the form A'ZB is symmetric but not diagonal, confirming the analytical finding that the pair (A, B) is just a basis of the SVD subspace, not a proper singular-value decomposition.

Fig. 3. Values of average disparity errors and estimation of disparity errors after learning for the WH4 algorithm.

4.2. Numerical experiments on SVD extraction by HM algorithm

We performed some experiments with the Helmke–Moore algorithm, in order to evaluate its behavior numerically.

Again Z denotes the m × n matrix whose SVD is sought for and (U, D, V) denote the matrices of left-singular vectors, singular values and right-singular vectors. As in the theoretical sections, we consider m ≥ n. As indicators of the behavior of the algorithm, we consider the following measures.

First, if A_n, B_n, U_n and V_n denote the sub-matrices formed by the first n columns of the SVD and network matrices, it is known that the columns of A_n should tend to the columns of U_n, while the columns of B_n should tend to the columns of V_n, ordered in the same way but with a possible sign switch for every column; therefore, a proper measure of (A, B) convergence is:

    ε(A_p) := || |U_p| - |A_p| ||_F ,
    ε(B_p) := || |V_p| - |B_p| ||_F ,        (25)

where |X| stands for component-wise absolute-value extraction.

Second, it is interesting to inspect the value of the criterion function Φ(A, B) = 2 tr(WA'ZB) during learning and to compare its asymptotic value with the optimum Φ_⋆ = 2 tr(WU'ZV).

Third, we know that the HM learning principle is designed to diagonalize the product matrix A'ZB, therefore an interesting error measure is the norm of the off-diagonal part of that matrix; namely, we may define a corresponding index as:

    δ(A, B) := ||offdiag(WA'ZB)||_F ,        (26)

with clear meaning of the symbols.

Fourth, it is also extremely interesting to measure the deviation from orthonormality of the connection matrices. This may be achieved with the help of the indices:

    n(A) := ||A'A - I_m||_F ,
    n(B) := ||B'B - I_n||_F .        (27)

As a numeric problem, we considered the case m = 8, n = 3. The numerical results have been obtained by randomly generating a matrix Z and by computing the above-described indices at each iteration, for a total of 8,000 iterations.
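A compact sketch of the indices (25)-(27) in NumPy is given below (illustrative only; the helper name hm_indices is ours, and the weighting matrix W is assumed to be of size n × m as in Sec. 3):

    import numpy as np

    def hm_indices(A, B, Z, W, p):
        """Performance indices (25)-(27) for the HM algorithm."""
        U, _, Vt = np.linalg.svd(Z)
        V = Vt.T
        eA = np.linalg.norm(np.abs(U[:, :p]) - np.abs(A[:, :p]))       # (25)
        eB = np.linalg.norm(np.abs(V[:, :p]) - np.abs(B[:, :p]))
        M = W @ A.T @ Z @ B                                            # (26): off-diagonal norm
        delta = np.linalg.norm(M - np.diag(np.diag(M)))
        nA = np.linalg.norm(A.T @ A - np.eye(A.shape[1]))              # (27)
        nB = np.linalg.norm(B.T @ B - np.eye(B.shape[1]))
        return eA, eB, delta, nA, nB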

Fig. 4. Values of performance indices for the HM algorithm averaged over 100 independent trials. (Panels: estimation errors; learning criterion; deviation from normality; diagonalization. Solid: index of matrix A; dashed: index of matrix B.)

This experiment was repeated over 100 independent trials in order to show average learning curves. It is important to note that, in order to compare the values of the function Φ over different trials, both the initial states and the singular values of Z must be kept constant. So we first generated randomly a diagonal matrix D with n non-null entries on the diagonal, kept constant over the whole trial set, and then for each trial a pair of orthogonal matrices (U, V) of proper size was randomly generated; Z was then computed as UDV'.

The results of the experiments are illustrated in Fig. 4. They pertain to the following parameter values: η = 0.001, A_0 = I_m and B_0 = -I_n; also, the top-left diagonal part of the weighting matrix W is diag(3, 2, 1). As is readily seen, the numerical results are very good.

As a single-trial result, it is interesting to inspect the singular-value estimation ability of the algorithm: in one experiment the singular values were σ_1 = 5.2506, σ_2 = 3.5342 and σ_3 = 1.6557, while the diagonal part of A'ZB was diag(5.2941, 3.5446, 1.6587); this shows that the estimation ability of the HM algorithm is quite good.

The last experiment concerns the numerical analysis of stability: in this case, the matrix Z suddenly changes in the middle of the iteration (except for the singular values, which are kept constant). The results of 100 independent trials are illustrated in Fig. 5. They pertain to the same set of parameter values as the preceding experiment. When the matrix Z changes, the performance indices present a peak but they rapidly return to satisfactory asymptotic values.
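The data-generation procedure described above can be sketched as follows (the helper name make_test_matrix is ours, and QR factorization of Gaussian matrices is an assumed way of drawing the random orthogonal factors, since the paper does not specify the sampler; the prescribed values mirror the reported single trial):

    import numpy as np

    def make_test_matrix(m, n, singular_values, rng):
        """Generate Z = U D V' with prescribed singular values and random orthogonal factors."""
        U = np.linalg.qr(rng.standard_normal((m, m)))[0]
        V = np.linalg.qr(rng.standard_normal((n, n)))[0]
        D = np.zeros((m, n))
        D[:n, :n] = np.diag(singular_values)
        return U @ D @ V.T

    rng = np.random.default_rng(7)
    Z = make_test_matrix(8, 3, [5.2506, 3.5342, 1.6557], rng)
    print(np.linalg.svd(Z, compute_uv=False))   # recovers the prescribed singular values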

Fig. 5. Values of performance indices for the HM algorithm averaged over 100 independent trials when the right and left singular vectors change during iteration. (Panels: learning criterion Φ(A, B); diagonalization index δ(A, B). Solid: index of matrix A; dashed: index of matrix B.)

4.3. Discussion on learning equations implementation

In this section we try to put the numerical simulation results into the right picture by briefly expanding on the longer discussion of this topic that has already appeared in recent contributions.^{12,13}

In general, there are several possibilities for implementing a learning algorithm. In this paper, the results are obtained for the continuous-time case, but the simulations are performed for the off-line discrete-time version, related to using a simple Euler approximation for solving the associated ODEs. This opens room for at least one question: how does the used integration algorithm affect the relevance of the obtained numerical results?

We believe a simple yet convincing answer comes from a consideration about the integration time: we classify a learning process into short-integration-time learning, which requires few adaptation steps to get a satisfactory connection pattern, and long-integration-time learning, which involves the solution of the differential system over a long time-interval to obtain a satisfactory result or to tackle a non-stationary signal processing problem, for instance. The quality of the solution of the continuous-time learning equations may be heavily affected by the selected integration scheme only in long-integration-time learning processes: in this case it is not normally possible to select small learning stepsizes, because this would cause an excessive computational burden, thus the learning equations are sampled with relatively large step-sizes; in this case, however, the Euler method may fail in finding accurate state-space trajectories or in fulfilling the constraints (such as orthonormality, in the present case), and more complicated integration techniques should be used, such as the ones relying on the Lie–Euler method or on second-order methods such as the Lie–Runge–Kutta technique.^{12}

In the present case, we clearly dealt with short-integration-time learning processes that do not need such complicated integration schemes, as the small values of the chosen learning stepsizes ensure good convergence in a reasonably small number of iterations. This claim is confirmed, e.g., by the values of the orthogonality measure reported for every simulation, which show that the degree of adherence to the invariant is excellent (up to order 10^{-15}). This consideration suggests that the results of the simulations for the discrete-time case are a legitimate means of illustrating the theoretical results obtained for the ODEs.
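To illustrate the point just made, one can instrument a plain Euler iteration and track the orthonormality measure n(A) = ||A'A - I_p||_F along the run. The sketch below (our own illustration, using the WH2 rule with the Sec. 4.1 parameters and an assumed random matrix) simply records the deviation from the invariant; the paper reports deviations of order 10^{-15} at convergence for this setting.

    import numpy as np

    rng = np.random.default_rng(8)
    m, n, p, eta = 8, 6, 3, 0.005
    Z = rng.standard_normal((m, n))
    A, B = np.eye(m, p), np.eye(n, p)

    drift = []
    for _ in range(8000):
        dA = Z @ B - A @ B.T @ Z.T @ A          # WH2 field, plain Euler step
        dB = Z.T @ A - B @ A.T @ Z @ B
        A, B = A + eta * dA, B + eta * dB
        drift.append(np.linalg.norm(A.T @ A - np.eye(p)))
    # Final deviation from orthonormality of the discrete-time trajectory
    print(drift[-1])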

5. Conclusions

The aim of this paper was to present a unifying view of closely-related parallel SVD-computation algorithms by neural networks learning on the double Stiefel manifold. After showing a new derivation of the HM theory, based on the Riemannian gradient on the double orthogonal group endowed with the Killing metric, some important properties of the algorithms have been investigated. Particularly, a suitable Lyapunov stability criterion has been constructed to prove asymptotic convergence.

Some numerical experiments also helped in assessing the qualitative behavior of the discussed learning equations.

Acknowledgments

I am definitely indebted to the anonymous Reviewer, whose careful comments and detailed suggestions greatly helped to improve the quality and the clarity of the manuscript.

A. Appendix

The aim of the present appendix is to recall some definitions from differential geometry and to justify some notational choices within the paper.

A.1. Double Stiefel-manifold and orthogonal-group

Let us recall the definition of group and argument on the group-structure of the Cartesian product of two orthogonal groups.

Definition 1
(Ref. 19.) A group is a structure (G, ⋆) formed by a set G and an operation ⋆ that associates to every pair (a, b) ∈ G × G a unique element a ⋆ b. The operation satisfies three axioms: (1 - Associativity) (a ⋆ b) ⋆ c = a ⋆ (b ⋆ c); (2 - Existence of a neutral element) there exists some e ∈ G such that e ⋆ a = a for all a ∈ G; (3 - Existence of the inverse) for all a ∈ G there exists an element a^{-1} ∈ G such that a^{-1} ⋆ a = e.

Within the paper the matrix set O(n, K) has been invoked frequently. It is easy to show that it is a group. In fact, by identifying the operation ⋆ with the standard matrix product, it is easily verified that if a, b, c ∈ O(n, K) then (ab)c = a(bc), e = I_n, and there exists an element a^{-1} such that a^{-1}a = I_n.

Let us now consider the Cartesian product G = O(n, K) × O(m, K) and let us show that it is actually a group (of dimension m + n). It is worth considering the block-diagonal representation a = diag(A_1, A_2), with A_1^*A_1 = I_n and A_2^*A_2 = I_m, and again the identification of ⋆ with the matrix product. Then the associativity property holds, and it is readily verified that a^*a = I_{n+m} and a^{-1} = diag(A_1^{-1}, A_2^{-1}). This proves the claim.

Let us also recall the definition of smooth manifold and argument on the manifold-structure of the Cartesian product of two of such geometric entities.

Definition 2
(Ref. 19.) Let M be a topological space. A chart of M is a triple (U, φ, n) consisting of an open subset U of M, a homeomorphism φ of U onto R^n and the dimension n of the chart.

Two charts (U, φ, n) and (V, ψ, m) of M are termed C^∞ compatible charts if either U ∩ V is empty or φ(U ∩ V) and ψ(U ∩ V) are open sets and ψ∘φ^{-1} as well as φ∘ψ^{-1} are C^∞ maps.

A C^∞ atlas of M is a set A = {(U_i, φ_i, n_i) | i ∈ I} of C^∞ compatible charts such that M is completely covered by the union of the U_i's. An atlas A is maximal if every chart of M which is C^∞ compatible with every chart of A also belongs to A.

A smooth manifold M is a topological Hausdorff space endowed with a countable basis and equipped with a maximal C^∞ atlas. If all the coordinate charts of M have the same dimension n then the manifold is said to have dimension n.

On the basis of the above-recalled definitions, it is possible to show that the Cartesian product of two smooth manifolds is a manifold itself. The argument follows from the observation that if M and N are two smooth manifolds, then any two charts (U, φ, n) and (V, ψ, m) of M and N define a chart (U × V, φ × ψ, n + m) of M × N. Therefore, M × N has the structure of a smooth manifold of dimension m + n.^{19} The compact Stiefel manifold is a smooth manifold, thus the product space St(m, n, K) × St(p, q, K) is a manifold itself.

A.2. Riemannian gradient flow

As mentioned, the HM system may be derived as a gradient flow on a Riemannian manifold. In order to clarify the relationship with the presented theory, it is useful to recall the definition of Riemannian gradient flow.

Definition 3
(Ref. 19.) Let M be a smooth manifold. A Riemannian metric on M is a family of non-degenerate inner products, which are functions of the point on the manifold, defined on the tangent space at each point of the manifold, such that it depends smoothly on the point. When a Riemannian metric is specified, M is termed a Riemannian manifold.

Let Φ : M → R be a smooth function defined on a Riemannian manifold M. The gradient vector field grad Φ of the function with respect to the selected metric is uniquely characterized by two conditions, referred to as tangency and compatibility.

On the basis of these definitions, it is possible to recall the concept of Riemannian gradient flow on a Riemannian manifold M as \dot{x}(t) = grad Φ(x(t)), with x(t) ∈ M.

With reference to the HM algorithm, it is worth noting that it has been studied over the double orthogonal group. However, the orthogonal group is one of the classical Lie groups, which possess the noticeable property of also being manifolds. Therefore, by the arguments of the above appendix, the double orthogonal group is also a manifold. This gives the connection between the considered criterion function and the gradient-flow structure of the HM equations.

A.3. Killing form and metric

In order to clarify the concepts underlying the invoked Killing metric, it is useful to first recall some notation from Lie algebra theory.

Definition 4
(Refs. 14, 19, 20.) Let us denote with so(n, R) the set of real-valued skew-symmetric matrices of dimension n.

Given a Lie algebra g and two elements of the algebra X and Y, the adjoint operator associated to X as a function of Y is defined as ad_X(Y) = XY - YX. By definition, the adjoint operator is skew-symmetric, thus it belongs to a Lie algebra so. The Killing form is an inner product on a finite-dimensional Lie algebra defined by K(X, Y) = tr[ad_X ad_Y].

It is known that an inner product defines a metric, thus the Killing form defines the Killing metric K(X, X) on a Lie algebra. In the paper we used the standard Euclidean metric on the Lie algebra so(n, R), namely tr[X'X], because it is a linear space. This expression has the same structure as a Killing form because the elements of so are skew-symmetric matrices, thus we may identify the Euclidean metric on so with the Killing metric on the same space (up to an inessential constant).

References

1. S.-I. Amari 1998, "Natural gradient works efficiently in learning," Neural Computation 10, 251–276.
2. H. Bourlard and Y. Kamp 1988, "Auto-association by multilayer perceptrons and singular value decomposition," Biological Cybernetics 59, 291–294.
3. R. W. Brockett 1991, "Dynamical systems that sort lists, diagonalize matrices and solve linear programming problems," Linear Algebra and Its Applications 146, 79–91.
4. T.-P. Chen, S.-I. Amari and Q. Lin 1998, "A unified algorithm for principal and minor component extraction," Neural Networks 11, 385–390.
5. A. Cichocki and R. Unbehauen 1992, "Neural networks for computing eigenvalues and eigenvectors," Biological Cybernetics 68, 155–164.
6. S. Costa and S. Fiori 2001, "Image compression using principal component neural networks," Image and Vision Computing Journal (special issue on "Artificial Neural Networks for Image Analysis and Computer Vision") 19(9–10), 649–668.
7. E. F. Deprette (ed.) 1988, SVD and Signal Processing (Amsterdam, Elsevier Science).
8. K. I. Diamantaras and S.-Y. Kung 1994, "Cross-correlation neural network models," IEEE Trans. on Signal Processing 42(11), 3218–3223.
9. S. Fiori 2001, "A theory for learning by weight flow on Stiefel-Grassman manifold," Neural Computation 13(7), 1625–1647.
10. S. Fiori 2002, "A theory for learning based on rigid bodies dynamics," IEEE Trans. on Neural Networks 13(3), 521–531.
11. S. Fiori 2002, "Complex-weighted one-unit 'Rigid-Bodies' learning rule for independent component analysis," Neural Processing Letters 15(3), 275–282.
12. S. Fiori 2002, "Unsupervised neural learning on Lie group," International Journal of Neural Systems 12(3 & 4), 219–246.
13. S. Fiori, "A minor subspace algorithm based on neural Stiefel dynamics," International Journal of Neural Systems 12(5), 339–350.
14. W. Fulton and J. Harris 1991, Representation Theory (New York: Springer-Verlag).
15. G. H. Golub and C. F. van Loan 1996, Matrix Computations (The Johns Hopkins University Press, third edition).
16. W. Hahn 1963, Theory and Application of Lyapunov's Direct Method (Englewood Cliffs, New Jersey: Prentice-Hall).

17. S. Haykin 1991, Adaptive Filter Theory (Prentice-Hall).
18. U. Helmke and J. B. Moore 1992, "Singular value decomposition via gradient and self-equivalent flows," Linear Algebra and its Applications 169, 223–248.
19. U. Helmke and J. B. Moore 1993, Optimization and Dynamical Systems (Springer-Verlag, Berlin).
20. N. Jacobson 1979, Lie Algebras (New York: Dover).
21. A. K. Jain 1989, Fundamentals of Digital Image Processing (Englewood Cliffs, NJ: Prentice-Hall).
22. W.-S. Lu, H.-P. Wang and A. Antoniou 1990, "Design of two-dimensional FIR digital filters by using the singular value decomposition," IEEE Trans. on Circuits and Systems CAS-37, 35–46.
23. J. R. Magnus and H. Neudecker 1988, Matrix Differential Calculus With Applications in Statistics and Econometrics (Wiley Series in Probability and Mathematical Statistics, John Wiley & Sons).
24. B. C. Moore 1981, "Principal component analysis in linear systems: Controllability, observability and model reduction," IEEE Trans. on Automatic Control AC-26(1), 17–31.
25. O. Nestares and R. Navarro 2001, "Probabilistic estimation of optical flow in multiple band-pass directional channels," Image and Vision Computing Journal 19(6), 339–351.
26. E. Oja, "Neural networks, principal components and subspaces," International Journal of Neural Systems 1, 61–68.
27. T. D. Sanger 1994, "Two iterative algorithms for computing the singular value decomposition from input/output samples," in J. D. Cowan, G. Tesauro and J. Alspector (eds), Advances in Neural Information Processing Systems 6 (Morgan Kaufmann Publishers Inc.), 144–151.
28. M. Salmeron, J. Ortega, C. G. Puntonet and A. Prieto 2001, "Improved RAN sequential prediction using orthogonal techniques," Neurocomputing 41(1–4), 153–172.
29. T. Sasagawa 1982, "On the finite escape phenomena for matrix Riccati equations," IEEE Trans. on Automatic Control AC-27(4), 977–979.
30. S. T. Smith 1991, "Dynamical systems that perform the SVD," Systems and Control Letters 15, 319–327.
31. R. Vaccaro (ed.) 1991, SVD and Signal Processing II: Algorithms, Analysis and Applications (Amsterdam, Elsevier Science).
32. A. Weingessel and K. Hornik 1997, "SVD algorithms: APEX-like versus subspace methods," Neural Processing Letters 5, 177–184.
33. A. Weingessel 1999, An Analysis of Learning Algorithms in PCA and SVD Neural Networks (Ph.D. Dissertation, Technical University of Wien, Austria).