IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 46, NO. 7, JULY 1998

Fast Subspace Tracking and Neural Network Learning by a Novel Information Criterion

Yongfeng Miao and Yingbo Hua, Senior Member, IEEE

Manuscript received July 21, 1997; revised October 13, 1997. This work was supported by the Australian Cooperative Research Centre for Sensor Signal and Information Processing (CSSIP) and the Australian Research Council. The associate editor coordinating the review of this paper and approving it for publication was Prof. Yu-Hen Hu. The authors are with the Department of Electrical and Electronic Engineering, The University of Melbourne, Parkville, Victoria, Australia.

Abstract—We introduce a novel information criterion (NIC) for searching for the optimum weights of a two-layer linear neural network (NN). The NIC exhibits a single global maximum attained if and only if the weights span the (desired) principal subspace of a covariance matrix. The other stationary points of the NIC are (unstable) saddle points. We develop an adaptive algorithm based on the NIC for estimating and tracking the principal subspace of a vector sequence. The NIC algorithm provides fast on-line learning of the optimum weights for the two-layer linear NN. We establish the connections between the NIC algorithm and the conventional mean-square-error (MSE) based algorithms such as Oja's algorithm, LMSER, PAST, APEX, and GHA. The NIC algorithm has several key advantages, such as faster convergence, which is illustrated through analysis and simulation.

I. INTRODUCTION

IT IS KNOWN that there is a close relationship between a class of linear neural networks (NN's) and the concept of principal subspace [1], [2]. The concept of principal subspace is often referred to as principal subspace analysis (PSA) in the context of statistical analysis. When we are interested in the orthonormal eigenvectors spanning a principal subspace, we speak of principal component analysis (PCA). Both PSA and PCA also represent the desired function of a class of linear NN's when the weights of the NN's span a principal subspace. The process of finding the proper weights is called "learning."

A class of learning algorithms such as Oja's subspace algorithm [24], the symmetric error correction algorithm [5], and the symmetric version of the back propagation algorithm [6] were developed in the past based on some heuristic reasoning. However, the work in [2] showed that all these algorithms turn out to be identical. This class of algorithms will be referred to as Oja's algorithm. The convergence property of Oja's algorithm remained a mystery for some time until the analyses done in [8]–[10], [32]. An improved version of Oja's algorithm, called the least mean square error reconstruction (LMSER) algorithm, was developed in [7], where the well-known concept of gradient searching was applied to minimize a mean squared error (MSE). Unlike Oja's algorithm, the LMSER algorithm could be claimed to be globally convergent since the global minimum of the MSE is only achieved by the principal subspace, and all the other stationary points of the MSE are saddle points. The MSE criterion has also led to many other algorithms, which include the projection approximation subspace tracking (PAST) algorithm [16], the conjugate gradient method [17], and the Gauss–Newton method [18]. It is clear that a properly chosen criterion is a very important part in developing any learning algorithm.

In this paper, we introduce a novel information criterion (NIC) for searching for the optimum weights of a two-layer linear NN [see Fig. 2(a)]. The NIC exhibits a single global maximum attained if and only if the weights span the (desired) principal subspace of a covariance matrix. The other stationary points of the NIC are (unstable) saddle points. Unlike the MSE, the NIC is nonquadratic and has a steep landscape along the trajectory from a small weight matrix to the optimum one. Applying gradient ascent searching to the NIC yields the NIC algorithm, which is globally convergent and fast.

The rest of this paper is organized as follows. Section II lists some notational symbols and introduces the conventional quadratic PSA formulation. In Section III, we propose the NIC formulation for PSA and depict its landscape. Comparisons are made to the conventional formulation and the mutual information criterion (MIC). The NIC algorithm is derived in Section IV with comparisons to a number of existing PCA/PSA algorithms and some batch-mode eigenvalue decomposition (EVD) algorithms. Section V deals with the asymptotic global convergence analysis of the NIC algorithm using the Lyapunov function approach. In Section VI, we investigate various potential applications of the NIC algorithm, which include two-layer linear NN learning, signal subspace tracking, and adaptive direction of arrival (DOA) estimation and tracking. Performance of the NIC algorithm and other algorithms is demonstrated and compared through simulation examples. Conclusions are drawn in Section VII.
II. PRELIMINARIES

A. Notations and Acronyms

Some notational symbols and acronyms are listed below.

R^{m×n}            Set of all real m × n matrices (R^m: set of all m-dimensional real vectors).
A^T                Transpose of a matrix A.
tr(A)              Trace of A.
rank(A)            Rank of A.
vec(A)             Vector formed by stacking columns of A one beneath the other.
ln(A)              Natural logarithm of a symmetric positive definite matrix A.
det(A)             Determinant of A.
||A||_F            Frobenius norm of A.
||a||              Euclidean norm of a.
⊗                  Kronecker product of two matrices.
A > B (A ≥ B)      Difference matrix A − B is positive (nonnegative) definite.
I_m                m × m identity matrix.
0                  Null matrix or vector.
diag(a_1, …, a_m)  Diagonal matrix with diagonal elements a_1, …, a_m.
K_{mn}             Commutation matrix such that K_{mn} vec(A) = vec(A^T).
E, d, ∂, ∇, H      Expectation, differential, partial differential, gradient, and Hessian operators, respectively.

We use the following acronyms in this paper.

APEX   adaptive principal component extraction;
EVD    eigenvalue decomposition;
MIC    mutual information criterion;
NIC    novel information criterion;
PAST   projection approximation subspace tracking.

B. Conventional PSA Formulation

Suppose that the vector sequence {x_k ∈ R^n} is a stationary stochastic process with zero mean and covariance matrix R = E[x_k x_k^T]. Let {λ_i} and {u_i} denote the eigenvalues and the corresponding orthonormal eigenvectors of R. We shall arrange the orthonormal eigenvectors u_1, …, u_n in such a way that the corresponding eigenvalues are in a nonascending order: λ_1 ≥ λ_2 ≥ … ≥ λ_n ≥ 0. Note that we do not constrain R to be nonsingular.

In some applications, x_k may be composed of r independent signal components embedded in an n-dimensional noise with r < n. For example, in array processing, x_k is often composed of r narrowband signal waves impinging on an n-element antenna array with white receiver noise [3]. In this case, the last n − r eigenvalues of R are caused by noise that must be separated from the signal. The data x_k may also represent a sequence of image or speech phonemes, and we may be interested in storing them using a minimum amount of memory [15]. The PCA or PSA produces an optimal solution to these applications in the sense that it minimizes the MSE between x_k and its reconstruction or maximizes the variance of y_k defined by the linear transformation

    y_k = W^T x_k                                                        (1)

where W ∈ R^{n×r} denotes the optimal weight matrix whose columns span the same space as u_1, …, u_r, which is commonly referred to as the principal subspace, and y_k ∈ R^r is the low-dimensional representation of x_k. If W = [u_1, …, u_r], then (1) performs the true PCA. While the PCA needs to yield the exact principal eigenvectors, the PSA is sufficient to yield the optimal solution by an arbitrary orthonormal basis spanning the principal subspace.

Conventionally, the PSA is formulated into either of the following two optimization frameworks [7], [11]:

• Maximize the variance of y_k with an orthonormality constraint on W, i.e.,

    max_W  tr(W^T R W)   subject to   W^T W = I_r                       (2)

• Minimize the MSE in reconstructing x_k from y_k, i.e.,

    min_W  J_MSE(W) = E||x_k − W W^T x_k||^2
                    = tr(R) − 2 tr(W^T R W) + tr(W^T R W W^T W)          (3)

It is shown in [7] that the stationary points of J_MSE(W) satisfy W^T W = I_r. Therefore, by imposing the constraint W^T W = I_r on (3), it can be seen that the two frameworks are in fact equivalent.
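As a quick numerical check of (2) and (3) (not part of the paper), the sketch below forms a sample covariance matrix R from synthetic zero-mean data, takes its r dominant orthonormal eigenvectors as W, and verifies that this choice both increases the variance tr(W^T R W) and decreases the reconstruction MSE relative to a random orthonormal W. The data model, dimensions, and variable names are illustrative assumptions.

```python
# Minimal sketch (not from the paper): the r dominant eigenvectors of R
# maximize the variance criterion (2) and minimize the MSE criterion (3)
# over orthonormal W. Dimensions and the synthetic data are assumptions.
import numpy as np

rng = np.random.default_rng(0)
n, r, num_samples = 8, 3, 10_000

# Synthetic zero-mean data: r dominant directions plus white noise.
A = rng.standard_normal((n, r))
X = A @ rng.standard_normal((r, num_samples)) + 0.1 * rng.standard_normal((n, num_samples))
R = (X @ X.T) / num_samples                      # sample covariance matrix

def j_var(W, R):
    """Variance criterion tr(W^T R W) from (2)."""
    return np.trace(W.T @ R @ W)

def j_mse(W, R):
    """Reconstruction MSE E||x - W W^T x||^2 from (3), expanded form."""
    return np.trace(R) - 2 * np.trace(W.T @ R @ W) + np.trace(W.T @ R @ W @ W.T @ W)

# PSA solution: r dominant orthonormal eigenvectors of R (eigh sorts ascending).
W_psa = np.linalg.eigh(R)[1][:, ::-1][:, :r]

# Any other orthonormal W is feasible for (2) but suboptimal.
W_rand = np.linalg.qr(rng.standard_normal((n, r)))[0]

print(j_var(W_psa, R) > j_var(W_rand, R))        # True: larger variance
print(j_mse(W_psa, R) < j_mse(W_rand, R))        # True: smaller reconstruction MSE
```

The explicit eigendecomposition here only provides a reference solution; the adaptive algorithms compared in the paper avoid batch EVD and instead update W recursively from the data.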
The landscape of J_MSE(W) has been shown to have a global minimum at the principal subspace, with all the other stationary points being saddles [7], [16]. Based on this fact, the LMSER algorithm was derived by following the exact gradient descent rule to minimize J_MSE(W). Oja's subspace algorithm can also be derived as an approximate gradient algorithm from the LMSER [11], [12]. Although both algorithms are able to converge to the principal subspace solution, their convergence speed is slow due to the slow convergence of gradient searching of quadratic criteria. The nonquadratic NIC formulation shown next has a better defined landscape for PSA, which improves the convergence of gradient searching.

III. NOVEL INFORMATION CRITERION FORMULATION FOR PSA

Various nonquadratic generalizations of the above quadratic formulation have been proposed [11], which generally yield robust PCA/PSA solutions that are totally different from the standard ones. In contrast, the following NIC formulation not only retains the standard PSA solution but also results in a fast PSA algorithm with some attractive properties.

A. NIC Formulation for PSA

Given W ∈ R^{n×r} in the domain where W^T R W is positive definite, we propose the following framework for PSA:

    max_W  J_NIC(W) = tr[ln(W^T R W)] − tr(W^T W)                        (4)

where the matrix logarithm ln(·) is defined in [4]. This criterion is referred to as the NIC because it is different from all existing PSA criteria and is also closely related to the mutual information criterion (MIC) shown in [10] and [19]. Nevertheless, it is worth noting that despite their similar appearance, the NIC and the MIC are in fact very different, as will be shown later in this section. The landscape of the NIC is depicted by the following two theorems. Since the matrix differential method will be used extensively in deriving derivatives, we present
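The two theorems and the NIC algorithm itself are developed after this point in the paper. Purely as a hedged sketch of how gradient ascent on (4) behaves: differentiating (4) gives a gradient proportional to R W (W^T R W)^{-1} − W (constants absorbed into the step size below), so a batch ascent on a fixed covariance can be written as follows. The step size eta, iteration count, starting point, and the diagonal test matrix are illustrative choices, not values from the paper, and this is not the paper's Section IV algorithm.

```python
# Hedged sketch: plain batch gradient ascent on the NIC criterion (4),
# J_NIC(W) = tr[ln(W^T R W)] - tr(W^T W), for a fixed covariance R.
# Step size, iteration count, and test matrix are illustrative assumptions.
import numpy as np

def nic(W, R):
    """Evaluate (4); requires W^T R W to be positive definite."""
    sign, logdet = np.linalg.slogdet(W.T @ R @ W)   # tr[ln(S)] = ln det(S) for S > 0
    assert sign > 0, "W^T R W must be positive definite"
    return logdet - np.trace(W.T @ W)

def nic_ascent(R, r, eta=0.2, iters=200, seed=0):
    """Iterate W <- W + eta * (R W (W^T R W)^{-1} - W) from a small random start."""
    rng = np.random.default_rng(seed)
    W = 0.1 * rng.standard_normal((R.shape[0], r))
    for _ in range(iters):
        W = W + eta * (R @ W @ np.linalg.inv(W.T @ R @ W) - W)
    return W

# Example: for this covariance, the r = 2 principal subspace is spanned by the
# first two coordinate axes, and the ascent settles on an orthonormal basis of
# that subspace (the NIC's global maximum).
R = np.diag([5.0, 4.0, 3.0, 1.0, 0.5])
W = nic_ascent(R, r=2)
print(np.round(W.T @ W, 3))    # close to the 2 x 2 identity
print(np.round(W[2:], 3))      # rows outside the principal subspace are close to zero
print(round(nic(W, R), 3))     # value of (4) at the converged W
```

With a small starting W, the factor (W^T R W)^{-1} keeps the ascent step large, which mirrors the steep landscape from a small weight matrix to the optimum one noted in the Introduction.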