Principal Composite Kernel Feature Analysis: Data-Dependent Kernel Approach

Yuichi Motai, Senior Member, IEEE, and Hiroyuki Yoshida, Member, IEEE

Y. Motai is with the Sensory Intelligent Laboratory, Department of Electrical and Computer Engineering, School of Engineering, Virginia Commonwealth University, PO Box 843068, 601 West Main Street, Richmond, VA 23284-3068. E-mail: [email protected].
H. Yoshida is with 3D Imaging Research, Department of Radiology, Harvard Medical School and Massachusetts General Hospital, 25 New Chardon Street, Suite 400C, Boston, MA 02114. E-mail: [email protected].

Abstract—Principal composite kernel feature analysis (PC-KFA) is presented to show kernel adaptations for nonlinear features of medical image data sets (MIDS) in computer-aided diagnosis (CAD). The proposed algorithm PC-KFA extends existing studies on kernel feature analysis (KFA), which extracts salient features from a sample of unclassified patterns by use of a kernel method. The principal composite process of PC-KFA has been applied to kernel principal component analysis [34] and to our previously developed accelerated kernel feature analysis [20]. Unlike other kernel-based feature selection algorithms, PC-KFA iteratively constructs a linear subspace of a high-dimensional feature space by maximizing a variance condition for the nonlinearly transformed samples, which we call the data-dependent kernel approach. The resulting kernel subspace is first chosen by principal component analysis, and is then processed into a composite kernel subspace through efficient combination representations used for further reconstruction and classification. Numerical experiments based on several MID feature spaces of cancer CAD data have shown that PC-KFA generates an efficient and effective feature representation, and has yielded better classification performance for the proposed composite kernel subspace using a simple pattern classifier.

Index Terms—Principal component analysis, data-dependent kernel, nonlinear subspace, manifold structures

1 INTRODUCTION

CAPITALIZING on the recent success of kernel methods in pattern classification [62], [63], [64], [65], Schölkopf and Smola [34] developed and studied a feature selection algorithm in which principal component analysis (PCA) was effectively applied to a sample of n, d-dimensional patterns that are first injected into a high-dimensional Hilbert space using a nonlinear embedding. Heuristically, embedding input patterns into a high-dimensional space may elucidate salient nonlinear features in the input distribution, in the same way that nonlinearly separable classification problems may become linearly separable in higher dimensional spaces, as suggested by the Vapnik-Chervonenkis theory [14]. Both the PCA and the nonlinear embedding are facilitated by a Mercer kernel of two arguments k : R^d × R^d → R, which effectively computes the inner product of the transformed arguments. This algorithm, called kernel principal component analysis (KPCA), thus avoids the problem of representing transformed vectors in the Hilbert space, and enables the computation of the inner product of two transformed vectors of an arbitrarily high dimension in constant time. Nevertheless, KPCA has two deficiencies: 1) the computation of the principal components involves the solution of an eigenvalue problem that requires O(n³) computations, and 2) each principal component in the Hilbert space depends on every one of the n input patterns, which defeats the goal of obtaining both an informative and concise representation.

Both of these deficiencies have been addressed in subsequent investigations that seek sets of salient features that only depend upon sparse subsets of transformed input patterns. Tipping [43] applied a maximum-likelihood technique to approximate the transformed covariance matrix in terms of such a sparse subset. Franc and Hlaváč [21] proposed a greedy method, which approximates the mapped space representation by selecting a subset of input data. It iteratively extracts the data in the mapped space until the reconstruction error in the mapped high-dimensional space falls below a threshold value. Its computational complexity is O(nm³), where n is the number of input patterns and m is the cardinality of the subset. Zheng et al. [56] split the input data into M groups of similar size, and then applied KPCA to each group. A set of eigenvectors was obtained for each group. KPCA was then applied to a subset of these eigenvectors to obtain a final set of features. Although these studies proposed useful approaches, none provided a method that is both computationally efficient and accurate.

To avoid the O(n³) eigenvalue problem, Mangasarian et al. [16] proposed sparse kernel feature analysis (SKFA), which extracts l features, one by one, using an l1-constraint on the expansion coefficients. SKFA requires only O(l²n²) operations, and is, thus, a significant improvement over KPCA if the number of dominant features is much less than the data size. However, if l > √n, then the computational cost of SKFA is likely to exceed that of KPCA.

In this paper, we propose an accelerated kernel feature analysis (AKFA) that generates l sparse features from a data set of n patterns using O(ln²) operations. Since AKFA is based on both KPCA and SKFA, we analyze the former algorithms, that is, KPCA and SKFA, and then describe AKFA in Section 2.

We have evaluated other existing multiple kernel learning (MKL) approaches [66], [68], and found that those approaches do not rely much on the data sets to combine and choose the kernel functions. The choice of an appropriate kernel function reflects prior knowledge concerning the problem at hand. However, it is often difficult to exploit prior knowledge on patterns for choosing a kernel function, and how to choose the best kernel function for a given data set remains an open question. According to the no free lunch theorem [40] on machine learning, there is no superior kernel function in general, and the performance of a kernel function depends on the application, specifically on the data sets. The five kernel functions, linear, polynomial, Gaussian, Laplace, and sigmoid, are chosen because they are known to have good performance [40], [41], [42], [43], [44], [45].

The main contribution of this paper is principal composite kernel feature analysis (PC-KFA), described in Section 3. In this new approach, kernel adaptation is employed in the kernel algorithms above, KPCA and AKFA, in the form of the best kernel selection, the engineering of a composite kernel as a combination of data-dependent kernels, and the choice of the optimal number of kernels to combine. Other MKL approaches combine basic kernels, but our proposed PC-KFA specifically chooses data-dependent kernels as linear composites.

In Section 4, we summarize numerical evaluation experiments based on medical image data sets (MIDs) in computer-aided diagnosis (CAD) using the proposed PC-KFA:

1. to choose the kernel function,
2. to evaluate feature representation by calculating reconstruction errors,
3. to choose the number of kernel functions,
4. to composite the multiple kernel functions,
5. to evaluate feature classification using a simple classifier, and
6. to analyze the computation time.

Our conclusions appear in Section 5.

2 KERNEL FEATURE ANALYSIS

2.1 Kernel Basics

Using Mercer's theorem [15], a nonlinear, positive-definite kernel k : R^d × R^d → R of an integral operator can be computed by the inner product of the transformed vectors ⟨φ(x), φ(y)⟩, where φ : R^d → H denotes a nonlinear embedding (induced by k) into a possibly infinite-dimensional Hilbert space H. Given n sample points in the domain X_n = {x_i ∈ R^d | i = 1, ..., n}, the image Y_n = {φ(x_i) | i = 1, ..., n} of X_n spans a linear subspace of at most (n − 1) dimensions. By mapping the sample points into a higher dimensional space H, the dominant linear correlations in the distribution of the image Y_n may elucidate important nonlinear dependences in the original data sample X_n. This is beneficial because it permits making PCA nonlinear without complicating the original PCA algorithm. Let us introduce the kernel matrix K as a Hermitian and positive semi-definite matrix that computes the inner product between any finite sequence of inputs x := {x_j : j ∈ N_n} and is defined as

K := (K(x_i, x_j) : i, j ∈ N_n) = (φ(x_i) · φ(x_j)).

Commonly used kernel matrices are as follows [34]:

- The linear kernel:
  K(x, x_i) = x^T x_i,  (1)

- The polynomial kernel:
  K(x, x_i) = (x^T x_i + offset)^d,  (2)

- The Gaussian RBF kernel:
  K(x, x_i) = exp(−‖x − x_i‖² / 2σ²),  (3)

- The Laplace RBF kernel:
  K(x, x_i) = exp(−σ‖x − x_i‖),  (4)

- The sigmoid kernel:
  K(x, x_i) = tanh(κ₀ x^T x_i + κ₁),  (5)

- The ANOVA RB kernel:
  K(x, x_i) = Σ_{k=1}^n exp(−σ(x^k − x_i^k)²)^d,  (6)

- The linear spline kernel in one dimension:
  K(x, x_i) = 1 + x x_i min(x, x_i) − ((x + x_i)/2) min(x, x_i)² + min(x, x_i)³/3.  (7)

Kernel selection is heavily dependent on the specific data set. Currently, the most commonly used kernel functions are the Gaussian and Laplace RBF for general purposes when prior knowledge of the data is not available. The Gaussian kernel avoids the sparse distribution, while a high-degree polynomial kernel may cause the space distribution in a large feature space. The polynomial kernel is widely used in image processing, while ANOVA RB is often used for regression tasks. The spline kernels are useful for continuous signal processing algorithms that involve B-spline inner products or the convolution of several spline basis functions. Thus, in this paper, we will adopt only the first five kernels in (1)-(5).
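For readers who want to experiment with the candidate kernels, the short sketch below builds the Gram matrices of the five kernels (1)-(5) with NumPy. It is only an illustrative sketch, not the authors' implementation; the function names and the default values of offset, d, σ, κ₀, and κ₁ are assumptions chosen to make the example runnable.

```python
import numpy as np

def linear_kernel(X, Y):
    # (1): K(x, x_i) = x^T x_i
    return X @ Y.T

def polynomial_kernel(X, Y, offset=1.0, degree=3):
    # (2): K(x, x_i) = (x^T x_i + offset)^d  (degree/offset are illustrative)
    return (X @ Y.T + offset) ** degree

def gaussian_rbf_kernel(X, Y, sigma=1.0):
    # (3): K(x, x_i) = exp(-||x - x_i||^2 / (2 sigma^2))
    sq = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T
    return np.exp(-np.maximum(sq, 0.0) / (2 * sigma**2))

def laplace_rbf_kernel(X, Y, sigma=1.0):
    # (4): K(x, x_i) = exp(-sigma ||x - x_i||)
    sq = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T
    return np.exp(-sigma * np.sqrt(np.maximum(sq, 0.0)))

def sigmoid_kernel(X, Y, kappa0=1.0, kappa1=0.0):
    # (5): K(x, x_i) = tanh(kappa0 x^T x_i + kappa1)
    return np.tanh(kappa0 * (X @ Y.T) + kappa1)

# Example: Gram matrices of the five candidate kernels for a random sample.
X = np.random.default_rng(0).normal(size=(50, 8))
grams = {
    "linear": linear_kernel(X, X),
    "polynomial": polynomial_kernel(X, X),
    "gaussian": gaussian_rbf_kernel(X, X),
    "laplace": laplace_rbf_kernel(X, X),
    "sigmoid": sigmoid_kernel(X, X),
}
```

Any of these Gram matrices can then be centered and decomposed as described in Section 2.2.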
A choice of appropriate kernel functions as a generic learning element has been a major problem, since classification accuracy itself heavily depends on the kernel selection. For example, Amari and Wu [66] modified the kernel function by extending the Riemannian geometry structure induced by the kernel. Souza and Carvalho [33] proposed selecting the hyperplane parameters by using k-fold cross validation and leave-one-out criteria. Ding and Dubchak [57] proposed an ad hoc ensemble learning approach where multiclass k-nearest neighborhood classifiers were individually trained on each feature space and later combined. Damoulas and Girolami [31] proposed the use of four additional feature groups to replace the amino-acid composition. Pavlidis et al. [58] performed feature selection for SVMs by combining the feature scaling technique with the leave-one-out error bound. Chapelle et al. [59] tuned multiple parameters for two-norm SVMs by minimizing the radius-margin bound or the span bound. Ong et al. [60] applied semidefinite programming to learn the kernel function by hyperkernels. Lanckriet et al. [61] designed the kernel matrix directly by semidefinite programming.

MKL has been considered as a solution to make the kernel choice in a feasible manner. Amari and Wu [66] proposed a method of modifying a kernel function to improve the performance of a support vector machine classifier based on the Riemannian geometrical structure induced by the kernel function. The idea was to enlarge the spatial resolution around the separating boundary surface by a conformal mapping such that the separability between classes can be increased in the kernel space. The experimental results showed remarkable improvement in generalization errors. Rakotomamonjy et al. [68] adopted an MKL method to learn a kernel and the associated predictor in supervised learning settings at the same time. This study illustrated the usefulness of MKL for some regressions based on wavelet kernels and for some model selection problems related to multiclass classification.

In this paper, we propose a single multiclass kernel machine that is able to operate on all groups of features simultaneously and adaptively combine them. This new framework provides a new and efficient way of incorporating multiple feature characteristics without increasing the number of required classifiers. The proposed approach is based on the ability to embed each object description [47] via the kernel trick into a kernel Hilbert space. This process applies a similarity measure to every feature space. We show in this paper that these similarity measures can be combined in the form of the composite kernel space. We design a new single multiclass kernel machine that can operate on composite spaces effectively by evaluating principal components of the number of kernel feature spaces. A hierarchical multiclass model enables us to learn the significance of each source/feature space, and the predictive term computed by the corresponding kernel weights may provide the regressors and the kernel parameters without resorting to ad hoc ensemble learning, the combination of binary classifiers, or unnecessary parameter tuning.
2.2 Kernel Principal Component Analysis

KPCA uses a Mercer kernel [34] to perform a linear PCA. The gray-level image of X_n of computed tomographic colonography (CTC) has been centered so that the scatter matrix of the data is given by S = Σ_{i=1}^n φ(x_i) φ(x_i)^T. Eigenvalues λ_j and eigenvectors e_j are obtained by solving

λ_j e_j = S e_j = Σ_{i=1}^n φ(x_i) φ(x_i)^T e_j = Σ_{i=1}^n ⟨e_j, φ(x_i)⟩ φ(x_i),  (8)

for j = 1, ..., n. Since φ is not known, (8) must be solved indirectly, as proposed in the next section. Let us introduce the inner product of the transformed vectors by

a_{ji} = (1/λ_j) ⟨e_j, φ(x_i)⟩,

where

e_j = Σ_{i=1}^n a_{ji} φ(x_i).  (9)

Multiplying by φ(x_q)^T on the left, for q = 1, ..., n, and substituting yields

λ_j ⟨φ(x_q), e_j⟩ = Σ_{i=1}^n ⟨e_j, φ(x_i)⟩ ⟨φ(x_q), φ(x_i)⟩.  (10)

Substitution of (9) into (10) produces

λ_j ⟨φ(x_q), Σ_{i=1}^n a_{ji} φ(x_i)⟩ = Σ_{i=1}^n ( Σ_{k=1}^n a_{jk} ⟨φ(x_k), φ(x_i)⟩ ) ⟨φ(x_q), φ(x_i)⟩,  (11)

which can be rewritten as λ_j K a_j = K² a_j, where K is an n × n Gram matrix with elements k_{ij} = ⟨φ(x_i), φ(x_j)⟩ and a_j = [a_{j1} a_{j2} ... a_{jn}]^T. The latter is a dual eigenvalue problem equivalent to the problem

λ_j a_j = K a_j.  (12)

Note that ‖a_j‖² = 1/λ_j. For example, we may choose a Gaussian kernel such as

k_{ij} = ⟨φ(x_i), φ(x_j)⟩ = exp(−‖x_i − x_j‖² / 2σ²).  (13)

Please note that if the image of X_n (the finite sequence of inputs x := {x_j : j ∈ N_n}) is not centered in the Hilbert space, we need to use the centered Gram matrix K̂ deduced by Mangasarian et al. [16] by applying

K̂ = K − KT − TK + TKT,  (14)

where K is the Gram matrix of the uncentered data, and T is the n × n matrix whose entries are all equal to 1/n.

Let us keep the l eigenvectors associated with the l largest eigenvalues; then we can reconstruct data in the mapped space as φ'_i = Σ_{j=1}^l ⟨φ_i, e_j⟩ e_j = Σ_{j=1}^l β_{ji} e_j, where β_{ji} = ⟨φ_i, Σ_{k=1}^n a_{jk} φ_k⟩ = Σ_{k=1}^n a_{jk} k_{ik}. For the experimental evaluation, we introduce the squared reconstruction error of each data point φ_i, i = 1, ..., n,

Err_i = ‖φ_i − φ'_i‖² = k_{ii} − Σ_{j=1}^l β_{ji}².
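The following sketch shows one way to carry out the KPCA computation of (12)-(14) and the reconstruction error Err_i numerically. It is a minimal NumPy illustration of the formulas above, not the authors' Matlab code; the function name and the tolerance used to drop near-zero eigenvalues are assumptions.

```python
import numpy as np

def kpca_reconstruction_error(K, n_components):
    """Per-sample squared reconstruction error k_ii - sum_j beta_ji^2
    for KPCA with the n_components largest eigenvalues."""
    n = K.shape[0]
    T = np.full((n, n), 1.0 / n)
    Kc = K - K @ T - T @ K + T @ K @ T          # centering, cf. (14)
    lam, A = np.linalg.eigh(Kc)                  # dual eigenproblem, cf. (12)
    lam, A = lam[::-1], A[:, ::-1]               # sort eigenvalues descending
    keep = lam[:n_components] > 1e-12            # drop numerically zero modes (assumed tolerance)
    lam, A = lam[:n_components][keep], A[:, :n_components][:, keep]
    A = A / np.sqrt(lam)                         # enforce ||a_j||^2 = 1 / lambda_j
    B = Kc @ A                                   # beta_{ji} = sum_k a_{jk} k_{ik}
    err = np.diag(Kc) - np.sum(B ** 2, axis=1)   # Err_i = k_ii - sum_j beta_ji^2
    return err, lam
```

The mean of err reproduces, up to numerical precision, the quantity MErr = (1/n) Σ_{i=l+1}^n λ_i derived next.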

The mean square error is MErr = (1/n) Σ_{i=1}^n Err_i. Using (12), β_{ji} = λ_j a_{ji}. Therefore, the mean square reconstruction error is

MErr = (1/n) Σ_{i=1}^n ( k_{ii} − Σ_{j=1}^l λ_j² a_{ji}² ).

Since Σ_{i=1}^n k_{ii} = Σ_{i=1}^n λ_i and Σ_{i=1}^n a_{ji}² = ‖a_j‖² = 1/λ_j, we have MErr = (1/n) Σ_{i=l+1}^n λ_i.

The KPCA algorithm contains an eigenvalue problem of rank n, so the computational complexity of KPCA is O(n³). In addition, each resulting eigenvector is represented as a linear combination of n terms; the l features depend on all n image vectors of X_n. Thus, all data contained in X_n must be retained, which is computationally cumbersome and unacceptable for our applications.

2.3 Accelerated Kernel Feature Analysis

AKFA [20] is the method that we have proposed to improve the efficiency and accuracy of SKFA [16]. SKFA improves the computational costs of KPCA, associated with both time complexity and data retention requirements. SKFA was introduced in [16] and is summarized in three steps.

AKFA [20] has been proposed by the authors in an attempt to achieve further improvements: 1) it saves computation time by iteratively updating the Gram matrix, 2) it normalizes the images with the l2 constraint before the l1 constraint is applied, and 3) it optionally discards data that fall below a magnitude threshold during updates.

To achieve the computational efficiency described in "1," instead of extracting features directly from the original mapped space, AKFA extracts the ith feature based on the ith updated Gram matrix K^i, where each element is k^i_{jk} = ⟨φ^i_j, φ^i_k⟩.

The second improvement described in "2" above is to revise the l1 constraint. SKFA treats each individual sample as a possible direction and computes the projection variances with all data. Since SKFA includes its length in its projection variance calculation, it is biased toward selecting vectors with larger magnitude. We are ultimately looking for a direction with unit length, and when we choose an image vector as a possible direction, we ignore the length and only consider its direction for the improved accuracy of the features.

The third improvement in "3" is to discard negligible data and thereby eliminate unnecessary computations.

AKFA is described in three steps and realizes the improvements 1-3 [20]. The vector φ'_i represents the reconstructed new data based on AKFA, and it can be calculated indirectly using the kernel trick:

φ'_i = Σ_{j=1}^l ⟨φ_i, ψ_j⟩ ψ_j = Φ_l C_l C_l^T K_i,

where K_i = [k_{i,idx(1)} k_{i,idx(2)} ... k_{i,idx(l)}]^T. Then the reconstruction error of new data φ_i, i = n+1, ..., n+m, is represented as

Err_i = ‖φ_i − φ'_i‖² = k_{ii} − K_i^T C_l C_l^T K_i.

The AKFA algorithm also contains an eigenvalue problem of rank n. Step 1 requires O(n²) operations, Step 2 requires O(l²), and Step 3 requires O(n²) for improvement 1, O(i²) for improvement 2, and O(n²) for improvement 3. The total computational complexity increases to O(ln²) when no data are truncated during the updates in AKFA.

2.4 Comparison of the Relevant Kernel Methods

Multiple kernel adoption and combination methods are derived from the principle of empirical risk minimization, which performs well in most applications. To assess the expected risk, there is an increasing amount of literature focusing on theoretical approximation error bounds with respect to the kernel selection problem, for example, empirical risk minimization, structural risk minimization, approximation error, the span bound, the Jaakkola-Haussler bound, the radius-margin bound, and kernel linear discriminant analysis. Table 1 lists some comparative methods among the multiple kernel methods.

TABLE 1
Overview of Method Comparison for Parameter Tuning

The kernel approaches listed in Table 1 have somewhat overlapping principles and (dis)advantages, depending on the nature of the data. The proposed method described in Section 3 has a tradeoff between computational time and accuracy, but outperformed those counterparts, even if bias of the data set exists due to cancer screening purposes. The proposed method works on the condition that it measures the adaptability of a kernel to the target data. The introduced alignment measure provides a practical objective for kernel optimization, as a method for measuring the fitness between a kernel and the learning task.

3 PRINCIPAL COMPOSITE KERNEL FEATURE ANALYSIS

3.1 Kernel Selections

For kernel-based learning algorithms, the key challenge lies in the selection of kernel parameters and regularization parameters. Many researchers have identified this problem and, thus, have tried to solve it. However, the few existing solutions lack effectiveness, and thus this problem is still under development or regarded as an open problem. To this end, we are developing a new framework of kernel adaptation. Our method exploits the idea presented in [35] and [36], by exploring the data-dependent kernel methodology as follows.

Let {x_i, y_i} (i = 1, 2, ..., n) be the n d-dimensional training samples of the given data, where y_i ∈ {+1, −1} represents the class labels of the samples. We develop a data-dependent kernel to capture the relationship among the data in this classification task by adopting the idea of "conformal mapping" [36]. To adopt this conformal transformation technique, the data-dependent composite kernel for r = 1, 2, 3, 4, 5 can be formulated as

k_r(x_i, x_j) = q_r(x_i) q_r(x_j) p_r(x_i, x_j),  (15)

where p_r(x_i, x_j) is one kernel among the five chosen kernels and q_r(·), the factor function, takes the following form for r = 1, 2, 3, 4, 5:

q_r(x_i) = α_{r,0} + Σ_{m=1}^n α_{r,m} k_0(x_i, x_m),  (16)

where k_0(x_i, x_m) = exp(−‖x_i − x_m‖² / 2σ²), and α_{r,m} is the combination coefficient for the variable x_m. Let us denote the vectors (q_r(x_1), q_r(x_2), ..., q_r(x_n))^T and (α_{r,0}, α_{r,1}, ..., α_{r,n})^T by q_r and α_r (r = 1, 2, 3, 4, 5), respectively; then we have q_r = K_0 α_r, where K_0 is an n × (n + 1) matrix given by

K_0 = [ 1  k_0(x_1, x_1)  ...  k_0(x_1, x_n)
        1  k_0(x_2, x_1)  ...  k_0(x_2, x_n)
        ...
        1  k_0(x_n, x_1)  ...  k_0(x_n, x_n) ].  (17)

Let the kernel matrices corresponding to k_r(x_i, x_j), p_1(x_i, x_j), and p_2(x_i, x_j) be K_r, P_1, and P_2, respectively. We can express the data-dependent kernel K_r as

K_r = [ q_r(x_i) q_r(x_j) p_r(x_i, x_j) ]_{n×n}.  (18)

Defining Q_r as the diagonal matrix of the elements {q_r(x_1), q_r(x_2), ..., q_r(x_n)}, we can express (18) in the matrix form

K_r = Q_r P_r Q_r.  (19)

This kernel model was first introduced in [32] and called "conformal transformation of a kernel." We now perform kernel optimization based on this method to find the appropriate kernels for the data set.

The optimization of the data-dependent kernel in (19) is to set the value of the combination coefficient vector α_r so that the class separability of the training data in the mapped feature space is maximized. For this purpose, the Fisher scalar is adopted as the objective function of our kernel optimization. The Fisher scalar measures the class separability of the training data in the mapped feature space and is formulated as

J = tr(S_br) / tr(S_wr),  (20)

where S_br represents the "between-class scatter matrix" and S_wr the "within-class scatter matrix." Suppose that the training data are grouped according to their class labels, i.e., the first n_1 data belong to one class and the remaining n_2 data belong to the other class (n_1 + n_2 = n). Then, the basic kernel matrix P_r can be partitioned as

P_r = [ P_11^r  P_12^r
        P_21^r  P_22^r ],  (21)

where the sizes of the submatrices P_11^r, P_12^r, P_21^r, P_22^r, r = 1, 2, 3, 4, 5, are n_1 × n_1, n_1 × n_2, n_2 × n_1, and n_2 × n_2, respectively. A close relation between the class separability measure J and the kernel matrices has been established as

J(α_r) = (α_r^T M_0r α_r) / (α_r^T N_0r α_r),  (22)

where

M_0r = K_0^T B_0r K_0 and N_0r = K_0^T W_0r K_0,  (23)

and, for r = 1, 2, 3, 4, 5,

B_0r = [ (1/n_1) P_11^r  0
         0  (1/n_2) P_22^r ] − (1/n) P_r,  (24)

W_0r = diag(p_11^r, p_22^r, ..., p_nn^r) − [ (1/n_1) P_11^r  0
                                             0  (1/n_2) P_22^r ].

To maximize J(α_r) in (22), the standard gradient approach is followed. If the matrix N_0r is nonsingular, the optimal α_r that maximizes J_r(α_r) is the eigenvector corresponding to the maximum eigenvalue of the system obtained by taking the derivatives of (22):

M_0r α_r = λ_r N_0r α_r.  (25)

The criterion for selecting the best kernel function is to find the kernel that produces the largest eigenvalue from (25), i.e.,

λ* = argmax_r λ(N_0r^{−1} M_0r).  (26)

The idea behind this is to choose the maximum eigenvector α_r corresponding to the maximum eigenvalue that can maximize J_r(α_r), which will result in the optimum solution. We find the maximum eigenvalues for all possible kernel functions and arrange them in descending order to choose the most optimum kernels, such as

λ_1 > λ_3 > λ_4 > λ_2 > λ_5.  (27)

We choose the kernels corresponding to the largest eigenvalues, λ_1 and so on, and form composite kernels corresponding to {λ_1, λ_3, ...} as follows.
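The conformal (data-dependent) kernel of (15)-(19) can be assembled directly from a basic Gram matrix P_r and a coefficient vector α_r. The sketch below is a hedged illustration of that construction only; the choice of σ for k_0 and the function names are our assumptions, and the optimization of α_r via (25) is not shown.

```python
import numpy as np

def k0_design_matrix(X, sigma=1.0):
    """K_0 of (17): an n x (n+1) matrix whose first column is all ones and whose
    remaining entries are k_0(x_i, x_m) = exp(-||x_i - x_m||^2 / (2 sigma^2))."""
    sq = np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2 * X @ X.T
    k0 = np.exp(-np.maximum(sq, 0.0) / (2 * sigma**2))
    return np.hstack([np.ones((X.shape[0], 1)), k0])

def data_dependent_kernel(P, K0, alpha):
    """K_r = Q_r P_r Q_r of (19), with q_r = K_0 alpha_r and Q_r = diag(q_r)."""
    q = K0 @ alpha                           # factor function values q_r(x_i), cf. (16)
    return (q[:, None] * P) * q[None, :]     # equivalent to diag(q) @ P @ diag(q)
```

Wrapping each of the five basic kernels of (1)-(5) this way produces the five data-dependent candidates K_1, ..., K_5, among which (25)-(27) select by the magnitude of the leading eigenvalue.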

3.2 Kernel Combinatory Optimization

In this section, we propose a principal composite kernel function that is defined as the weighted sum of a set of different optimized kernel functions [41], [42]. To obtain an optimum kernel process, we define the following composite kernel:

K_comp(ρ) = Σ_{i=1}^p ρ_i Q_i P_i Q_i,  (28)

where ρ_i is the constant scalar value of the composite coefficient, and p is the number of kernels we intend to combine. Through this approach, the relative contribution of the kernels to the model can be varied over the input space. We note that in (28), instead of using K_r as a kernel matrix, we use K_comp as a composite kernel matrix. According to [46], K_comp satisfies Mercer's condition.

We use a linear combination of individual kernels to yield an optimal composite kernel using the concept of kernel alignment, the "conformal transformation of a kernel." The empirical alignment between kernel k_1 and kernel k_2 with respect to the training set S is the following quantity:

A(k_1, k_2) = ⟨K_1, K_2⟩_F / (‖K_1‖_F ‖K_2‖_F),  (29)

where K_i is the kernel matrix for the training set S using kernel function k_i, ‖K_i‖_F = √⟨K_i, K_i⟩_F, and ⟨K_i, K_j⟩_F is the Frobenius inner product between K_i and K_j. Here S = {(x_i, y_i) | x_i ∈ X, y_i ∈ {+1, −1}, i = 1, 2, ..., n}, X is the input space, and y is the target vector. Let K_2 = yy^T; then the empirical alignment between kernel K and the target vector y is

A(k, yy^T) = ⟨K, yy^T⟩_F / (‖K‖_F ‖yy^T‖_F) = y^T K y / (n ‖K‖_F).  (30)

It has been shown that if a kernel is well aligned with the target information, there exists a separation of the data with a low bound on the generalization error. Thus, we can optimize the kernel alignment based on training set information to improve the generalization performance on the test set. Let us consider the combination of kernel functions as follows:

k(η) = Σ_{i=1}^p η_i k_i,  (31)

where the individual kernels k_i, i = 1, 2, ..., p, are known in advance. Our purpose is to tune η to maximize A(η, k, yy^T), the empirical alignment between k(η) and the target vector y. Hence, we have

η̂ = arg max_η A(η, k, yy^T)  (32)

  = arg max_η [ ⟨Σ_i η_i K_i, yy^T⟩_F / √( n ⟨Σ_i η_i K_i, Σ_j η_j K_j⟩_F ) ]

  = arg max_η [ ( Σ_i η_i u_i )² / ( n Σ_{i,j} η_i η_j v_{ij} ) ]  (33)

  = arg max_η (1/n) (η^T U η) / (η^T V η),

where u_i = ⟨K_i, yy^T⟩_F, v_{ij} = ⟨K_i, K_j⟩_F, U_{ij} = u_i u_j, V_{ij} = v_{ij}, and η = [η_1, η_2, ..., η_p]^T. Let the generalized Rayleigh coefficient be

J(η) = (η^T U η) / (η^T V η).  (34)

Therefore, we can obtain the value of η̂ by solving the generalized eigenvalue problem

U η = δ V η,  (35)

where δ denotes the eigenvalues.

The PC-KFA algorithm contains an eigenvalue problem of rank n, so the computational complexity of PC-KFA is as follows: Step 1 requires O(n²) operations, Step 2 requires n, Step 3 requires n operations, and Step 4 requires n operations. The total computational complexity is O(n²).
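A compact way to realize (28)-(35) is to pose the alignment maximization as the generalized eigenvalue problem Uη = δVη and keep the leading eigenvector. The sketch below is one hedged reading of those equations: the exact definitions of U and V are partly garbled in the extracted text, so here U = uu^T with u_i = ⟨K_i, yy^T⟩_F and V_ij = ⟨K_i, K_j⟩_F, and V is assumed to be positive definite (i.e., the candidate kernel matrices are linearly independent).

```python
import numpy as np
from scipy.linalg import eigh

def composite_kernel(kernels, y):
    """kernels: list of p (n x n) data-dependent kernel matrices; y: labels in {+1, -1}.
    Returns the composite kernel of (28) with weights taken from the leading
    generalized eigenvector of U eta = delta V eta, cf. (33)-(35)."""
    Y = np.outer(y, y)
    u = np.array([np.sum(K * Y) for K in kernels])                  # u_i = <K_i, yy^T>_F
    V = np.array([[np.sum(Ki * Kj) for Kj in kernels] for Ki in kernels])
    U = np.outer(u, u)                                              # U_ij = u_i u_j
    _, vecs = eigh(U, V)                                            # generalized eigenproblem (35)
    eta = vecs[:, -1]                                               # eigenvector of the largest delta
    if eta.sum() < 0:                                               # fix an arbitrary sign
        eta = -eta
    K_comp = sum(w * K for w, K in zip(eta, kernels))               # weighted sum, cf. (28)
    return K_comp, eta
```

K_comp can then be fed to the same KPCA/AKFA machinery used for a single kernel.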

TABLE 2
Colon Cancer Data Set 1 (Low Resolution)

TABLE 3
Colon Cancer Data Set 2

TABLE 4
Colon Cancer Data Set 3

TABLE 5
Colon Cancer Data Set 4

4 EXPERIMENTAL ANALYSIS

4.1 Cancer Image Data Sets

4.1.1 Colon Cancer
This data set consisted of true-positive (TP) and false-positive (FP) detections obtained from our previously developed CAD scheme for the detection of polyps [5], when it was applied to a CTC image database. This database contained 146 patients who underwent a bowel preparation regimen with a standard precolonoscopy bowel-cleansing method. Each patient was scanned in both supine and prone positions, resulting in a total of 292 CT data sets. In the scanning, helical single-slice or multislice CT scanners were used, with collimations of 1.25-5.0 mm, reconstruction intervals of 1.0-5.0 mm, X-ray tube currents of 50-260 mA, and voltages of 120-140 kVp. In-plane voxel sizes were 0.51-0.94 mm, and the CT image matrix size was 512 × 512. Out of 146 patients, there were 108 normal cases and 38 abnormal cases with a total of 39 colonoscopy-confirmed polyps larger than 6 mm.

The CAD scheme was applied to the entire set of cases, and it generated a segmented region for each of its detections (a candidate polyp). A volume of interest (VOI) of size 64 × 64 × 64 voxels was placed at the center of mass of each candidate to encompass its entire region; then, it was resampled to 12 × 12 × 12 voxels. The resulting VOIs of 39 TP and 149 FP detections from the CAD scheme made up colon cancer data set 1.

Additional CTC image databases with a similar cohort of patients were collected from three different hospitals in the US. The VOIs obtained from these databases were resampled to 16 × 16 × 16 voxels. We refer to the resulting data sets as colon cancer data sets 2, 3, 4, 5, and 6; Tables 2, 3, 4, 5, and 6 give the distribution of the training and testing VOIs, respectively.
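A VOI preparation step like the one described above can be reproduced, for example, with scipy.ndimage.zoom; the snippet below is only a schematic of the resampling from 64³ to 12³ voxels, with a random array standing in for a real CT subvolume and a linear interpolation order chosen as an assumption.

```python
import numpy as np
from scipy.ndimage import zoom

voi = np.random.rand(64, 64, 64)               # placeholder for a 64x64x64 CT VOI
resampled = zoom(voi, zoom=12 / 64, order=1)   # downsample to 12x12x12 voxels
assert resampled.shape == (12, 12, 12)
feature_vector = resampled.ravel()             # 1,728-dimensional pattern for the kernel methods
```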

TABLE 6
Colon Cancer Data Set 5

TABLE 7
Breast Cancer Data Set

TABLE 8
Lung Cancer Data Set

TABLE 9
Lymphoma Data Set

TABLE 10
Prostate Cancer Data Set

4.1.2 Breast Cancer
We extended our own colon cancer data sets with other cancer-relevant data sets. This data set is available at http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE2990. It contains data on 189 women, 64 of whom were treated with tamoxifen, with primary operable invasive breast cancer, with a feature dimension of 22,283. More information on this data set can be found in [50].

4.1.3 Lung Cancer
This data set is available at http://www.broadinstitute.org/cgi-in/cancer/datasets.cgi. It contains 160 tissue samples, 139 of which are of class '0' and the remaining of class '2'. Each sample is represented by the expression levels of 1,000 genes for each feature dimension.

4.1.4 Lymphoma
This data set is available at http://www.broad.mit.edu/mpr/lymphoma. It contains 77 tissue samples, 58 of which are diffuse large B-cell lymphomas and the remainder follicular lymphomas, with a feature dimension of 7,129. Detailed information about this data set can be found in [48].

4.1.5 Prostate Cancer
This data set is collected from http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE6919. It contains prostate cancer data collected from 308 patients, 75 of whom have metastatic prostate tumors; the rest of the cases were normal, with a feature dimension of 12,553. More information on this data set can be found in [51] and [52].

4.2 Kernel Selection
We first evaluate the performance of the kernel selection according to the method proposed in Section 3.1, regarding how to select the kernel function that will best fit the data. The larger the eigenvalue is, the greater the class separability measure J in (22) is expected to be. Table 11 shows the calculation of the algorithm for all the data sets mentioned, to determine the eigenvalues of all five kernels. Specifically, we set the parameters such as d, offset, σ, κ₀, and κ₁ of each kernel in (1)-(5), and computed their eigenvalues for all nine data sets in Tables 2, 3, 4, 5, 6, 7, 8, 9, and 10. After arranging the eigenvalues for each data set in descending order, we selected the kernel corresponding to the largest eigenvalue as the optimum kernel.

The largest eigenvalue for each data set is highlighted in Table 11. After evaluating the quantitative eigenvalues for all nine data sets, we observed that the RBF kernel gives the maximum eigenvalue among all five kernels. That means that the RBF kernel produced the dominant results compared to the other four kernels. For five data sets, colon cancer data sets 2, 3, 4, 5, and 6, and for the lymphoma data set, the polynomial kernel produced the second largest eigenvalue. The linear kernel gave the second largest eigenvalue for colon cancer data set 1 and the lung cancer data set, whereas the Laplace kernel produced the second largest eigenvalue for the breast cancer data set.

TABLE 11
Eigenvalues of the Five Kernel Functions (1)-(5) and Their Selected Parameters

TABLE 12
Mean-Square Reconstruction Error of KPCA, SKFA, and AKFA with the Selected Kernel Function

TABLE 13
Mean-Square Reconstruction Error of KPCA with the Other Four Kernel Functions

TABLE 14
Linear Combination η̂ for the Selected Two Kernel Functions

As shown in Table 11, the Gaussian RBF kernel showed the largest eigenvalues for all nine data sets. The performance of the selected Gaussian RBF was compared to that of the other single kernel functions in terms of the reconstruction error value. As a further experiment, the reconstruction error results were evaluated for KPCA using MErr = (1/n) Σ_{i=l+1}^n λ_i, and for AKFA and SKFA using Err_i = ‖φ_i − φ'_i‖² = k_{ii} − K_i^T C_l C_l^T K_i, with the optimum kernel (RBF) selected from Table 11. We list the selected kernel, the dimensions of the eigenspace (chosen empirically), and the reconstruction errors of KPCA, SKFA, and AKFA for all the data sets in Table 12.

Table 12 shows that RBF, the single kernel selected, has a relatively small reconstruction error, from 3.27 percent up to 14.30 percent in KPCA. The reconstruction error of KPCA is less than that of AKFA, by 0.6 percent up to 6.29 percent. The difference in the reconstruction error between KPCA and AKFA increased as the size of the data sets increased. This could be due to the heterogeneous nature of the data sets. The lymphoma data set produced the smallest mean square error, whereas colon cancer data set 3 produced the largest mean-square error for both KPCA and AKFA.

Table 13 shows that the other four kernel functions have much more error than the Gaussian RBF shown in Table 12. The difference between Tables 12 and 13 is a reconstruction error more than four times larger, and sometimes 20 times larger, when the other four kernel functions are applied.
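For the KPCA entries of Tables 12 and 13, the mean-square reconstruction error can be read directly off the kernel eigenvalue spectrum, MErr = (1/n) Σ_{i=l+1}^n λ_i. The helper below computes this quantity; reporting it as a percentage of the total spectrum is our assumption about the normalization used in the tables.

```python
import numpy as np

def mean_square_reconstruction_error(eigvals, n_retained):
    """MErr = (1/n) * sum of the eigenvalues beyond the n_retained largest."""
    lam = np.sort(np.asarray(eigvals))[::-1]
    discarded = lam[n_retained:].sum()
    merr = discarded / lam.size
    percent = 100.0 * discarded / lam.sum()   # assumed percentage normalization
    return merr, percent
```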

TABLE 15
Mean-Square Reconstruction Error with Kernel Combinatory Optimization

TABLE 16
Classification Accuracy Using Six Nearest Neighborhoods for Single-Kernel and Two-Composite-Kernels with KPCA, SKFA, AKFA, and PC-KFA

TABLE 17
Overall Classification Comparison among Other Multiple Kernel Methods

4.3 Kernel Combination and Reconstruction
After selecting the number of kernels, we select the first p kernels that produced the p largest eigenvalues in Table 11, and combine them according to the method proposed in Section 3.2 to yield a smaller reconstruction error. Table 14 shows the coefficients calculated for the linear combination of kernels. After obtaining the linear coefficients according to (35), we combine the kernels according to (28) to generate the composite kernel matrix K_comp(η). Table 15 shows the reconstruction error results for both KPCA and AKFA along with the composite kernel K_comp(η).

The reconstruction error using two composite kernel functions, shown in Table 15, is smaller than the reconstruction error with the single kernel function RBF in Table 12. This leads us to claim that, for all nine data sets, the reconstruction ability of kernel-optimized KPCA and AKFA gives enhanced performance over that of single-kernel KPCA and AKFA. The specific improvement in the reconstruction error performance is greater by up to 4.27 percent in the case of KPCA, and by up to 5.84 and 6.12 percent in the cases of AKFA and SKFA by mean, respectively. The best improvement of the error performance is observed in PC-KFA, by 4.21 percent by mean. This improvement in reconstruction of all data sets is validated using PC-KFA. This successfully shows that the composite kernel produces only a small reconstruction error.

4.4 Kernel Combination and Classification
To analyze how the feature extraction methods affect the classification performance of polyp candidates, we used the k-nearest neighborhood classifier on the image vectors in the reduced eigenspace. We evaluated the performance of this simple classifier by applying it to the kernel feature spaces obtained by KPCA and AKFA with both the selected single kernel and the composite kernel for all nine data sets. Six nearest neighbors were used for the classification. The classification accuracy was calculated as (TP + TN) / (TP + TN + FN + FP). The results of the classification accuracy showed very high values, as shown in Table 16.

The results from Table 16 indicate that the classification accuracy of the composite kernel is better than that of the single kernel for both KPCA and AKFA in colon cancer data set 1, breast cancer, lung cancer, lymphoma, and prostate cancer, whereas in the case of colon cancer data sets 2, 3, 4, 5, and 6, because of the huge size of the data, the classification accuracy is very similar between single and composite kernels. From this quantitative characteristic among the entire nine data sets, we can conclude that the composite kernel improved the classification performance, and that with both single and composite kernels the classification performance of AKFA is equally as good as that of KPCA, from 85.48 percent up to 98.61 percent. The best classification performance was shown by PC-KFA, up to 99.70 percent.

4.5 Comparisons with Other Composite Kernel Learning Studies
In this section, we make experimental comparisons of the proposed PC-KFA with other popular MKL techniques, such as regularized kernel discriminant analysis (RKDA) for MKL [32], L2 regularization learning [71], and generality MKL [69], in Table 17, as follows.
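The accuracy figure of Section 4.4 can be reproduced with a six-nearest-neighbor classifier applied to the reduced kernel eigenspace. The sketch below uses scikit-learn for brevity (the original experiments were run in Matlab), and the function name and feature inputs are placeholders; for ±1 labels the mean of correct predictions equals (TP + TN) / (TP + TN + FP + FN).

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def knn_accuracy(train_feats, train_labels, test_feats, test_labels, k=6):
    """Six-nearest-neighbor classification accuracy in the reduced eigenspace."""
    clf = KNeighborsClassifier(n_neighbors=k).fit(train_feats, train_labels)
    pred = clf.predict(test_feats)
    return float(np.mean(pred == np.asarray(test_labels)))
```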

TABLE 18 PC-KFA Computation Time for Kernel Selection and Operation with KPCA, SKFA, AKFA, and PC-KFA

Fig. 1. The computation time comparison between KPCA, SKFA, AKFA, and PC-KFA as the data size increases.

To evaluate the algorithms [32], [71], [69], eight data sets are used in the binary-class case from the UCI machine learning repository [61], [72]. L2 regularization learning [71] reports a misclassification ratio, which may not be directly comparable to the other three methods. The proposed PC-KFA outperformed these representative approaches. For example, PC-KFA for lung was 94.12 percent, not as good as the performance of SMKL, but better than RKDA for MKL and GMKL. The classification accuracy of RKDA for MKL on data set 4 and prostate is better than that of GMKL. This result indicates that PC-KFA is very competitive with the well-known classifiers for multiple data sets.

4.6 Computation Time
We finally evaluate the computational efficiency of the proposed PC-KFA method by comparing its runtime with KPCA and AKFA for all nine data sets, as shown in Table 18. The algorithms have been implemented in Matlab R2007b using the statistical toolbox for the Gram matrix calculation and kernel projection. The processor was a 3.2-GHz Intel Pentium 4 CPU with 3 GB of RAM. Runtime was determined using the CPU time command.

For each algorithm, computation time increases with increasing training data size (n), as expected. AKFA requires the computation of a Gram matrix whose size increases as the data size increases. The results from the table clearly indicate that AKFA is faster than KPCA. We also noticed that the decrease in computation time for AKFA compared to KPCA was relatively small, implying that the use of AKFA on a smaller training data set does not yield much advantage over KPCA. However, as the data size increases, the computational gain for AKFA is much larger than that of KPCA, as shown in Fig. 1. PC-KFA shows more computational time, since the composite data-dependent kernels need calculations of a Gram matrix and optimization of the coefficient parameters.

Fig. 1 illustrates the increase in the computational time of both KPCA and AKFA corresponding to increased data. Using Table 18, we listed the sizes of all the data sets in ascending order, from lymphoma (77) to colon cancer data set 4 (3021), on the x-axis versus the respective computational times on the y-axis. The red curve indicates the computational time for KPCA, whereas the blue curve indicates the computational time for AKFA, for all nine data sets arranged in ascending order of their sizes. This curve clearly shows that as the size of the data increases, the computational gain of AKFA is greater. This indicates that AKFA is a powerful algorithm and that it approaches the performance of KPCA while allowing for significant computational savings.

5 CONCLUSION

This paper first describes AKFA, a faster and more efficient feature extraction algorithm derived from SKFA. The time complexity of AKFA is O(ln²), which has been shown to be more efficient than the O(l²n²) time complexity of SKFA and the O(n³) complexity of the more systematic kernel principal component analysis (KPCA). We have extended these methods into PC-KFA. By introducing a principal component metric in PC-KFA, the new criteria performed well in choosing the best kernel function adapted to the data set, as well as in extending this process of best kernel selection to additional kernel functions by calculating a linear composite kernel space. We conducted comprehensive experiments using nine cancer data sets for evaluating the reconstruction error, the classification accuracy using a k-nearest neighbor classifier, and the computational time. PC-KFA with KPCA and AKFA had a lower reconstruction error compared to the single kernel method, thus demonstrating that the features extracted by the composite kernel method are practically useful to represent the data sets. The composite kernel approach with KPCA and AKFA has the potential to yield high detection performance for polyps, resulting in accurate classification of cancers, compared to the single kernel method. The computation time was also evaluated across the variable data sizes, with a tradeoff between computational time and accuracy, and showed a comparative advantage of composite kernel AKFA.

ACKNOWLEDGMENTS

The authors would like to express gratitude for the support of this research by the American Cancer Society Institutional Research Grant through the Massey Cancer Center, the Presidential Research Incentive Program at Virginia Commonwealth University, US National Science Foundation ECCS #1054333 (CAREER Award), and USPHS Grant R01CA095279 from the National Cancer Institute. The work reported here would not be possible without the help of many of the past and present members of the laboratory, in particular, D. Ma, A. Docef, S. Myla, and L. Winter.

REFERENCES [23] J.H. Friedman, “Flexible Metric Nearest Neighbor Classification,” technical report, Dept. of Statistics, Stanford Univ., Nov. 1994. [1] S. Winawer, R. Fletcher, D. Rex, J. Bond, R. Burt, J. Ferrucci, [24] T. Hastie and R. Tibshirani, “Discriminant Adaptive Nearest T. Ganiats, T. Levin, S. Woolf, D. Johnson, L. Kirk, S. Litin, Neighbor Classification,” IEEE Trans. Pattern Analysis and M achine and C. Simmang, “Colorectal Cancer Screening and Surveil- Intelligence, vol. 18, no. 6, pp. 607-616, June 1996. lance: Clinical Guidelines and Rationale-Update Based on New [25] D.G. Lowe, “Similarity Metric Learning for a Variable-Kernel Evidence,” Gastroenterology, vol. 124, pp. 544-560, Feb. 2003. Classifier,” Neural Computation, vol. 7, no. 1, pp. 72-85, 1995. [2] K.D.Bodily,J.G.Fletcher,T.Engelby,M.Percival,J.A.Christensen, [26] J. Peng, D.R. Heisterkamp, and H.K. Dai, “Adaptive Kernel Metric B. Young, A.J. Krych, D.C. Vander Kooi, D. Rodysill, J.L. Fidler, Nearest Neighbor Classification,” Proc. 16th Int’l Conf. Pattern and C.D. Johnson, “Nonradiologists as Second Readers for Recognition, vol. 3, pp. 33-36, 11-15 Aug. 2002. Intraluminal Findings at CT Colonography,” A cademi c Radi ol ogy, [27] Q.B. Gao and Z.Z. Wang, “Center-Based Nearest Neighbor vol. 12, pp. 67-73, Jan. 2005. Classifier,” PatternRecognition, vol. 40, pp. 346-349, 2007. [3] J.G. Fletcher, F. Booya, C.D. Johnson, and D. Ahlquist, “CT [28] S. Li and J. Lu, “Face Recognition Using the Nearest Feature Line Colonography: Unraveling the Twists and Turns,” Current Method,” I EEE Tr ans. N eur al N et wor ks, vol. 10, no. 2, pp. 439-443, Opi ni on i n Gast r oent er ol ogy, vol. 21, pp. 90-8, Jan. 2005. Mar. 1999. [4] H. Yoshida and A.H. Dachman, “CAD Techniques, Challenges, and Controversies in Computed Tomographic Colonography,” [29] P. Vincent and Y. Bengio, “K-local Hyperplane and Convex Abdominal Imaging, vol. 30, pp. 26-41, Jan./ Feb. 2005. Distance Nearest Neighbor Algorithms,” Pr oc. A dvances i n [5] H.YoshidaandJ.Na¨ppi, “Three-Dimensional Computer-Aided N eur al I nfor mat i on Pr ocessi ng Syst ems Conf., vol. 14, pp. 985- Diagnosis Scheme for Detection of Colonic Polyps,” IEEE Trans. 992, 2002. M edical Imaging, vol. 20, no. 12, pp. 1261-1274, Dec. 2001. [30] W. Zheng, L. Zhao, and C. Zou, “Locally Nearest Neighbor [6] R.M. Summers, C.F. Beaulieu, L.M. Pusanik, J.D. Malley, R.B. Classifiers for Pattern Classification,” PatternRecognition, vol. 37, JeffreyJr.,D.I.Glazer,andS.Napel,“AutomatedPolypDetector pp. 1307-1309, 2004. for CT Colonography: Feasibility Study,” Radiology, vol. 216, [31] T. Damoulas and M.A. Girolami, “Probabilistic Multi-Class Multi- pp. 284-290, 2000. Kernel Learning: On Protein Fold Recognition and Remote [7] R.M. Summers, M. Franaszek, M.T. Miller, P.J.Pickhardt, J.R. Choi, Homology Detection,” vol. 24, no. 10, pp. 1264-1270, 2008. andW.R.Schindler,“Computer-AidedDetectionofPolypson [32] J.Ye, S.Ji, and J.Chen, “Multi-Class Discriminant Kernel Learning Oral Contrast-Enhanced CT Colonography,” Am. J. Roentgenology, Via Convex Programming,” J. Machine Learning Research, vol. 9, vol. 184, pp. 105-108, Jan. 2005. pp. 719-758, 1999. [8] G. Kiss, J. Van Cleynenbreugel, M. Thomeer, P. Suetens, and [33] B. Souza and A. de Carvalho, “Gene Selection Based on Multi- G. Marchal, “Computer-Aided Diagnosis in Virtual Colono- Class Support Vector Machines and Genetic Algorithms,” graphy via Combination of Surface Normal and Sphere Fitting MolecularResearch, vol. 4, no. 3, pp. 599-607, 2005. 
Methods,” Eur opean Radi ol ogy, vol. 12, pp. 77-81, Jan. 2002. [34] B. Sch o¨lkopf and A.J. Smola, Learning with Kernels, pp. 211-214, [9] D.S.Paik,C.F.Beaulieu,G.D.Rubin,B.Acar,R.B.JeffreyJr.,J.Yee, MIT Press, 2002. J. Dey, and S. Napel, “Surface Normal Overlap: A Computer- [35] H. Xiong, Y. Zhang, and X.-W. Chen, “Data-Dependent Kernel Aided Detection Algorithm with Application to Colonic Polyps Machines for Microarray Data Classification,” IEEE/ACM Trans. and Lung Nodules in Helical CT,” IEEE Trans. M edical Imaging, Computational Biology and Bioinformatics, vol. 4, no. 4, pp. 583-595, vol. 23, no. 6, pp. 661-675, June 2004. Oct.-Dec. 2007. [10] A.K. Jerebko, R.M. Summers, J.D. Malley, M. Franaszek, and [36] H. Xiong, M.N.S. Swamy, and M.O. Ahmad, “Optimizing the C.D. Johnson, “Computer-Assisted Detection of Colonic Polyps Data-Dependent Kernel in the Empirical Feature Space,” IEEE with CT Colonography Using Neural Networks and Binary Tr ans. N eur al N et wor ks, vol. 16, no. 2, pp. 460-474, Mar. 2005. Classification Trees,” M edi cal Physi cs, vol. 30, pp. 52-60, Jan. [37] G.C. Cawley, “MATLAB Support Vector Machine Toolbox, School 2003. of Information Systems, Univ. of East Anglia,” http:/ / theoval. [11] J. Na¨ppi and H. Yoshida, “Fully Automated Three-Dimensional sys.uea.ac.uk/ ~gcc/ svm/ toolbox, 2000. Detection of Polyps in Fecal-Tagging CT Colonography,” Academic [38] Y. Raviv and N. Intrator, “Bootstrapping with Noise: An Radiology, vol. 14, pp. 287-300, Mar. 2007. Efficient Regularization Technique,” Connection Science, vol. 8, [12]A.K.Jerebko,J.D.Malley,M.Franaszek,andR.M.Summers, pp. 355-372, 1996. “Multiple Neural Network Classification Scheme for Detection of [39] H .A. Bu ch h old t, Structural Dynamics for Engineers. Thomas Colonic Polyps in CT Colonography Data Sets,” Academic Telford, 1997. Radiology, vol. 10, pp. 154-160, Feb. 2003. [40] R.O.Duda,P.E.Hart,andD.G.Stork,PatternClassification, second [13]A.K.Jerebko,J.D.Malley,M.Franaszek,andR.M.Summers, ed. John Wiley & Sons Inc., 2001. “Support Vector Machines Committee Classification Method for [41] Pothin and Richard, “A Greedy Algorithm for Optimizing the Computer-Aided Polyp Detection in CT Colonography,” Academic Kernel Alignment and the Performance of Kernel Machines,” Proc. Radiology, vol. 12, pp. 479-486, Apr. 2005. 14th European Signal Processing Conf. (EUSIPCO ’06), Sept. 2006. [14] V.N. Vapnik, The Nature of Statistical Learning Theory, second ed. [42] J. Kandola, J. Shawe-Taylor, and N. Cristianini, “Optimizing Springer, 2000. Kernel Alignment over Combinations of Kernels,” technical report [15] R. Courant and D. Hilbert, Methodsof Mathematical Physics, Vol. 1 NeuroCOLT, 2002. edition, Wiley-VCH, 1989. [16] O.L. Mangasarian, A.J. Smola, and B. Scho¨lkopf, “Sparse Kernel [43] M.E. Tipping, “Spare Kernel Principal Component Analysis,” Feature Analysis,” Technical Report 99-04, Univ. of Wisconsin, Pr oc. N eur al I nfor mat i on Pr ocessi ng Syst ems Conf. (N I PS ’ 00), 1999. pp. 633-639, 2000. [17] J. Franc and V. Hlavac, “Statistical Pattern Recognition Toolbox [44] H . Fr o¨hlich, O. Chappelle, and B. Scholkopf, “Feature forMatlab,” http://cmp.felk.cvut.cz/cmp/software/stprtool/, Selection for Support Vector Machines by Means of Genetic 2004. Algorithm,” Proc. IEEE 15th Int’l Conf. Tools with Artificial [18] “Partners Research Computing,” http:/ / www.partners.org/ Intelligence, pp. 142-148, 2003. rescomputing/ , 2006. [45] S. Mika, G. Ratsch, and J. Weston, “Fisher Discriminant Analysis [19] A.J. 
Sm ola and B. Scho¨lkopf, “Sparse Greedy Matrix Approx- with Kernels,” Pr oc. N eur al N et wor ks for Si gnal Pr ocessi ng Wor kshop, imation for Machine Learning,” Proc. 17th Int’l Conf. Machine pp. 41-48, 1999. Learning, 2000. [46] X.W. Chen, “Gene Selection for Cancer Classification Using [20] X. Jiang, R.R. Snapp, Y. Motai, and X. Zhu, “Accelerated Kernel Bootstrapped Genetic Algorithms and Support Vector Machines,” Feature Analysis,” Pr oc. I EEE CS Conf. Comput er V i si on and Pat t er n Pr oc. I EEE I nt ’ l Conf. Comput at i onal Syst ems, Bi oi nfor mat i cs, pp. 504- Recognition, pp. 109-116, 2006. 505, 2003. [21] V. France and V. Hlavaˇc, “Greedy Algorithm for a Training set [47] C. Park and S.B. Cho, “Genetic Search for Optimal Ensemble of Reduction in the Kernel Methods,” Comput er A nal ysi s of I mage and Feature-Classifier Pairs in DNA Gene Expression Profiles,” Proc. Patterns, vol. 2756, pp. 426-433, 2003. Int’l Joint Conf. Neural Networks, vol. 3, pp. 1702-1707, 2003. [22] K. Fukunaga and L. Hostetler, “Optimization of K-Nearest [48] F.A. Sadjadi, “Polarimetric Radar Target Classification Using Neighbor Density Estimates,” I EEE Tr ans. I nfor mat i on Theor y, Support Vector Machines,” Optical Eng., vol. 47, no. 4, p. 046201, vol. IT-19, no. 3, pp. 320-326, May 1973. Apr. 2008. MOTAI AND YOSHIDA: PRINCIPAL COMPOSITE KERNEL FEATURE ANALYSIS: DATA-DEPENDENT KERNEL APPROACH 1875

[49] T. Briggs and T. Oates, “Discovering Domain Specific Composite [71] C. Cortes, M. Mohri, and A. Rostamizadeh, “L2 Regularization for Kernels” Proc. 20th Nat’l Conf. Artificial Intelligenceand 17th Ann. Learning Kernels,” Pr oc. 25t h Conf. i n U ncer t ai nt y i n A r t i fi ci al Conf. InnovativeApplicationsArtificial Intelligence, pp. 732-739, 2005. Intelligence, 2009. [50] N. Cristianini, J. Kandola, A. Elisseeff, and J. Shawe-Taylor, “On [72] D.J.Newman,S.Hettich,C.L.Blake,andC.J.Merz,U CI Reposi t or y Kernel Target Alignment,” Pr oc. N eur al I nfor mat i on Pr ocessi ng of M achi ne Lear ni ng D at abases, URL http:/ / www.ics.uci.edu/ Syst ems Conf. (N I PS ’ 01), pp. 367-373. mlearn/ MLRepository.html. 1998. [51]M.A.Shipp,K.N.Ross,P.Tamayo,A.P.Weng,J.L.Kutok,R.C.T. [73] K. Chaudhuri, C. Monteleoni, and A.D. Sarwate, “Differentially Aguiar, M. Gaasenbeek, M. Angelo, M. Reich, G.S. Pinkus, T.S. Private Empirical Risk Minimization,” J. Machine Learning Re- Ray,M.A.Koval,K.W.Last,A.Norton,T.A.Lister,J.Mesirov, sear ch, vol. 12, pp. 1069-1109, Mar. 2011. D.S. Neuberg, E.S. Lander, J.C. Aster, and T.R. Golub, “Diffuse [74] V. Vapnik and O. Chapelle, “Bounds on Error Expectation for Large B-Cell Lymphoma Outcome Prediction by Gene Expression Support Vector Machines,” Neural Computation, vol. 12, no. 9, Profiling and Supervised Machine Learning,” Nature Medicine, pp. 2013-2036, Sept. 2000. vol. 8, pp. 68-74, 2002. [75] O. Chapelle, V. Vapnik, O. Bousquet, and S. Mukherjee, [52] S. Dudoit, J. Fridlyand, and T.P. Speed, “Comparison of “Choosing Multiple Parameters for Support Vector Machines,” Discrimination Method for the Classification of Tumor Using MachineLearning, vol. 46, pp. 131-159, 2002. Gene Expression Data,” J. A m. St at i st i cal A ssoc., vol. 97, pp. 77-87, [76] K.M. Chung, W.C. Kao, C.-L. Sun, L.-L. Wang, and C.J. Lin, 2002. “Radius Margin Bounds for Support Vector Machines with the RBF Kernel,” Neural Computation, vol. 15, no. 11, pp. 2643-2681, [53] C. Sotiriou et al., “Gene Expression Profiling in Breast Cancer: Nov. 2003. Understanding the Molecular Basis of Histologic Grade to [77]J.Yang,Z.Jin,J.Y.Yang,D.Zhang,andA.F.Frangi,“Essenceof Improve Prognosis,” J. N at ’ l Can cer I n st . , vol. 98, no. 4, pp. 262- Kernel Fisher Discriminant: KPCA Plus LDA,” Pat t er n Recogn i t i on , 272, Feb. 2006. vol. 37, pp. 2097-2100, 2004. [54] U.R. Chandran et al., “Gene Expression Profiles of Prostate Cancer Reveal Involvement of Multiple Molecular Pathways in the Yuichi Motai (M’01) received the BEng degree Metastatic Process,” BM C Cancer, pp. 7-64, Apr. 2007. in instrumentation engineering from Keio Uni- [55] Y.P. Yu et al., “Gene Expression Alterations in Prostate Cancer versity, Tokyo, Japan, in 1991, the MEng Predicting Tumor Aggression and Preceding Development of degree in applied systems science from Kyoto Malignancy,” J. Clinical Oncology, vol. 22, no. 14, pp. 2790-2799, University, Japan, in 1993, and the PhD degree July 2004. PMID: 15254046. in electrical and computer engineering from [56] W. Zheng, C. Zou, and L. Zao, “An Improved Algorithm for Purdue University, West Lafayette, Indiana, in Kernel Principal Component Analysis,” Neural Processing Letters, 2002. He is currently an associate professor of vol. 22, no. 1, pp. 49-56, 2005. electrical and computer engineering at Virginia [57] C. Ding and I. Dubchak, “Multi-Class Protein Fold Recognition Commonwealth University, Richmond. 
His re- Using Support Vector Machines and Neural Networks,” Bi oi nfor - search interests include the broad area of sensory intelligence, matics, vol. 17, no. 4, pp. 349-358, Apr. 2001. particularly in online classification and target tracking, applied to [58] P. Pavlidis et al., “Learning Gene Functional Classifications from medical imaging, pattern recognition, , and robotics. Multiple Data Types,” J. Computational Biology, vol. 9, no. 2, He is a senior member of the IEEE. pp. 401-411, 2002. [59] O. Chapelle et al., “Choosing Multiple Parameters for Support Hiroyuki Yoshida (M’96) received the BS and Vector Machines,” MachineLearning, vol. 46, nos. 1-3, pp. 131-159, MS degrees in physics and the PhD degree in 2002. information science from the University of Tokyo, [60] C.S. Ong, A.J. Smola, and R.C. Williamson, “Learning the Kernel Japan, in 1989. He previously held an assistant with Hyperkernels,” J. M achi n e L ear n i n g Resear ch, vol. 6, pp. 1043- professorship in the Department of Radiology at 1071, July 2005. the University of Chicago. He was a tenured associate professor when he left the university [61] G.R.G. Lanckriet, N. Cristianini, P. Bartlett, L.E. Ghaoui, and M.I. and joined the Massachusetts General Hospital Jordan, “Learning the Kernel Matrix with Semidefinite Program- (MGH) and Harvard Medical School (HMS) in ming,” J.MachineLearning Research, vol. 5, pp. 27-72, 2004. 2005, where he is currently the director of 3D [62] V. Atluri, S. Jajodia, and E. Bertino, “Transaction Processing in Imaging Research in the Department of Radiology, MGH, and an Multilevel Secure Databases with Kernelized Architecture: Chal- associate professor of radiology at HMS. His research interests include lenges and Solutions,” IEEE Trans. Knowledge and Data Eng., vol. 9, computer-aided diagnosis and quantitative medical imaging, in particular no. 5, pp. 697-708, Sept./ Oct. 1997. the diagnosis of colorectal cancers in CT colonography, for which he has [63]H.Cevikalp,M.Neamtu,andA.Barkana,“TheKernelCommon been the principal investigator on eight national research projects funded Vector Method: A Novel Nonlinear Subspace Classifier for Pattern by the National Institutes of Health and two cancer projects by the Recognition,” I EEE Tr ans. Syst ems, M an, and Cyber net i cs, Par t B: American Cancer Society, and received nine awards (Magna Cum Cybernetics, vol. 37, no. 4, pp. 937-951, Aug. 2007. Laude, two Cum Laude, four Certificate of Merit, and two Excellences in [64] G. Horvath and T. Szabo, “Kernel CMAC With Improved Design) from the Annual Meetings of Radiological Society of North Capability,” I EEE Tr ans. Syst ems, M an, and Cyber net i cs, Par t B: America and Honorable Mention award from the SPIE Medical Imaging. Cybernetics, vol. 37, no. 1, pp. 124-138, Feb. 2007. He is the author and coauthor of more than 100 papers and 16 book [65] C. Heinz and B. Seeger, “Cluster Kernels: Resource-Aware Kernel chapters, author and editor of 10 books, and inventor and coinventor of Density Estimators over Streaming Data,” I EEE Tr ans. Knowl edge eight patents. He was a guest editor of the IEEE Transaction on Medical and D at a Eng., vol. 20, no. 7, pp. 880-893, July 2008. Imaging in 2004, and currently serves on the editorial boards of the [66] S. Amari and S. Wu, “Improving Support Vector Machine International Journal of Computer Assisted Radiology, the International Classifiers by Modifying Kernel Functions,” Neural Network, Journal of Computers in Healthcare, and the Intelligent Decision vol. 
12, no. 6, pp. 783-789, 1999. Technology: An International Journal. He is a member of the IEEE. [67] Y. Tan and J. Wang, “A Support Vector Machine with a Hybrid Kernel and Minimal Vapnik-Chervonenkis Dimension,” IEEE Tr ans. Knowl edge and D at a Eng., vol. 16, no. 4, pp. 385-395, Apr. . For more information on this or any other computing topic, 2004. please visit our Digital Library at www.computer.org/publications/dlib. [68] A. Rakotomamonjy, F.R. Bach, S. Canu, and Y. Grandvalet, “SimpleMKL,” J.MachineLearning Research, vol. 9, pp. 2491-2521, Nov. 2008. [69] M. Varma and B.R. Babu, “More Generality in Efficient Multiple Kernel Learning,” Proc. Int’l Conf. Machine Learning, June 2009. [70]K.Duan,S.S.Keerthi,andA.N.Poo,“EvaluationofSimple Performance Measures for Tuning SVM Hyperparameters,” N eur ocomput i ng, vol. 51, pp. 41-59, Apr. 2003.