Principal Composite Kernel Feature Analysis: Data-Dependent Kernel Approach

Yuichi Motai, Senior Member, IEEE, and Hiroyuki Yoshida, Member, IEEE

Y. Motai is with the Sensory Intelligent Laboratory, Department of Electrical and Computer Engineering, School of Engineering, Virginia Commonwealth University, PO Box 843068, 601 West Main Street, Richmond, VA 23284-3068. E-mail: [email protected].
H. Yoshida is with 3D Imaging Research, Department of Radiology, Harvard Medical School and Massachusetts General Hospital, 25 New Chardon Street, Suite 400C, Boston, MA 02114. E-mail: [email protected].

Abstract—Principal composite kernel feature analysis (PC-KFA) is presented to show kernel adaptations for nonlinear features of medical image data sets (MIDS) in computer-aided diagnosis (CAD). The proposed algorithm PC-KFA extends existing studies on kernel feature analysis (KFA), which extracts salient features from a sample of unclassified patterns by use of a kernel method. The principal composite process of PC-KFA has been applied to kernel principal component analysis [34] and to our previously developed accelerated kernel feature analysis [20]. Unlike other kernel-based feature selection algorithms, PC-KFA iteratively constructs a linear subspace of a high-dimensional feature space by maximizing a variance condition for the nonlinearly transformed samples, which we call the data-dependent kernel approach. The resulting kernel subspace is first chosen by principal component analysis, and is then processed into a composite kernel subspace through efficient combination representations used for further reconstruction and classification. Numerical experiments based on several MID feature spaces of cancer CAD data have shown that PC-KFA generates an efficient and effective feature representation, and has yielded better classification performance for the proposed composite kernel subspace using a simple pattern classifier.

Index Terms—Principal component analysis, data-dependent kernel, nonlinear subspace, manifold structures

1 INTRODUCTION

CAPITALIZING on the recent success of kernel methods in pattern classification [62], [63], [64], [65], Schölkopf and Smola [34] developed and studied a feature selection algorithm in which principal component analysis (PCA) was effectively applied to a sample of n, d-dimensional patterns that are first injected into a high-dimensional Hilbert space using a nonlinear embedding. Heuristically, embedding input patterns into a high-dimensional space may elucidate salient nonlinear features in the input distribution, in the same way that nonlinearly separable classification problems may become linearly separable in higher dimensional spaces, as suggested by the Vapnik-Chervonenkis theory [14]. Both the PCA and the nonlinear embedding are facilitated by a Mercer kernel of two arguments k : R^d × R^d → R, which effectively computes the inner product of the transformed arguments. This algorithm, called kernel principal component analysis (KPCA), thus avoids the problem of representing transformed vectors in the Hilbert space, and enables the computation of the inner product of two transformed vectors of an arbitrarily high dimension in constant time. Nevertheless, KPCA has two deficiencies: 1) the computation of the principal components involves the solution of an eigenvalue problem that requires O(n³) computations, and 2) each principal component in the Hilbert space depends on every one of the n input patterns, which defeats the goal of obtaining both an informative and concise representation.

Both of these deficiencies have been addressed in subsequent investigations that seek sets of salient features that only depend upon sparse subsets of transformed input patterns. Tipping [43] applied a maximum-likelihood technique to approximate the transformed covariance matrix in terms of such a sparse subset. Franc and Hlaváč [21] proposed a greedy method, which approximates the mapped space representation by selecting a subset of input data. It iteratively extracts the data in the mapped space until the reconstruction error in the mapped high-dimensional space falls below a threshold value. Its computational complexity is O(nm³), where n is the number of input patterns and m is the cardinality of the subset. Zheng et al. [56] split the input data into M groups of similar size, and then applied KPCA to each group. A set of eigenvectors was obtained for each group. KPCA was then applied to a subset of these eigenvectors to obtain a final set of features. Although these studies proposed useful approaches, none provided a method that is both computationally efficient and accurate.

To avoid the O(n³) eigenvalue problem, Mangasarian et al. [16] proposed sparse kernel feature analysis (SKFA), which extracts l features, one by one, using an l1-constraint on the expansion coefficients. SKFA requires only O(l²n²) operations, and is, thus, a significant improvement over KPCA if the number of dominant features is much less than the data size. However, if l > √n, then the computational cost of SKFA is likely to exceed that of KPCA.

In this paper, we propose an accelerated kernel feature analysis (AKFA) that generates l sparse features from a data set of n patterns using O(ln²) operations. Since AKFA is based on both KPCA and SKFA, we analyze the former algorithms, that is, KPCA and SKFA, and then describe AKFA in Section 2.

We have evaluated other existing multiple kernel learning (MKL) approaches [66], [68], and found that those approaches do not rely much on the data sets to combine and choose the kernel functions. The choice of an appropriate kernel function reflects prior knowledge concerning the problem at hand. However, it is often difficult to exploit prior knowledge on patterns for choosing a kernel function, and how to choose the best kernel function for a given data set remains an open question. According to the no free lunch theorem [40] on machine learning, there is no superior kernel function in general, and the performance of a kernel function depends on the application, specifically on the data sets. The five kernel functions, linear, polynomial, Gaussian, Laplace, and sigmoid, are chosen because they are known to have good performance [40], [41], [42], [43], [44], [45].

The main contribution of this paper is principal composite kernel feature analysis (PC-KFA), described in Section 3. In this new approach, kernel adaptation is employed in the kernel algorithms above, KPCA and AKFA, in the form of the best kernel selection, the engineering of a composite kernel as a combination of data-dependent kernels, and the choice of the optimal number of kernels to combine. Other MKL approaches combine basic kernels, but our proposed PC-KFA specifically chooses data-dependent kernels as linear composites.

In Section 4, we summarize numerical evaluation experiments based on medical image data sets (MIDs) in computer-aided diagnosis (CAD) using the proposed PC-KFA:

1. to choose the kernel function,
2. to evaluate feature representation by calculating reconstruction errors,
3. to choose the number of kernel functions,
4. to composite the multiple kernel functions,
5. to evaluate feature classification using a simple classifier, and
6. to analyze the computation time.

Our conclusions appear in Section 5.

2 KERNEL FEATURE ANALYSIS

2.1 Kernel Basics

Using Mercer's theorem [15], a nonlinear, positive-definite kernel k : R^d × R^d → R of an integral operator can be computed by the inner product of the transformed vectors ⟨φ(x), φ(y)⟩, where φ : R^d → H denotes a nonlinear embedding (induced by k) into a possibly infinite-dimensional Hilbert space H. Given n sample points in the domain X_n = {x_i ∈ R^d | i = 1, ..., n}, the image Y_n = {φ(x_i) | i = 1, ..., n} of X_n spans a linear subspace of at most (n − 1) dimensions. By mapping the sample points into a higher dimensional space H, the dominant linear correlations in the distribution of the image Y_n may elucidate important nonlinear dependences in the original data sample X_n. This is beneficial because it permits making PCA nonlinear without complicating the original PCA algorithm. Let us introduce the kernel matrix K as a Hermitian and positive semi-definite matrix that computes the inner product between any finite sequence of inputs x := {x_j : j ∈ N_n} and is defined as

K := (K(x_i, x_j) : i, j ∈ N_n) = (φ(x_i) · φ(x_j)).

Commonly used kernel matrices are as follows [34]:

- The linear kernel:
  K(x, x_i) = x^T x_i,  (1)

- The polynomial kernel:
  K(x, x_i) = (x^T x_i + offset)^d,  (2)

- The Gaussian RBF kernel:
  K(x, x_i) = exp(−‖x − x_i‖² / 2σ²),  (3)

- The Laplace RBF kernel:
  K(x, x_i) = exp(−σ‖x − x_i‖),  (4)

- The sigmoid kernel:
  K(x, x_i) = tanh(κ₀ x^T x_i + κ₁),  (5)

- The ANOVA RB kernel:
  K(x, x_i) = Σ_{k=1}^n exp(−σ(x^k − x_i^k)²)^d,  (6)

- The linear spline kernel in one dimension:
  K(x, x_i) = 1 + x x_i min(x, x_i) − ((x + x_i)/2) min(x, x_i)² + min(x, x_i)³/3.  (7)

Kernel selection is heavily dependent on the specific data set. Currently, the most commonly used kernel functions are the Gaussian and Laplace RBF for general purposes when prior knowledge of the data is not available. The Gaussian kernel avoids the sparse distribution, while a high-degree polynomial kernel may cause the space distribution in a large feature space. The polynomial kernel is widely used in image processing, while ANOVA RB is often used for regression tasks. The spline kernels are useful for continuous signal processing algorithms that involve B-spline inner products or the convolution of several spline basis functions. Thus, in this paper, we will adopt only the first five kernels in (1)-(5).
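For readers who want to experiment with the candidate kernels, the short sketch below builds the Gram matrices of the five kernels (1)-(5) with NumPy. It is only an illustrative sketch, not the authors' implementation; the function names and the default values of offset, d, σ, κ₀, and κ₁ are assumptions chosen to make the example runnable.

```python
import numpy as np

def linear_kernel(X, Y):
    # (1): K(x, x_i) = x^T x_i
    return X @ Y.T

def polynomial_kernel(X, Y, offset=1.0, degree=3):
    # (2): K(x, x_i) = (x^T x_i + offset)^d  (degree/offset are illustrative)
    return (X @ Y.T + offset) ** degree

def gaussian_rbf_kernel(X, Y, sigma=1.0):
    # (3): K(x, x_i) = exp(-||x - x_i||^2 / (2 sigma^2))
    sq = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T
    return np.exp(-np.maximum(sq, 0.0) / (2 * sigma**2))

def laplace_rbf_kernel(X, Y, sigma=1.0):
    # (4): K(x, x_i) = exp(-sigma ||x - x_i||)
    sq = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T
    return np.exp(-sigma * np.sqrt(np.maximum(sq, 0.0)))

def sigmoid_kernel(X, Y, kappa0=1.0, kappa1=0.0):
    # (5): K(x, x_i) = tanh(kappa0 x^T x_i + kappa1)
    return np.tanh(kappa0 * (X @ Y.T) + kappa1)

# Example: Gram matrices of the five candidate kernels for a random sample.
X = np.random.default_rng(0).normal(size=(50, 8))
grams = {
    "linear": linear_kernel(X, X),
    "polynomial": polynomial_kernel(X, X),
    "gaussian": gaussian_rbf_kernel(X, X),
    "laplace": laplace_rbf_kernel(X, X),
    "sigmoid": sigmoid_kernel(X, X),
}
```

Any of these Gram matrices can then be centered and decomposed as described in Section 2.2.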
A choice of appropriate kernel functions as a generic learning element has been a major problem, since classification accuracy itself heavily depends on the kernel selection. For example, Amari and Wu [66] modified the kernel function by extending the Riemannian geometry structure induced by the kernel. Souza and Carvalho [33] proposed selecting the hyperplane parameters by using k-fold cross validation and leave-one-out criteria. Ding and Dubchak [57] proposed an ad hoc ensemble learning approach where multiclass k-nearest neighborhood classifiers were individually trained on each feature space and later combined. Damoulas and Girolami [31] proposed the use of four additional feature groups to replace the amino-acid composition. Pavlidis et al. [58] performed feature selection for SVMs by combining the feature scaling technique with the leave-one-out error bound. Chapelle et al. [59] tuned multiple parameters for two-norm SVMs by minimizing the radius-margin bound or the span bound. Ong et al. [60] applied semidefinite programming to learn the kernel function by hyperkernels. Lanckriet et al. [61] designed the kernel matrix directly by semidefinite programming.

MKL has been considered as a solution to make the kernel choice in a feasible manner. Amari and Wu [66] proposed a method of modifying a kernel function to improve the performance of a support vector machine classifier based on the Riemannian geometrical structure induced by the kernel function. The idea was to enlarge the spatial resolution around the separating boundary surface by a conformal mapping such that the separability between classes can be increased in the kernel space. The experimental results showed remarkable improvement in generalization errors. Rakotomamonjy et al. [68] adopted an MKL method to learn a kernel and the associated predictor in supervised learning settings at the same time. This study illustrated the usefulness of MKL for some regressions based on wavelet kernels and for some model selection problems related to multiclass classification.

In this paper, we propose a single multiclass kernel machine that is able to operate on all groups of features simultaneously and adaptively combine them. This new framework provides a new and efficient way of incorporating multiple feature characteristics without increasing the number of required classifiers. The proposed approach is based on the ability to embed each object description [47] via the kernel trick into a kernel Hilbert space. This process applies a similarity measure to every feature space. We show in this paper that these similarity measures can be combined in the form of the composite kernel space. We design a new single multiclass kernel machine that can operate on composite spaces effectively by evaluating principal components of the number of kernel feature spaces. A hierarchical multiclass model enables us to learn the significance of each source/feature space, and the predictive term computed by the corresponding kernel weights may provide the regressors and the kernel parameters without resorting to ad hoc ensemble learning, the combination of binary classifiers, or unnecessary parameter tuning.
2.2 Kernel Principal Component Analysis

KPCA uses a Mercer kernel [34] to perform a linear PCA. The gray-level image of X_n of computed tomographic colonography (CTC) has been centered so that the scatter matrix of the data is given by S = Σ_{i=1}^n φ(x_i) φ(x_i)^T. Eigenvalues λ_j and eigenvectors e_j are obtained by solving

λ_j e_j = S e_j = Σ_{i=1}^n φ(x_i) φ(x_i)^T e_j = Σ_{i=1}^n ⟨e_j, φ(x_i)⟩ φ(x_i),  (8)

for j = 1, ..., n. Since φ is not known, (8) must be solved indirectly, as proposed in the next section. Let us introduce the inner product of the transformed vectors by

a_{ji} = (1/λ_j) ⟨e_j, φ(x_i)⟩,

where

e_j = Σ_{i=1}^n a_{ji} φ(x_i).  (9)

Multiplying by φ(x_q)^T on the left, for q = 1, ..., n, and substituting yields

λ_j ⟨φ(x_q), e_j⟩ = Σ_{i=1}^n ⟨e_j, φ(x_i)⟩ ⟨φ(x_q), φ(x_i)⟩.  (10)

Substitution of (9) into (10) produces

λ_j ⟨φ(x_q), Σ_{i=1}^n a_{ji} φ(x_i)⟩ = Σ_{i=1}^n ( Σ_{k=1}^n a_{jk} ⟨φ(x_k), φ(x_i)⟩ ) ⟨φ(x_q), φ(x_i)⟩,  (11)

which can be rewritten as λ_j K a_j = K² a_j, where K is an n × n Gram matrix with elements k_{ij} = ⟨φ(x_i), φ(x_j)⟩ and a_j = [a_{j1} a_{j2} ... a_{jn}]^T. The latter is a dual eigenvalue problem equivalent to the problem

λ_j a_j = K a_j.  (12)

Note that ‖a_j‖² = 1/λ_j. For example, we may choose a Gaussian kernel such as

k_{ij} = ⟨φ(x_i), φ(x_j)⟩ = exp(−‖x_i − x_j‖² / 2σ²).  (13)

Please note that if the image of X_n (the finite sequence of inputs x := {x_j : j ∈ N_n}) is not centered in the Hilbert space, we need to use the centered Gram matrix K̂ deduced by Mangasarian et al. [16] by applying

K̂ = K − KT − TK + TKT,  (14)

where K is the Gram matrix of the uncentered data, and T is the n × n matrix whose entries are all equal to 1/n.

Let us keep the l eigenvectors associated with the l largest eigenvalues; then we can reconstruct data in the mapped space as φ'_i = Σ_{j=1}^l ⟨φ_i, e_j⟩ e_j = Σ_{j=1}^l β_{ji} e_j, where β_{ji} = ⟨φ_i, Σ_{k=1}^n a_{jk} φ_k⟩ = Σ_{k=1}^n a_{jk} k_{ik}. For the experimental evaluation, we introduce the squared reconstruction error of each data point φ_i, i = 1, ..., n,

Err_i = ‖φ_i − φ'_i‖² = k_{ii} − Σ_{j=1}^l β_{ji}².
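The following sketch shows one way to carry out the KPCA computation of (12)-(14) and the reconstruction error Err_i numerically. It is a minimal NumPy illustration of the formulas above, not the authors' Matlab code; the function name and the tolerance used to drop near-zero eigenvalues are assumptions.

```python
import numpy as np

def kpca_reconstruction_error(K, n_components):
    """Per-sample squared reconstruction error k_ii - sum_j beta_ji^2
    for KPCA with the n_components largest eigenvalues."""
    n = K.shape[0]
    T = np.full((n, n), 1.0 / n)
    Kc = K - K @ T - T @ K + T @ K @ T          # centering, cf. (14)
    lam, A = np.linalg.eigh(Kc)                  # dual eigenproblem, cf. (12)
    lam, A = lam[::-1], A[:, ::-1]               # sort eigenvalues descending
    keep = lam[:n_components] > 1e-12            # drop numerically zero modes (assumed tolerance)
    lam, A = lam[:n_components][keep], A[:, :n_components][:, keep]
    A = A / np.sqrt(lam)                         # enforce ||a_j||^2 = 1 / lambda_j
    B = Kc @ A                                   # beta_{ji} = sum_k a_{jk} k_{ik}
    err = np.diag(Kc) - np.sum(B ** 2, axis=1)   # Err_i = k_ii - sum_j beta_ji^2
    return err, lam
```

The mean of err reproduces, up to numerical precision, the quantity MErr = (1/n) Σ_{i=l+1}^n λ_i derived next.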

The mean square error is MErr = (1/n) Σ_{i=1}^n Err_i. Using (12), β_{ji} = λ_j a_{ji}. Therefore, the mean square reconstruction error is

MErr = (1/n) Σ_{i=1}^n ( k_{ii} − Σ_{j=1}^l λ_j² a_{ji}² ).

Since Σ_{i=1}^n k_{ii} = Σ_{i=1}^n λ_i and Σ_{i=1}^n a_{ji}² = ‖a_j‖² = 1/λ_j, we have MErr = (1/n) Σ_{i=l+1}^n λ_i.

The KPCA algorithm contains an eigenvalue problem of rank n, so the computational complexity of KPCA is O(n³). In addition, each resulting eigenvector is represented as a linear combination of n terms; the l features depend on all n image vectors of X_n. Thus, all data contained in X_n must be retained, which is computationally cumbersome and unacceptable for our applications.

2.3 Accelerated Kernel Feature Analysis

AKFA [20] is the method that we have proposed to improve the efficiency and accuracy of SKFA [16]. SKFA improves the computational costs of KPCA, associated with both time complexity and data retention requirements. SKFA was introduced in [16] and is summarized in three steps.

AKFA [20] has been proposed by the authors in an attempt to achieve further improvements: 1) it saves computation time by iteratively updating the Gram matrix, 2) it normalizes the images with the l2 constraint before the l1 constraint is applied, and 3) it optionally discards data that fall below a magnitude threshold during updates.

To achieve the computational efficiency described in "1," instead of extracting features directly from the original mapped space, AKFA extracts the ith feature based on the ith updated Gram matrix K^i, where each element is k^i_{jk} = ⟨φ^i_j, φ^i_k⟩.

The second improvement described in "2" above is to revise the l1 constraint. SKFA treats each individual sample as a possible direction and computes the projection variances with all data. Since SKFA includes its length in its projection variance calculation, it is biased toward selecting vectors with larger magnitude. We are ultimately looking for a direction with unit length, and when we choose an image vector as a possible direction, we ignore the length and only consider its direction for the improved accuracy of the features.

The third improvement in "3" is to discard negligible data and thereby eliminate unnecessary computations.

AKFA is described in three steps and realizes the improvements 1-3 [20]. The vector φ'_i represents the reconstructed new data based on AKFA, and it can be calculated indirectly using the kernel trick:

φ'_i = Σ_{j=1}^l ⟨φ_i, ψ_j⟩ ψ_j = Φ_l C_l C_l^T K_i,

where K_i = [k_{i,idx(1)} k_{i,idx(2)} ... k_{i,idx(l)}]^T. Then the reconstruction error of new data φ_i, i = n+1, ..., n+m, is represented as

Err_i = ‖φ_i − φ'_i‖² = k_{ii} − K_i^T C_l C_l^T K_i.

The AKFA algorithm also contains an eigenvalue problem of rank n. Step 1 requires O(n²) operations, Step 2 requires O(l²), and Step 3 requires O(n²) for improvement 1, O(i²) for improvement 2, and O(n²) for improvement 3. The total computational complexity increases to O(ln²) when no data are truncated during the updates in AKFA.

2.4 Comparison of the Relevant Kernel Methods

Multiple kernel adoption and combination methods are derived from the principle of empirical risk minimization, which performs well in most applications. To assess the expected risk, there is an increasing amount of literature focusing on theoretical approximation error bounds with respect to the kernel selection problem, for example, empirical risk minimization, structural risk minimization, approximation error, the span bound, the Jaakkola-Haussler bound, the radius-margin bound, and kernel linear discriminant analysis. Table 1 lists some comparative methods among the multiple kernel methods.

TABLE 1
Overview of Method Comparison for Parameter Tuning

The kernel approaches listed in Table 1 have somewhat overlapping principles and (dis)advantages, depending on the nature of the data. The proposed method described in Section 3 has a tradeoff between computational time and accuracy, but outperformed those counterparts, even if bias of the data set exists due to cancer screening purposes. The proposed method works on the condition that it measures the adaptability of a kernel to the target data. The introduced alignment measure provides a practical objective for kernel optimization, as a method for measuring the fitness between a kernel and the learning task.

3 PRINCIPAL COMPOSITE KERNEL FEATURE ANALYSIS

3.1 Kernel Selections

For kernel-based learning algorithms, the key challenge lies in the selection of kernel parameters and regularization parameters. Many researchers have identified this problem and, thus, have tried to solve it. However, the few existing solutions lack effectiveness, and thus this problem is still under development or regarded as an open problem. To this end, we are developing a new framework of kernel adaptation. Our method exploits the idea presented in [35] and [36], by exploring the data-dependent kernel methodology as follows.

Let {x_i, y_i} (i = 1, 2, ..., n) be the n d-dimensional training samples of the given data, where y_i ∈ {+1, −1} represents the class labels of the samples. We develop a data-dependent kernel to capture the relationship among the data in this classification task by adopting the idea of "conformal mapping" [36]. To adopt this conformal transformation technique, the data-dependent composite kernel for r = 1, 2, 3, 4, 5 can be formulated as

k_r(x_i, x_j) = q_r(x_i) q_r(x_j) p_r(x_i, x_j),  (15)

where p_r(x_i, x_j) is one kernel among the five chosen kernels and q_r(·), the factor function, takes the following form for r = 1, 2, 3, 4, 5:

q_r(x_i) = α_{r,0} + Σ_{m=1}^n α_{r,m} k_0(x_i, x_m),  (16)

where k_0(x_i, x_m) = exp(−‖x_i − x_m‖² / 2σ²), and α_{r,m} is the combination coefficient for the variable x_m. Let us denote the vectors (q_r(x_1), q_r(x_2), ..., q_r(x_n))^T and (α_{r,0}, α_{r,1}, ..., α_{r,n})^T by q_r and α_r (r = 1, 2, 3, 4, 5), respectively; then we have q_r = K_0 α_r, where K_0 is an n × (n + 1) matrix given by

K_0 = [ 1  k_0(x_1, x_1)  ...  k_0(x_1, x_n)
        1  k_0(x_2, x_1)  ...  k_0(x_2, x_n)
        ...
        1  k_0(x_n, x_1)  ...  k_0(x_n, x_n) ].  (17)

Let the kernel matrices corresponding to k_r(x_i, x_j), p_1(x_i, x_j), and p_2(x_i, x_j) be K_r, P_1, and P_2, respectively. We can express the data-dependent kernel K_r as

K_r = [ q_r(x_i) q_r(x_j) p_r(x_i, x_j) ]_{n×n}.  (18)

Defining Q_r as the diagonal matrix of the elements {q_r(x_1), q_r(x_2), ..., q_r(x_n)}, we can express (18) in the matrix form

K_r = Q_r P_r Q_r.  (19)

This kernel model was first introduced in [32] and called "conformal transformation of a kernel." We now perform kernel optimization based on this method to find the appropriate kernels for the data set.

The optimization of the data-dependent kernel in (19) is to set the value of the combination coefficient vector α_r so that the class separability of the training data in the mapped feature space is maximized. For this purpose, the Fisher scalar is adopted as the objective function of our kernel optimization. The Fisher scalar measures the class separability of the training data in the mapped feature space and is formulated as

J = tr(S_br) / tr(S_wr),  (20)

where S_br represents the "between-class scatter matrix" and S_wr the "within-class scatter matrix." Suppose that the training data are grouped according to their class labels, i.e., the first n_1 data belong to one class and the remaining n_2 data belong to the other class (n_1 + n_2 = n). Then, the basic kernel matrix P_r can be partitioned as

P_r = [ P_11^r  P_12^r
        P_21^r  P_22^r ],  (21)

where the sizes of the submatrices P_11^r, P_12^r, P_21^r, P_22^r, r = 1, 2, 3, 4, 5, are n_1 × n_1, n_1 × n_2, n_2 × n_1, and n_2 × n_2, respectively. A close relation between the class separability measure J and the kernel matrices has been established as

J(α_r) = (α_r^T M_0r α_r) / (α_r^T N_0r α_r),  (22)

where

M_0r = K_0^T B_0r K_0 and N_0r = K_0^T W_0r K_0,  (23)

and, for r = 1, 2, 3, 4, 5,

B_0r = [ (1/n_1) P_11^r  0
         0  (1/n_2) P_22^r ] − (1/n) P_r,  (24)

W_0r = diag(p_11^r, p_22^r, ..., p_nn^r) − [ (1/n_1) P_11^r  0
                                             0  (1/n_2) P_22^r ].

To maximize J(α_r) in (22), the standard gradient approach is followed. If the matrix N_0r is nonsingular, the optimal α_r that maximizes J_r(α_r) is the eigenvector corresponding to the maximum eigenvalue of the system obtained by taking the derivatives of (22):

M_0r α_r = λ_r N_0r α_r.  (25)

The criterion for selecting the best kernel function is to find the kernel that produces the largest eigenvalue from (25), i.e.,

λ* = argmax_r λ(N_0r^{−1} M_0r).  (26)

The idea behind this is to choose the maximum eigenvector α_r corresponding to the maximum eigenvalue that can maximize J_r(α_r), which will result in the optimum solution. We find the maximum eigenvalues for all possible kernel functions and arrange them in descending order to choose the most optimum kernels, such as

λ_1 > λ_3 > λ_4 > λ_2 > λ_5.  (27)

We choose the kernels corresponding to the largest eigenvalues, λ_1 and so on, and form composite kernels corresponding to {λ_1, λ_3, ...} as follows.
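The conformal (data-dependent) kernel of (15)-(19) can be assembled directly from a basic Gram matrix P_r and a coefficient vector α_r. The sketch below is a hedged illustration of that construction only; the choice of σ for k_0 and the function names are our assumptions, and the optimization of α_r via (25) is not shown.

```python
import numpy as np

def k0_design_matrix(X, sigma=1.0):
    """K_0 of (17): an n x (n+1) matrix whose first column is all ones and whose
    remaining entries are k_0(x_i, x_m) = exp(-||x_i - x_m||^2 / (2 sigma^2))."""
    sq = np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2 * X @ X.T
    k0 = np.exp(-np.maximum(sq, 0.0) / (2 * sigma**2))
    return np.hstack([np.ones((X.shape[0], 1)), k0])

def data_dependent_kernel(P, K0, alpha):
    """K_r = Q_r P_r Q_r of (19), with q_r = K_0 alpha_r and Q_r = diag(q_r)."""
    q = K0 @ alpha                           # factor function values q_r(x_i), cf. (16)
    return (q[:, None] * P) * q[None, :]     # equivalent to diag(q) @ P @ diag(q)
```

Wrapping each of the five basic kernels of (1)-(5) this way produces the five data-dependent candidates K_1, ..., K_5, among which (25)-(27) select by the magnitude of the leading eigenvalue.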

3.2 Kernel Combinatory Optimization

In this section, we propose a principal composite kernel function that is defined as the weighted sum of a set of different optimized kernel functions [41], [42]. To obtain an optimum kernel process, we define the following composite kernel:

K_comp(ρ) = Σ_{i=1}^p ρ_i Q_i P_i Q_i,  (28)

where ρ_i is the constant scalar value of the composite coefficient, and p is the number of kernels we intend to combine. Through this approach, the relative contribution of the kernels to the model can be varied over the input space. We note that in (28), instead of using K_r as a kernel matrix, we use K_comp as a composite kernel matrix. According to [46], K_comp satisfies Mercer's condition.

We use a linear combination of individual kernels to yield an optimal composite kernel using the concept of kernel alignment, the "conformal transformation of a kernel." The empirical alignment between kernel k_1 and kernel k_2 with respect to the training set S is the following quantity:

A(k_1, k_2) = ⟨K_1, K_2⟩_F / (‖K_1‖_F ‖K_2‖_F),  (29)

where K_i is the kernel matrix for the training set S using kernel function k_i, ‖K_i‖_F = √⟨K_i, K_i⟩_F, and ⟨K_i, K_j⟩_F is the Frobenius inner product between K_i and K_j. Here S = {(x_i, y_i) | x_i ∈ X, y_i ∈ {+1, −1}, i = 1, 2, ..., n}, X is the input space, and y is the target vector. Let K_2 = yy^T; then the empirical alignment between kernel K and the target vector y is

A(k, yy^T) = ⟨K, yy^T⟩_F / (‖K‖_F ‖yy^T‖_F) = y^T K y / (n ‖K‖_F).  (30)

It has been shown that if a kernel is well aligned with the target information, there exists a separation of the data with a low bound on the generalization error. Thus, we can optimize the kernel alignment based on training set information to improve the generalization performance on the test set. Let us consider the combination of kernel functions as follows:

k(η) = Σ_{i=1}^p η_i k_i,  (31)

where the individual kernels k_i, i = 1, 2, ..., p, are known in advance. Our purpose is to tune η to maximize A(η, k, yy^T), the empirical alignment between k(η) and the target vector y. Hence, we have

η̂ = arg max_η A(η, k, yy^T)  (32)

  = arg max_η [ ⟨Σ_i η_i K_i, yy^T⟩_F / √( n ⟨Σ_i η_i K_i, Σ_j η_j K_j⟩_F ) ]

  = arg max_η [ ( Σ_i η_i u_i )² / ( n Σ_{i,j} η_i η_j v_{ij} ) ]  (33)

  = arg max_η (1/n) (η^T U η) / (η^T V η),

where u_i = ⟨K_i, yy^T⟩_F, v_{ij} = ⟨K_i, K_j⟩_F, U_{ij} = u_i u_j, V_{ij} = v_{ij}, and η = [η_1, η_2, ..., η_p]^T. Let the generalized Rayleigh coefficient be

J(η) = (η^T U η) / (η^T V η).  (34)

Therefore, we can obtain the value of η̂ by solving the generalized eigenvalue problem

U η = δ V η,  (35)

where δ denotes the eigenvalues.

The PC-KFA algorithm contains an eigenvalue problem of rank n, so the computational complexity of PC-KFA is as follows: Step 1 requires O(n²) operations, Step 2 requires n, Step 3 requires n operations, and Step 4 requires n operations. The total computational complexity is O(n²).
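A compact way to realize (28)-(35) is to pose the alignment maximization as the generalized eigenvalue problem Uη = δVη and keep the leading eigenvector. The sketch below is one hedged reading of those equations: the exact definitions of U and V are partly garbled in the extracted text, so here U = uu^T with u_i = ⟨K_i, yy^T⟩_F and V_ij = ⟨K_i, K_j⟩_F, and V is assumed to be positive definite (i.e., the candidate kernel matrices are linearly independent).

```python
import numpy as np
from scipy.linalg import eigh

def composite_kernel(kernels, y):
    """kernels: list of p (n x n) data-dependent kernel matrices; y: labels in {+1, -1}.
    Returns the composite kernel of (28) with weights taken from the leading
    generalized eigenvector of U eta = delta V eta, cf. (33)-(35)."""
    Y = np.outer(y, y)
    u = np.array([np.sum(K * Y) for K in kernels])                  # u_i = <K_i, yy^T>_F
    V = np.array([[np.sum(Ki * Kj) for Kj in kernels] for Ki in kernels])
    U = np.outer(u, u)                                              # U_ij = u_i u_j
    _, vecs = eigh(U, V)                                            # generalized eigenproblem (35)
    eta = vecs[:, -1]                                               # eigenvector of the largest delta
    if eta.sum() < 0:                                               # fix an arbitrary sign
        eta = -eta
    K_comp = sum(w * K for w, K in zip(eta, kernels))               # weighted sum, cf. (28)
    return K_comp, eta
```

K_comp can then be fed to the same KPCA/AKFA machinery used for a single kernel.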

TABLE 2
Colon Cancer Data Set 1 (Low Resolution)

TABLE 3
Colon Cancer Data Set 2

TABLE 4
Colon Cancer Data Set 3

TABLE 5
Colon Cancer Data Set 4

4 EXPERIMENTAL ANALYSIS

4.1 Cancer Image Data Sets

4.1.1 Colon Cancer
This data set consisted of true-positive (TP) and false-positive (FP) detections obtained from our previously developed CAD scheme for the detection of polyps [5], when it was applied to a CTC image database. This database contained 146 patients who underwent a bowel preparation regimen with a standard precolonoscopy bowel-cleansing method. Each patient was scanned in both supine and prone positions, resulting in a total of 292 CT data sets. In the scanning, helical single-slice or multislice CT scanners were used, with collimations of 1.25-5.0 mm, reconstruction intervals of 1.0-5.0 mm, X-ray tube currents of 50-260 mA, and voltages of 120-140 kVp. In-plane voxel sizes were 0.51-0.94 mm, and the CT image matrix size was 512 × 512. Out of 146 patients, there were 108 normal cases and 38 abnormal cases with a total of 39 colonoscopy-confirmed polyps larger than 6 mm.

The CAD scheme was applied to the entire set of cases, and it generated a segmented region for each of its detections (a candidate polyp). A volume of interest (VOI) of size 64 × 64 × 64 voxels was placed at the center of mass of each candidate to encompass its entire region; then, it was resampled to 12 × 12 × 12 voxels. The resulting VOIs of 39 TP and 149 FP detections from the CAD scheme made up colon cancer data set 1.

Additional CTC image databases with a similar cohort of patients were collected from three different hospitals in the US. The VOIs obtained from these databases were resampled to 16 × 16 × 16 voxels. We refer to the resulting data sets as colon cancer data sets 2, 3, 4, 5, and 6; Tables 2, 3, 4, 5, and 6 give the distribution of the training and testing VOIs, respectively.
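A VOI preparation step like the one described above can be reproduced, for example, with scipy.ndimage.zoom; the snippet below is only a schematic of the resampling from 64³ to 12³ voxels, with a random array standing in for a real CT subvolume and a linear interpolation order chosen as an assumption.

```python
import numpy as np
from scipy.ndimage import zoom

voi = np.random.rand(64, 64, 64)               # placeholder for a 64x64x64 CT VOI
resampled = zoom(voi, zoom=12 / 64, order=1)   # downsample to 12x12x12 voxels
assert resampled.shape == (12, 12, 12)
feature_vector = resampled.ravel()             # 1,728-dimensional pattern for the kernel methods
```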

TABLE 6
Colon Cancer Data Set 5

TABLE 7
Breast Cancer Data Set

TABLE 8
Lung Cancer Data Set

TABLE 9
Lymphoma Data Set

TABLE 10
Prostate Cancer Data Set

4.1.2 Breast Cancer
We extended our own colon cancer data sets with other cancer-relevant data sets. This data set is available at http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE2990. It contains data on 189 women, 64 of whom were treated with tamoxifen, with primary operable invasive breast cancer, with a feature dimension of 22,283. More information on this data set can be found in [50].

4.1.3 Lung Cancer
This data set is available at http://www.broadinstitute.org/cgi-in/cancer/datasets.cgi. It contains 160 tissue samples, 139 of which are of class '0' and the remaining of class '2'. Each sample is represented by the expression levels of 1,000 genes for each feature dimension.

4.1.4 Lymphoma
This data set is available at http://www.broad.mit.edu/mpr/lymphoma. It contains 77 tissue samples, 58 of which are diffuse large B-cell lymphomas and the remainder follicular lymphomas, with a feature dimension of 7,129. Detailed information about this data set can be found in [48].

4.1.5 Prostate Cancer
This data set is collected from http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE6919. It contains prostate cancer data collected from 308 patients, 75 of whom have metastatic prostate tumors; the rest of the cases were normal, with a feature dimension of 12,553. More information on this data set can be found in [51] and [52].

4.2 Kernel Selection
We first evaluate the performance of the kernel selection according to the method proposed in Section 3.1, regarding how to select the kernel function that will best fit the data. The larger the eigenvalue is, the greater the class separability measure J in (22) is expected to be. Table 11 shows the calculation of the algorithm for all the data sets mentioned, to determine the eigenvalues of all five kernels. Specifically, we set the parameters such as d, offset, σ, κ₀, and κ₁ of each kernel in (1)-(5), and computed their eigenvalues for all nine data sets in Tables 2, 3, 4, 5, 6, 7, 8, 9, and 10. After arranging the eigenvalues for each data set in descending order, we selected the kernel corresponding to the largest eigenvalue as the optimum kernel.

The largest eigenvalue for each data set is highlighted in Table 11. After evaluating the quantitative eigenvalues for all nine data sets, we observed that the RBF kernel gives the maximum eigenvalue among all five kernels. That means that the RBF kernel produced the dominant results compared to the other four kernels. For five data sets, colon cancer data sets 2, 3, 4, 5, and 6, and for the lymphoma data set, the polynomial kernel produced the second largest eigenvalue. The linear kernel gave the second largest eigenvalue for colon cancer data set 1 and the lung cancer data set, whereas the Laplace kernel produced the second largest eigenvalue for the breast cancer data set.

TABLE 11
Eigenvalues of the Five Kernel Functions (1)-(5) and Their Selected Parameters

TABLE 12
Mean-Square Reconstruction Error of KPCA, SKFA, and AKFA with the Selected Kernel Function

TABLE 13
Mean-Square Reconstruction Error of KPCA with the Other Four Kernel Functions

TABLE 14
Linear Combination η̂ for the Selected Two Kernel Functions

As shown in Table 11, the Gaussian RBF kernel showed the largest eigenvalues for all nine data sets. The performance of the selected Gaussian RBF was compared to that of the other single kernel functions in terms of the reconstruction error value. As a further experiment, the reconstruction error results were evaluated for KPCA using MErr = (1/n) Σ_{i=l+1}^n λ_i, and for AKFA and SKFA using Err_i = ‖φ_i − φ'_i‖² = k_{ii} − K_i^T C_l C_l^T K_i, with the optimum kernel (RBF) selected from Table 11. We list the selected kernel, the dimensions of the eigenspace (chosen empirically), and the reconstruction errors of KPCA, SKFA, and AKFA for all the data sets in Table 12.

Table 12 shows that RBF, the single kernel selected, has a relatively small reconstruction error, from 3.27 percent up to 14.30 percent in KPCA. The reconstruction error of KPCA is less than that of AKFA, by 0.6 percent up to 6.29 percent. The difference in the reconstruction error between KPCA and AKFA increased as the size of the data sets increased. This could be due to the heterogeneous nature of the data sets. The lymphoma data set produced the smallest mean square error, whereas colon cancer data set 3 produced the largest mean-square error for both KPCA and AKFA.

Table 13 shows that the other four kernel functions have much more error than the Gaussian RBF shown in Table 12. The difference between Tables 12 and 13 is a reconstruction error more than four times larger, and sometimes 20 times larger, when the other four kernel functions are applied.
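For the KPCA entries of Tables 12 and 13, the mean-square reconstruction error can be read directly off the kernel eigenvalue spectrum, MErr = (1/n) Σ_{i=l+1}^n λ_i. The helper below computes this quantity; reporting it as a percentage of the total spectrum is our assumption about the normalization used in the tables.

```python
import numpy as np

def mean_square_reconstruction_error(eigvals, n_retained):
    """MErr = (1/n) * sum of the eigenvalues beyond the n_retained largest."""
    lam = np.sort(np.asarray(eigvals))[::-1]
    discarded = lam[n_retained:].sum()
    merr = discarded / lam.size
    percent = 100.0 * discarded / lam.sum()   # assumed percentage normalization
    return merr, percent
```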

TABLE 15
Mean-Square Reconstruction Error with Kernel Combinatory Optimization

TABLE 16
Classification Accuracy Using Six Nearest Neighborhoods for Single-Kernel and Two-Composite-Kernels with KPCA, SKFA, AKFA, and PC-KFA

TABLE 17
Overall Classification Comparison among Other Multiple Kernel Methods

4.3 Kernel Combination and Reconstruction
After selecting the number of kernels, we select the first p kernels that produced the p largest eigenvalues in Table 11, and combine them according to the method proposed in Section 3.2 to yield a smaller reconstruction error. Table 14 shows the coefficients calculated for the linear combination of kernels. After obtaining the linear coefficients according to (35), we combine the kernels according to (28) to generate the composite kernel matrix K_comp(η). Table 15 shows the reconstruction error results for both KPCA and AKFA along with the composite kernel K_comp(η).

The reconstruction error using two composite kernel functions, shown in Table 15, is smaller than the reconstruction error with the single kernel function RBF in Table 12. This leads us to claim that, for all nine data sets, the reconstruction ability of kernel-optimized KPCA and AKFA gives enhanced performance over that of single-kernel KPCA and AKFA. The specific improvement in the reconstruction error performance is greater by up to 4.27 percent in the case of KPCA, and by up to 5.84 and 6.12 percent in the cases of AKFA and SKFA by mean, respectively. The best improvement of the error performance is observed in PC-KFA, by 4.21 percent by mean. This improvement in reconstruction of all data sets is validated using PC-KFA. This successfully shows that the composite kernel produces only a small reconstruction error.

4.4 Kernel Combination and Classification
To analyze how the feature extraction methods affect the classification performance of polyp candidates, we used the k-nearest neighborhood classifier on the image vectors in the reduced eigenspace. We evaluated the performance of this simple classifier by applying it to the kernel feature spaces obtained by KPCA and AKFA with both the selected single kernel and the composite kernel for all nine data sets. Six nearest neighbors were used for the classification. The classification accuracy was calculated as (TP + TN) / (TP + TN + FN + FP). The results of the classification accuracy showed very high values, as shown in Table 16.

The results from Table 16 indicate that the classification accuracy of the composite kernel is better than that of the single kernel for both KPCA and AKFA in colon cancer data set 1, breast cancer, lung cancer, lymphoma, and prostate cancer, whereas in the case of colon cancer data sets 2, 3, 4, 5, and 6, because of the huge size of the data, the classification accuracy is very similar between single and composite kernels. From this quantitative characteristic among the entire nine data sets, we can conclude that the composite kernel improved the classification performance, and that with both single and composite kernels the classification performance of AKFA is equally as good as that of KPCA, from 85.48 percent up to 98.61 percent. The best classification performance was shown by PC-KFA, up to 99.70 percent.

4.5 Comparisons with Other Composite Kernel Learning Studies
In this section, we make experimental comparisons of the proposed PC-KFA with other popular MKL techniques, such as regularized kernel discriminant analysis (RKDA) for MKL [32], L2 regularization learning [71], and generality MKL [69], in Table 17, as follows.
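The accuracy figure of Section 4.4 can be reproduced with a six-nearest-neighbor classifier applied to the reduced kernel eigenspace. The sketch below uses scikit-learn for brevity (the original experiments were run in Matlab), and the function name and feature inputs are placeholders; for ±1 labels the mean of correct predictions equals (TP + TN) / (TP + TN + FP + FN).

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def knn_accuracy(train_feats, train_labels, test_feats, test_labels, k=6):
    """Six-nearest-neighbor classification accuracy in the reduced eigenspace."""
    clf = KNeighborsClassifier(n_neighbors=k).fit(train_feats, train_labels)
    pred = clf.predict(test_feats)
    return float(np.mean(pred == np.asarray(test_labels)))
```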

TABLE 18 PC-KFA Computation Time for Kernel Selection and Operation with KPCA, SKFA, AKFA, and PC-KFA

Fig. 1. The computation time comparison between KPCA, SKFA, AKFA, and PC-KFA as the data size increases.

To evaluate the algorithms [32], [71], [69], eight data sets are used in the binary-class case from the UCI machine learning repository [61], [72]. L2 regularization learning [71] reports a misclassification ratio, which may not be directly comparable to the other three methods. The proposed PC-KFA outperformed these representative approaches. For example, PC-KFA for lung was 94.12 percent, not as good as the performance of SMKL, but better than RKDA for MKL and GMKL. The classification accuracy of RKDA for MKL on data set 4 and prostate is better than that of GMKL. This result indicates that PC-KFA is very competitive with the well-known classifiers for multiple data sets.

4.6 Computation Time
We finally evaluate the computational efficiency of the proposed PC-KFA method by comparing its runtime with KPCA and AKFA for all nine data sets, as shown in Table 18. The algorithms have been implemented in Matlab R2007b using the statistical toolbox for the Gram matrix calculation and kernel projection. The processor was a 3.2-GHz Intel Pentium 4 CPU with 3 GB of RAM. Runtime was determined using the CPU time command.

For each algorithm, computation time increases with increasing training data size (n), as expected. AKFA requires the computation of a Gram matrix whose size increases as the data size increases. The results from the table clearly indicate that AKFA is faster than KPCA. We also noticed that the decrease in computation time for AKFA compared to KPCA was relatively small, implying that the use of AKFA on a smaller training data set does not yield much advantage over KPCA. However, as the data size increases, the computational gain for AKFA is much larger than that of KPCA, as shown in Fig. 1. PC-KFA shows more computational time, since the composite data-dependent kernels need calculations of a Gram matrix and optimization of the coefficient parameters.

Fig. 1 illustrates the increase in the computational time of both KPCA and AKFA corresponding to increased data. Using Table 18, we listed the sizes of all the data sets in ascending order, from lymphoma (77) to colon cancer data set 4 (3021), on the x-axis versus the respective computational times on the y-axis. The red curve indicates the computational time for KPCA, whereas the blue curve indicates the computational time for AKFA, for all nine data sets arranged in ascending order of their sizes. This curve clearly shows that as the size of the data increases, the computational gain of AKFA is greater. This indicates that AKFA is a powerful algorithm and that it approaches the performance of KPCA while allowing for significant computational savings.

5 CONCLUSION

This paper first describes AKFA, a faster and more efficient feature extraction algorithm derived from SKFA. The time complexity of AKFA is O(ln²), which has been shown to be more efficient than the O(l²n²) time complexity of SKFA and the O(n³) complexity of the more systematic kernel principal component analysis (KPCA). We have extended these methods into PC-KFA. By introducing a principal component metric in PC-KFA, the new criteria performed well in choosing the best kernel function adapted to the data set, as well as in extending this process of best kernel selection to additional kernel functions by calculating a linear composite kernel space. We conducted comprehensive experiments using nine cancer data sets for evaluating the reconstruction error, the classification accuracy using a k-nearest neighbor classifier, and the computational time. PC-KFA with KPCA and AKFA had a lower reconstruction error compared to the single kernel method, thus demonstrating that the features extracted by the composite kernel method are practically useful to represent the data sets. The composite kernel approach with KPCA and AKFA has the potential to yield high detection performance for polyps, resulting in accurate classification of cancers, compared to the single kernel method. The computation time was also evaluated across the variable data sizes, with a tradeoff between computational time and accuracy, and showed a comparative advantage of composite kernel AKFA.

ACKNOWLEDGMENTS

The authors would like to express gratitude for the support of this research by the American Cancer Society Institutional Research Grant through the Massey Cancer Center, the Presidential Research Incentive Program at Virginia Commonwealth University, US National Science Foundation ECCS #1054333 (CAREER Award), and USPHS Grant R01CA095279 from the National Cancer Institute. The work reported here would not be possible without the help of many of the past and present members of the laboratory, in particular, D. Ma, A. Docef, S. Myla, and L. Winter.

REFERENCES [23] J.H. Friedman, “Flexible Metric Nearest Neighbor Classification,” technical report, Dept. of Statistics, Stanford Univ., Nov. 1994. [1] S. Winawer, R. Fletcher, D. Rex, J. Bond, R. Burt, J. Ferrucci, [24] T. Hastie and R. Tibshirani, “Discriminant Adaptive Nearest T. Ganiats, T. Levin, S. Woolf, D. Johnson, L. Kirk, S. Litin, Neighbor Classification,” IEEE Trans. Pattern Analysis and M achine and C. Simmang, “Colorectal Cancer Screening and Surveil- Intelligence, vol. 18, no. 6, pp. 607-616, June 1996. lance: Clinical Guidelines and Rationale-Update Based on New [25] D.G. Lowe, “Similarity Metric Learning for a Variable-Kernel Evidence,” Gastroenterology, vol. 124, pp. 544-560, Feb. 2003. Classifier,” Neural Computation, vol. 7, no. 1, pp. 72-85, 1995. [2] K.D.Bodily,J.G.Fletcher,T.Engelby,M.Percival,J.A.Christensen, [26] J. Peng, D.R. Heisterkamp, and H.K. Dai, “Adaptive Kernel Metric B. Young, A.J. Krych, D.C. Vander Kooi, D. Rodysill, J.L. Fidler, Nearest Neighbor Classification,” Proc. 16th Int’l Conf. Pattern and C.D. Johnson, “Nonradiologists as Second Readers for Recognition, vol. 3, pp. 33-36, 11-15 Aug. 2002. Intraluminal Findings at CT Colonography,” A cademi c Radi ol ogy, [27] Q.B. Gao and Z.Z. Wang, “Center-Based Nearest Neighbor vol. 12, pp. 67-73, Jan. 2005. Classifier,” PatternRecognition, vol. 40, pp. 346-349, 2007. [3] J.G. Fletcher, F. Booya, C.D. Johnson, and D. Ahlquist, “CT [28] S. Li and J. Lu, “Face Recognition Using the Nearest Feature Line Colonography: Unraveling the Twists and Turns,” Current Method,” I EEE Tr ans. N eur al N et wor ks, vol. 10, no. 2, pp. 439-443, Opi ni on i n Gast r oent er ol ogy, vol. 21, pp. 90-8, Jan. 2005. Mar. 1999. [4] H. Yoshida and A.H. Dachman, “CAD Techniques, Challenges, and Controversies in Computed Tomographic Colonography,” [29] P. Vincent and Y. Bengio, “K-local Hyperplane and Convex Abdominal Imaging, vol. 30, pp. 26-41, Jan./ Feb. 2005. Distance Nearest Neighbor Algorithms,” Pr oc. A dvances i n [5] H.YoshidaandJ.Na¨ppi, “Three-Dimensional Computer-Aided N eur al I nfor mat i on Pr ocessi ng Syst ems Conf., vol. 14, pp. 985- Diagnosis Scheme for Detection of Colonic Polyps,” IEEE Trans. 992, 2002. M edical Imaging, vol. 20, no. 12, pp. 1261-1274, Dec. 2001. [30] W. Zheng, L. Zhao, and C. Zou, “Locally Nearest Neighbor [6] R.M. Summers, C.F. Beaulieu, L.M. Pusanik, J.D. Malley, R.B. Classifiers for Pattern Classification,” PatternRecognition, vol. 37, JeffreyJr.,D.I.Glazer,andS.Napel,“AutomatedPolypDetector pp. 1307-1309, 2004. for CT Colonography: Feasibility Study,” Radiology, vol. 216, [31] T. Damoulas and M.A. Girolami, “Probabilistic Multi-Class Multi- pp. 284-290, 2000. Kernel Learning: On Protein Fold Recognition and Remote [7] R.M. Summers, M. Franaszek, M.T. Miller, P.J.Pickhardt, J.R. Choi, Homology Detection,” vol. 24, no. 10, pp. 1264-1270, 2008. andW.R.Schindler,“Computer-AidedDetectionofPolypson [32] J.Ye, S.Ji, and J.Chen, “Multi-Class Discriminant Kernel Learning Oral Contrast-Enhanced CT Colonography,” Am. J. Roentgenology, Via Convex Programming,” J. Machine Learning Research, vol. 9, vol. 184, pp. 105-108, Jan. 2005. pp. 719-758, 1999. [8] G. Kiss, J. Van Cleynenbreugel, M. Thomeer, P. Suetens, and [33] B. Souza and A. de Carvalho, “Gene Selection Based on Multi- G. Marchal, “Computer-Aided Diagnosis in Virtual Colono- Class Support Vector Machines and Genetic Algorithms,” graphy via Combination of Surface Normal and Sphere Fitting MolecularResearch, vol. 4, no. 3, pp. 599-607, 2005. 
Methods,” Eur opean Radi ol ogy, vol. 12, pp. 77-81, Jan. 2002. [34] B. Sch o¨lkopf and A.J. Smola, Learning with Kernels, pp. 211-214, [9] D.S.Paik,C.F.Beaulieu,G.D.Rubin,B.Acar,R.B.JeffreyJr.,J.Yee, MIT Press, 2002. J. Dey, and S. Napel, “Surface Normal Overlap: A Computer- [35] H. Xiong, Y. Zhang, and X.-W. Chen, “Data-Dependent Kernel Aided Detection Algorithm with Application to Colonic Polyps Machines for Microarray Data Classification,” IEEE/ACM Trans. and Lung Nodules in Helical CT,” IEEE Trans. M edical Imaging, Computational Biology and Bioinformatics, vol. 4, no. 4, pp. 583-595, vol. 23, no. 6, pp. 661-675, June 2004. Oct.-Dec. 2007. [10] A.K. Jerebko, R.M. Summers, J.D. Malley, M. Franaszek, and [36] H. Xiong, M.N.S. Swamy, and M.O. Ahmad, “Optimizing the C.D. Johnson, “Computer-Assisted Detection of Colonic Polyps Data-Dependent Kernel in the Empirical Feature Space,” IEEE with CT Colonography Using Neural Networks and Binary Tr ans. N eur al N et wor ks, vol. 16, no. 2, pp. 460-474, Mar. 2005. Classification Trees,” M edi cal Physi cs, vol. 30, pp. 52-60, Jan. [37] G.C. Cawley, “MATLAB Support Vector Machine Toolbox, School 2003. of Information Systems, Univ. of East Anglia,” http:/ / theoval. [11] J. Na¨ppi and H. Yoshida, “Fully Automated Three-Dimensional sys.uea.ac.uk/ ~gcc/ svm/ toolbox, 2000. Detection of Polyps in Fecal-Tagging CT Colonography,” Academic [38] Y. Raviv and N. Intrator, “Bootstrapping with Noise: An Radiology, vol. 14, pp. 287-300, Mar. 2007. Efficient Regularization Technique,” Connection Science, vol. 8, [12]A.K.Jerebko,J.D.Malley,M.Franaszek,andR.M.Summers, pp. 355-372, 1996. “Multiple Neural Network Classification Scheme for Detection of [39] H .A. Bu ch h old t, Structural Dynamics for Engineers. Thomas Colonic Polyps in CT Colonography Data Sets,” Academic Telford, 1997. Radiology, vol. 10, pp. 154-160, Feb. 2003. [40] R.O.Duda,P.E.Hart,andD.G.Stork,PatternClassification, second [13]A.K.Jerebko,J.D.Malley,M.Franaszek,andR.M.Summers, ed. John Wiley & Sons Inc., 2001. “Support Vector Machines Committee Classification Method for [41] Pothin and Richard, “A Greedy Algorithm for Optimizing the Computer-Aided Polyp Detection in CT Colonography,” Academic Kernel Alignment and the Performance of Kernel Machines,” Proc. Radiology, vol. 12, pp. 479-486, Apr. 2005. 14th European Signal Processing Conf. (EUSIPCO ’06), Sept. 2006. [14] V.N. Vapnik, The Nature of Statistical Learning Theory, second ed. [42] J. Kandola, J. Shawe-Taylor, and N. Cristianini, “Optimizing Springer, 2000. Kernel Alignment over Combinations of Kernels,” technical report [15] R. Courant and D. Hilbert, Methodsof Mathematical Physics, Vol. 1 NeuroCOLT, 2002. edition, Wiley-VCH, 1989. [16] O.L. Mangasarian, A.J. Smola, and B. Scho¨lkopf, “Sparse Kernel [43] M.E. Tipping, “Spare Kernel Principal Component Analysis,” Feature Analysis,” Technical Report 99-04, Univ. of Wisconsin, Pr oc. N eur al I nfor mat i on Pr ocessi ng Syst ems Conf. (N I PS ’ 00), 1999. pp. 633-639, 2000. [17] J. Franc and V. Hlavac, “Statistical Pattern Recognition Toolbox [44] H . Fr o¨hlich, O. Chappelle, and B. Scholkopf, “Feature forMatlab,” http://cmp.felk.cvut.cz/cmp/software/stprtool/, Selection for Support Vector Machines by Means of Genetic 2004. Algorithm,” Proc. IEEE 15th Int’l Conf. Tools with Artificial [18] “Partners Research Computing,” http:/ / www.partners.org/ Intelligence, pp. 142-148, 2003. rescomputing/ , 2006. [45] S. Mika, G. Ratsch, and J. Weston, “Fisher Discriminant Analysis [19] A.J. 
Sm ola and B. Scho¨lkopf, “Sparse Greedy Matrix Approx- with Kernels,” Pr oc. N eur al N et wor ks for Si gnal Pr ocessi ng Wor kshop, imation for Machine Learning,” Proc. 17th Int’l Conf. Machine pp. 41-48, 1999. Learning, 2000. [46] X.W. Chen, “Gene Selection for Cancer Classification Using [20] X. Jiang, R.R. Snapp, Y. Motai, and X. Zhu, “Accelerated Kernel Bootstrapped Genetic Algorithms and Support Vector Machines,” Feature Analysis,” Pr oc. I EEE CS Conf. Comput er V i si on and Pat t er n Pr oc. I EEE I nt ’ l Conf. Comput at i onal Syst ems, Bi oi nfor mat i cs, pp. 504- Recognition, pp. 109-116, 2006. 505, 2003. [21] V. France and V. Hlavaˇc, “Greedy Algorithm for a Training set [47] C. Park and S.B. Cho, “Genetic Search for Optimal Ensemble of Reduction in the Kernel Methods,” Comput er A nal ysi s of I mage and Feature-Classifier Pairs in DNA Gene Expression Profiles,” Proc. Patterns, vol. 2756, pp. 426-433, 2003. Int’l Joint Conf. Neural Networks, vol. 3, pp. 1702-1707, 2003. [22] K. Fukunaga and L. Hostetler, “Optimization of K-Nearest [48] F.A. Sadjadi, “Polarimetric Radar Target Classification Using Neighbor Density Estimates,” I EEE Tr ans. I nfor mat i on Theor y, Support Vector Machines,” Optical Eng., vol. 47, no. 4, p. 046201, vol. IT-19, no. 3, pp. 320-326, May 1973. Apr. 2008. MOTAI AND YOSHIDA: PRINCIPAL COMPOSITE KERNEL FEATURE ANALYSIS: DATA-DEPENDENT KERNEL APPROACH 1875

[49] T. Briggs and T. Oates, “Discovering Domain Specific Composite [71] C. Cortes, M. Mohri, and A. Rostamizadeh, “L2 Regularization for Kernels” Proc. 20th Nat’l Conf. Artificial Intelligenceand 17th Ann. Learning Kernels,” Pr oc. 25t h Conf. i n U ncer t ai nt y i n A r t i fi ci al Conf. InnovativeApplicationsArtificial Intelligence, pp. 732-739, 2005. Intelligence, 2009. [50] N. Cristianini, J. Kandola, A. Elisseeff, and J. Shawe-Taylor, “On [72] D.J.Newman,S.Hettich,C.L.Blake,andC.J.Merz,U CI Reposi t or y Kernel Target Alignment,” Pr oc. N eur al I nfor mat i on Pr ocessi ng of M achi ne Lear ni ng D at abases, URL http:/ / www.ics.uci.edu/ Syst ems Conf. (N I PS ’ 01), pp. 367-373. mlearn/ MLRepository.html. 1998. [51]M.A.Shipp,K.N.Ross,P.Tamayo,A.P.Weng,J.L.Kutok,R.C.T. [73] K. Chaudhuri, C. Monteleoni, and A.D. Sarwate, “Differentially Aguiar, M. Gaasenbeek, M. Angelo, M. Reich, G.S. Pinkus, T.S. Private Empirical Risk Minimization,” J. Machine Learning Re- Ray,M.A.Koval,K.W.Last,A.Norton,T.A.Lister,J.Mesirov, sear ch, vol. 12, pp. 1069-1109, Mar. 2011. D.S. Neuberg, E.S. Lander, J.C. Aster, and T.R. Golub, “Diffuse [74] V. Vapnik and O. Chapelle, “Bounds on Error Expectation for Large B-Cell Lymphoma Outcome Prediction by Gene Expression Support Vector Machines,” Neural Computation, vol. 12, no. 9, Profiling and Supervised Machine Learning,” Nature Medicine, pp. 2013-2036, Sept. 2000. vol. 8, pp. 68-74, 2002. [75] O. Chapelle, V. Vapnik, O. Bousquet, and S. Mukherjee, [52] S. Dudoit, J. Fridlyand, and T.P. Speed, “Comparison of “Choosing Multiple Parameters for Support Vector Machines,” Discrimination Method for the Classification of Tumor Using MachineLearning, vol. 46, pp. 131-159, 2002. Gene Expression Data,” J. A m. St at i st i cal A ssoc., vol. 97, pp. 77-87, [76] K.M. Chung, W.C. Kao, C.-L. Sun, L.-L. Wang, and C.J. Lin, 2002. “Radius Margin Bounds for Support Vector Machines with the RBF Kernel,” Neural Computation, vol. 15, no. 11, pp. 2643-2681, [53] C. Sotiriou et al., “Gene Expression Profiling in Breast Cancer: Nov. 2003. Understanding the Molecular Basis of Histologic Grade to [77]J.Yang,Z.Jin,J.Y.Yang,D.Zhang,andA.F.Frangi,“Essenceof Improve Prognosis,” J. N at ’ l Can cer I n st . , vol. 98, no. 4, pp. 262- Kernel Fisher Discriminant: KPCA Plus LDA,” Pat t er n Recogn i t i on , 272, Feb. 2006. vol. 37, pp. 2097-2100, 2004. [54] U.R. Chandran et al., “Gene Expression Profiles of Prostate Cancer Reveal Involvement of Multiple Molecular Pathways in the Yuichi Motai (M’01) received the BEng degree Metastatic Process,” BM C Cancer, pp. 7-64, Apr. 2007. in instrumentation engineering from Keio Uni- [55] Y.P. Yu et al., “Gene Expression Alterations in Prostate Cancer versity, Tokyo, Japan, in 1991, the MEng Predicting Tumor Aggression and Preceding Development of degree in applied systems science from Kyoto Malignancy,” J. Clinical Oncology, vol. 22, no. 14, pp. 2790-2799, University, Japan, in 1993, and the PhD degree July 2004. PMID: 15254046. in electrical and computer engineering from [56] W. Zheng, C. Zou, and L. Zao, “An Improved Algorithm for Purdue University, West Lafayette, Indiana, in Kernel Principal Component Analysis,” Neural Processing Letters, 2002. He is currently an associate professor of vol. 22, no. 1, pp. 49-56, 2005. electrical and computer engineering at Virginia [57] C. Ding and I. Dubchak, “Multi-Class Protein Fold Recognition Commonwealth University, Richmond. 
His re- Using Support Vector Machines and Neural Networks,” Bi oi nfor - search interests include the broad area of sensory intelligence, matics, vol. 17, no. 4, pp. 349-358, Apr. 2001. particularly in online classification and target tracking, applied to [58] P. Pavlidis et al., “Learning Gene Functional Classifications from medical imaging, pattern recognition, , and robotics. Multiple Data Types,” J. Computational Biology, vol. 9, no. 2, He is a senior member of the IEEE. pp. 401-411, 2002. [59] O. Chapelle et al., “Choosing Multiple Parameters for Support Hiroyuki Yoshida (M’96) received the BS and Vector Machines,” MachineLearning, vol. 46, nos. 1-3, pp. 131-159, MS degrees in physics and the PhD degree in 2002. information science from the University of Tokyo, [60] C.S. Ong, A.J. Smola, and R.C. Williamson, “Learning the Kernel Japan, in 1989. He previously held an assistant with Hyperkernels,” J. M achi n e L ear n i n g Resear ch, vol. 6, pp. 1043- professorship in the Department of Radiology at 1071, July 2005. the University of Chicago. He was a tenured associate professor when he left the university [61] G.R.G. Lanckriet, N. Cristianini, P. Bartlett, L.E. Ghaoui, and M.I. and joined the Massachusetts General Hospital Jordan, “Learning the Kernel Matrix with Semidefinite Program- (MGH) and Harvard Medical School (HMS) in ming,” J.MachineLearning Research, vol. 5, pp. 27-72, 2004. 2005, where he is currently the director of 3D [62] V. Atluri, S. Jajodia, and E. Bertino, “Transaction Processing in Imaging Research in the Department of Radiology, MGH, and an Multilevel Secure Databases with Kernelized Architecture: Chal- associate professor of radiology at HMS. His research interests include lenges and Solutions,” IEEE Trans. Knowledge and Data Eng., vol. 9, computer-aided diagnosis and quantitative medical imaging, in particular no. 5, pp. 697-708, Sept./ Oct. 1997. the diagnosis of colorectal cancers in CT colonography, for which he has [63]H.Cevikalp,M.Neamtu,andA.Barkana,“TheKernelCommon been the principal investigator on eight national research projects funded Vector Method: A Novel Nonlinear Subspace Classifier for Pattern by the National Institutes of Health and two cancer projects by the Recognition,” I EEE Tr ans. Syst ems, M an, and Cyber net i cs, Par t B: American Cancer Society, and received nine awards (Magna Cum Cybernetics, vol. 37, no. 4, pp. 937-951, Aug. 2007. Laude, two Cum Laude, four Certificate of Merit, and two Excellences in [64] G. Horvath and T. Szabo, “Kernel CMAC With Improved Design) from the Annual Meetings of Radiological Society of North Capability,” I EEE Tr ans. Syst ems, M an, and Cyber net i cs, Par t B: America and Honorable Mention award from the SPIE Medical Imaging. Cybernetics, vol. 37, no. 1, pp. 124-138, Feb. 2007. He is the author and coauthor of more than 100 papers and 16 book [65] C. Heinz and B. Seeger, “Cluster Kernels: Resource-Aware Kernel chapters, author and editor of 10 books, and inventor and coinventor of Density Estimators over Streaming Data,” I EEE Tr ans. Knowl edge eight patents. He was a guest editor of the IEEE Transaction on Medical and D at a Eng., vol. 20, no. 7, pp. 880-893, July 2008. Imaging in 2004, and currently serves on the editorial boards of the [66] S. Amari and S. Wu, “Improving Support Vector Machine International Journal of Computer Assisted Radiology, the International Classifiers by Modifying Kernel Functions,” Neural Network, Journal of Computers in Healthcare, and the Intelligent Decision vol. 
12, no. 6, pp. 783-789, 1999. Technology: An International Journal. He is a member of the IEEE. [67] Y. Tan and J. Wang, “A Support Vector Machine with a Hybrid Kernel and Minimal Vapnik-Chervonenkis Dimension,” IEEE Tr ans. Knowl edge and D at a Eng., vol. 16, no. 4, pp. 385-395, Apr. . For more information on this or any other computing topic, 2004. please visit our Digital Library at www.computer.org/publications/dlib. [68] A. Rakotomamonjy, F.R. Bach, S. Canu, and Y. Grandvalet, “SimpleMKL,” J.MachineLearning Research, vol. 9, pp. 2491-2521, Nov. 2008. [69] M. Varma and B.R. Babu, “More Generality in Efficient Multiple Kernel Learning,” Proc. Int’l Conf. Machine Learning, June 2009. [70]K.Duan,S.S.Keerthi,andA.N.Poo,“EvaluationofSimple Performance Measures for Tuning SVM Hyperparameters,” N eur ocomput i ng, vol. 51, pp. 41-59, Apr. 2003.