Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (AAAI-17)

Fredholm Multiple Kernel Learning for Semi-Supervised Domain Adaptation

Wei Wang,1 Hao Wang,1,2 Chen Zhang,1 Yang Gao1
1. Science and Technology on Integrated Information System Laboratory
2. State Key Laboratory of Computer Science
Institute of Software, Chinese Academy of Sciences, Beijing 100190, China
[email protected]

Abstract

As a fundamental constituent of transfer learning, domain adaptation generalizes a learning model from a source domain to a different (but related) target domain. In this paper, we focus on semi-supervised domain adaptation and explicitly extend the applied range of unlabeled target samples into the combination of distribution alignment and adaptive classifier learning. Specifically, our extension formulates the following aspects in a single optimization: 1) learning a cross-domain predictive model by developing the Fredholm integral based kernel prediction framework; 2) reducing the distribution difference between the two domains; 3) exploring multiple kernels to induce an optimal learning space. Correspondingly, such an extension is distinguished by allowing for noise resiliency, facilitating knowledge transfer and analyzing diverse data characteristics. We prove the differentiability of our formulation and present an effective optimization procedure based on the reduced gradient, guaranteeing rapid convergence. Comprehensive empirical studies verify the effectiveness of the proposed method.

Copyright © 2017, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Introduction

Conventional supervised learning aims at generalizing a model inferred from labeled training samples to test samples. For good generalization capability, it is required to collect and label plenty of training samples following the same distribution as the test samples, which is extremely expensive in practical applications. To relieve the contradiction between generalization performance and labeling cost, domain adaptation (Pan and Yang 2010) has been proposed to transfer knowledge from a relevant but different source domain with sufficient labeled data to the target domain. It gains increased importance in many applied areas of machine learning (Daumé III 2007; Pan et al. 2011), including natural language processing, computer vision and WiFi localization.

Supervised domain adaptation takes advantage of both abundant labeled samples from the source domain and a relatively small number of labeled samples from the target domain. A line of recent work in this supervised setting (Aytar and Zisserman 2011; Daumé III and Marcu 2006; Liao, Xue, and Carin 2005; Hoffman et al. 2014) learns cross-domain classifiers by adapting traditional models, e.g., support vector machines and statistical classifiers, to the target domain. However, these methods do not explicitly explore the distribution inconsistency between the two domains, which essentially limits their capability of knowledge transfer. To this end, many semi-supervised domain adaptation methods have been proposed within the classifier adaptation framework (Chen, Weinberger, and Blitzer 2011; Duan, Tsang, and Xu 2012; Sun et al. 2011; Wang, Huang, and Schneider 2014). Compared with supervised ones, they take unlabeled target samples into account to cope with the inconsistency of data distributions, showing improved classification and generalization capability. Nevertheless, most work in this area only incorporates unlabeled target samples in distribution alignment, not in adaptive classifier learning. Large quantities of unlabeled target samples potentially contain rich information, and incorporating them is desirable for noise resiliency. Moreover, some work in this semi-supervised setting is customized for certain classifiers, and the extension to other predictive models remains unclear.

In conventional supervised learning, kernel methods provide a powerful and unified prediction framework for building non-linear predictive models (Rifkin, Yeo, and Poggio 2003; Shawe-Taylor and Cristianini 2004; Yen et al. 2014). To further incorporate unlabeled samples with labeled ones in model learning, the kernel prediction framework has been developed into the semi-supervised setting by formulating a regularized Fredholm integral equation (Que, Belkin, and Wan 2014). Although this development is proven theoretically and empirically to be effective in noise suppression, its performance heavily depends on the choice of a single predefined kernel function, because its solution is based on the Representer Theorem (Schölkopf and Smola 2001) in the Reproducing Kernel Hilbert Space (RKHS) induced by that kernel. More importantly, these kernel methods are not developed for domain adaptation, and the distribution difference will invalidate the predictive models across two domains.

In this paper, we focus on semi-supervised domain adaptation and extend the applied range of unlabeled target samples from distribution alignment into the whole learning process. This extension further explores the structure information in the target domain, enhancing robustness in complex knowledge propagation. The key challenge is to accomplish this extension through a single optimization combining the following three aspects: 1) learning a predictive model

across two domains based on Fredholm integrals, utilizing (labeled) source data and (labeled and unlabeled) target data; 2) reducing the distribution difference between domains; 3) exploring a convex combination of multiple kernels to optimally induce the RKHS. In dealing with the above difficulties, the proposed algorithm, named Transfer Fredholm Multiple Kernel Learning (TFMKL), makes a three-fold contribution. First, compared with traditional semi-supervised domain adaptation, TFMKL learns a predictive model for noise resiliency and improved adaptation power. It is also highly extensible due to its compatibility with many classifiers. Second, from the view of kernel prediction, the challenge of reducing the distribution difference is suitably addressed. Moreover, multiple kernels are optimally combined in TFMKL, allowing for analyzing useful data characteristics from various aspects and enhancing the interpretability of the predictive model. Third, instead of employing an alternating optimization technique, we prove the differentiability of our formulation and propose a simple but efficient procedure to perform reduced gradient descent, guaranteeing rapid convergence. Experimental results on a synthetic example and two real-world applications verify the effectiveness of TFMKL.

Related Work

In supervised domain adaptation, cross-domain classifiers (Aytar and Zisserman 2011; Hoffman et al. 2014) are learnt by using labeled source samples and a small number of labeled target samples. Meanwhile, some semi-supervised methods (Chen, Weinberger, and Blitzer 2011; Duan, Tsang, and Xu 2012; Sun et al. 2011; Wang, Huang, and Schneider 2014) have been proposed by combining the transfer of classifiers with the matching of distributions. This setting encourages domain-invariant characteristics, leading to improved adaptation results for classification. Specifically, (Chen, Weinberger, and Blitzer 2011) minimizes the conditional distribution difference and adapts logistic regression to target data simultaneously. Inspired by Maximum Mean Discrepancy (Borgwardt et al. 2006), the methods in (Sun et al. 2011; Wang, Huang, and Schneider 2014) match conditional or marginal distributions by transforming or re-weighting, and then use all processed labeled data to train traditional classifiers. The above methods ignore unlabeled target data in the process of learning adaptive classifiers, although the unlabeled data is desirable for robustness and noise resiliency. In (Duan, Tsang, and Xu 2012), unlabeled target data is used during learning an adaptive Support Vector Regression (SVR), but the method lacks theoretical support for noise suppression and is specific to SVR.

Kernel Prediction Framework

In the conventional supervised setting, the kernel prediction framework (Shawe-Taylor and Cristianini 2004) specifies a positive definite kernel function K and then estimates a predictive function f : X → Y from the labeled training set D = {(x_1, y_1), ..., (x_n, y_n)} ⊂ X × Y. Let H be the RKHS induced by K; this estimation is generally modeled as the following optimization problem over H:

f = \arg\min_{f \in H} \frac{1}{n} \sum_{i=1}^{n} L(f(x_i), y_i) + \beta \|f\|_H^2,   (1)

where β is a tradeoff parameter and L(z, y) is a risk function, e.g., the square loss (z − y)^2 or the hinge loss max(1 − zy, 0).

In the semi-supervised setting, suppose the labeled and the unlabeled training data are drawn from the distribution P(X, Y) and the marginal distribution P(X) respectively. Associated with an outside kernel k_P(x, z), the regularized Fredholm integral operator K_P for f(x) (Que, Belkin, and Wan 2014) is introduced to explore unlabeled data: K_P f(x) = \int_z k_P(x, z) f(z) P(z) dz. In these terms, Eq. (1) can be rewritten as f = \arg\min_{f \in H} \frac{1}{n} \sum_{i=1}^{n} L(K_P f(x_i), y_i) + \beta \|f\|_H^2. It has been shown that the performance of supervised algorithms can be degraded under the "noise assumption". By contrast, the f based on the Fredholm integral is resilient to noise and closer to the optimum, because this integral provides a good approximation to the "true" data space. However, this framework is effective only when labeled and unlabeled data follow the same distribution. In addition, it relies on the single kernel k_P, and the choice of k_P heavily influences the performance.

Domain Adaptation Method

In this section, the problem addressed by TFMKL is first defined. Second, we improve the kernel prediction framework and extend it to semi-supervised domain adaptation. Third, the distribution difference is minimized, which is a crucial element to gain more support from the source to the target domain. Fourth, the above two goals are unified in TFMKL. Furthermore, Multiple Kernel Learning (MKL) (Bach, Lanckriet, and Jordan 2004) is exploited within the framework to improve flexibility. Finally, we propose a reduced gradient descent procedure to update the kernel function and the predictive model simultaneously, explicitly enabling rapid convergence.

Notations and Settings

Suppose the data originates from two domains, i.e., a source and a target domain. The source data is a set of n_s fully labeled points D_s = {(x^s_1, y^s_1), ..., (x^s_{n_s}, y^s_{n_s})} ⊂ R^d × Y drawn from the distribution P_s(X, Y). The target data is divided into n_l (n_l ≪ n_s) labeled points D^l_t = {(x^t_1, y^t_1), ..., (x^t_{n_l}, y^t_{n_l})} ⊂ R^d × Y from the distribution P_t(X, Y) and n_u (n_u ≫ n_l) unlabeled points D^u_t = {x^t_{n_l+1}, ..., x^t_{n_l+n_u}} ⊂ R^d from the marginal distribution P_t(X). The label set Y is assumed to be either {+1, −1} for binary classification or the real line R for regression, for simplicity, but the setting can easily be generalized to the multi-class case. By making full use of all available data (i.e., labeled and unlabeled points) in every learning aspect, the task of TFMKL is to find an optimal predictive model f : R^d → Y having low prediction error with respect to the target domain.
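As a point of reference for the discretized Fredholm framework reviewed above, the following is a minimal NumPy sketch (not the authors' implementation) of the single-domain, square-loss case of Eq. (1) with the Fredholm operator: K_P f is approximated by its empirical average over all labeled and unlabeled points, f is expanded via the Representer Theorem, and the resulting ridge-type system is solved in closed form. Function names, kernel choices and the toy data are illustrative assumptions.

```python
import numpy as np

def gaussian_kernel(X, Z, bandwidth=1.0):
    """Gaussian (RBF) kernel matrix between the rows of X and Z."""
    sq_dists = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2))

def fredholm_ridge_fit(X_lab, y_lab, X_unlab, k_out, k_in, beta):
    """Single-domain regularized Fredholm prediction with square loss.

    K_P f(x) is approximated empirically by (1/n) * sum_i k_out(x, x_i) f(x_i)
    over all n labeled + unlabeled points, and f is expanded as
    f(x) = sum_j k_in(x, x_j) v_j (Representer Theorem), which turns the
    regularized objective into a linear system solved in closed form.
    """
    X_all = np.vstack([X_lab, X_unlab])
    n, n_lab = X_all.shape[0], X_lab.shape[0]
    K_in = k_in(X_all, X_all)               # kernel of the hypothesis space H
    A = k_out(X_lab, X_all) @ K_in / n      # row i: empirical K_P f at labeled x_i
    v = np.linalg.solve(A.T @ A / n_lab + beta * K_in, A.T @ y_lab / n_lab)
    return X_all, v

def fredholm_predict(X_new, X_all, v, k_out, k_in):
    """Evaluate K_P f at new points with the same empirical operator."""
    f_all = k_in(X_all, X_all) @ v          # f on the training points
    return k_out(X_new, X_all) @ f_all / X_all.shape[0]

# toy usage: both the outside kernel and the RKHS kernel are Gaussian here
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X_lab = rng.normal(size=(20, 2))
    y_lab = rng.choice([-1.0, 1.0], size=20)
    X_unlab, X_new = rng.normal(size=(80, 2)), rng.normal(size=(5, 2))
    k = lambda A, B: gaussian_kernel(A, B, bandwidth=1.0)
    X_all, v = fredholm_ridge_fit(X_lab, y_lab, X_unlab, k, k, beta=1e-3)
    print(fredholm_predict(X_new, X_all, v, k, k))
```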

Adaptive Kernel Prediction Framework

Fredholm Integral on Source Domain. Suppose an outside kernel k^s(x^s, z^s) is given for the source domain, where x^s and z^s follow the marginal distribution P_s(X). As stated in the related work, we define the Fredholm integral K_{P_s} as K_{P_s} f(x^s) = \int_{z^s} k^s(x^s, z^s) f(z^s) P_s(z^s) dz^s. Following the law of large numbers, K_{P_s} f(x^s) can be approximated with the labeled data points x^s_1, ..., x^s_{n_s} as:

K_{P_s} f(x^s) = \frac{1}{n_s} \sum_{i=1}^{n_s} k^s(x^s, x^s_i) f(x^s_i).   (2)

Fredholm Integral on Target Domain. A kernel k^t(x^t, z^t) is given for the target domain, where x^t and z^t follow P_t(X). K_{P_t} is defined as K_{P_t} f(x^t) = \int_{z^t} k^t(x^t, z^t) f(z^t) P_t(z^t) dz^t. Associated with the labeled and the unlabeled data points x^t_1, ..., x^t_{n_l+n_u}, K_{P_t} f(x^t) can be approximated as:

K_{P_t} f(x^t) = \frac{1}{n_l + n_u} \sum_{i=1}^{n_l+n_u} k^t(x^t, x^t_i) f(x^t_i).   (3)

Minimizing Prediction Error. To eliminate the assumption of traditional kernel prediction that both source and target data follow an identical distribution, we have constructed two Fredholm integrals on the two domains respectively. Furthermore, we adjust the relative importance of each domain by parameters λ_s and λ_t, and then estimate f by minimizing the combined prediction error:

f = \arg\min_{f \in H} \ell(f, K, D_s, D_t) = \arg\min_{f \in H} \Big\{ \beta \|f\|_H^2 + \frac{\lambda_s}{n_s} \sum_{i=1}^{n_s} L(K_{P_s} f(x^s_i), y^s_i) + \frac{\lambda_t}{n_l} \sum_{i=1}^{n_l} L(K_{P_t} f(x^t_i), y^t_i) \Big\},   (4)

where H is some RKHS defined by a kernel K. Eq. (4) intuitively accounts for the gap between the two domains, and hence naturally boosts the performance on the target domain.

Reducing the Mismatch of Data Distributions

In this section, we explicitly minimize the distribution difference between domains. Following (Pan et al. 2011), the empirical Maximum Mean Discrepancy (MMD) (Borgwardt et al. 2006) is adopted as the measure for comparing marginal distributions. Specifically, given the source data D_s and the target data D_t, the distance between the data distributions of the two domains in the RKHS H induced by the kernel K can be estimated as the distance between the empirical data means:

d_K(D_s, D_t) = \Big\| \frac{1}{n_s} \sum_{i=1}^{n_s} \phi(x^s_i) - \frac{1}{n_u + n_l} \sum_{i=1}^{n_u+n_l} \phi(x^t_i) \Big\|_H^2 = tr(\Phi^T \Phi S) = tr(KS),   (5)

where \phi : R^d \to H with K(x_i, x_j) = \phi(x_i)^T \phi(x_j), \Phi = \{\phi(x^s_1), ..., \phi(x^s_{n_s}), \phi(x^t_1), ..., \phi(x^t_{n_l+n_u})\}, and S(i, j) = 1/n_s^2 if x_i, x_j \in D_s; S(i, j) = 1/(n_l + n_u)^2 if x_i, x_j \in D_t; S(i, j) = -1/(n_s(n_l + n_u)) otherwise. Based on the optimal K obtained by minimizing Eq. (5), the two distributions are drawn close in the induced H.

Transfer Fredholm Multiple Kernel Learning

Semi-Supervised Learning Framework. To achieve noise resiliency and assist the information transfer, TFMKL formulates the predictive model f and the kernel K in a unified framework. Taking advantage of unlabeled samples, the optimal f and K are found by simultaneously matching the distributions and minimizing the prediction errors on both source and target data. Based on Eq. (4) and Eq. (5), the learning framework is concisely written as:

[f, K] = \arg\min_{f, K} \ell(f, K, D_s, D_t) + \theta \, \Omega(d_K(D_s, D_t)),   (6)

where \Omega(\cdot) is a monotonically increasing function and \theta is the tradeoff parameter balancing the mismatch and the errors.

Multiple Kernel Learning. Instead of predefining the kernel K in Eq. (6) as in the standard kernel prediction framework, our framework explicitly explores the optimal K for domain adaptation. However, directly learning a non-parametric kernel matrix by solving a semidefinite program (Boyd and Vandenberghe 2004) is computationally prohibitive with O(n^{6.5}) complexity. As an efficient solution to this dilemma, MKL considers the learnt kernel as a convex combination of given (base) kernels K_m: K(x_i, x_j) = \sum_{m=1}^{M} d_m K_m(x_i, x_j), where d_m \ge 0 and \sum_{m=1}^{M} d_m = 1. Consequently, the problem of kernel learning is translated into the choice of optimal weights d_m, and thus the \Omega(d_K(D_s, D_t)) in Eq. (6) can be rewritten as:

\Omega(d_K(D_s, D_t)) = \Big( tr\Big(\Big(\sum_{m=1}^{M} d_m K_m\Big) S\Big) \Big)^2 = d^T p p^T d,   (7)

where d = [d_1, ..., d_M]^T and p = [p_1, ..., p_M]^T with p_m = tr(K_m S). Due to its advantages in exploring prior knowledge, describing data characteristics and enhancing interpretability, the adoption of MKL provides improved performance and generalization for our proposed algorithm.

Discussion: In contrast to previous semi-supervised domain adaptation methods, the proposed framework provides a natural way of incorporating unlabeled target data in two parts: 1) learning the adaptive predictive model f; 2) minimizing the distribution difference. The first part benefits robustness and noise resilience for domain adaptation. Different from the "cluster assumption" (Chapelle, Weston, and Schölkopf 2003) or the "manifold assumption" (Belkin, Niyogi, and Sindhwani 2006), the Fredholm integral is discussed under the "noise assumption" (Que, Belkin, and Wan 2014): in the neighborhood of every sample, the directions with low variance are uninformative with respect to class labels and can be regarded as noise. Based on this assumption, the Fredholm integral is proven to have noise-suppression power. Specifically, when samples are polluted by noise, it provides a more accurate estimate of the true data, and the resulting f will be closer to the true optimum. The second part facilitates the knowledge transfer from D_s to D_t, thus f trained in H will make high-confidence predictions on D^u_t. Note that our domain adaptation setting is different from multi-task learning, which tries to learn both the target and the source tasks (Evgeniou, Micchelli, and Pontil 2005).
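To make the mismatch term concrete, the sketch below (an illustrative NumPy sketch rather than the authors' code) builds the coefficient matrix S of Eq. (5) and evaluates Ω(d_K(D_s, D_t)) of Eq. (7) for given base kernel matrices and weights; all function names are assumptions.

```python
import numpy as np

def mmd_coefficient_matrix(n_s, n_t):
    """Matrix S of Eq. (5): tr(K S) equals the squared distance between the
    empirical kernel means of the n_s source and n_t = n_l + n_u target points."""
    S_ss = np.full((n_s, n_s), 1.0 / n_s ** 2)
    S_tt = np.full((n_t, n_t), 1.0 / n_t ** 2)
    S_st = np.full((n_s, n_t), -1.0 / (n_s * n_t))
    return np.block([[S_ss, S_st], [S_st.T, S_tt]])

def mkl_mismatch(base_kernels, d, S):
    """Omega(d_K) = (tr((sum_m d_m K_m) S))^2 = d^T p p^T d, as in Eq. (7).
    Returns the mismatch value and the vector p with p_m = tr(K_m S)."""
    p = np.array([np.trace(K_m @ S) for K_m in base_kernels])
    return float((d @ p) ** 2), p

# toy usage with two random PSD base kernels over 4 source + 6 target points
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    rand_psd = lambda n: (lambda B: B @ B.T)(rng.normal(size=(n, 3)))
    S = mmd_coefficient_matrix(4, 6)
    base_kernels = [rand_psd(10), rand_psd(10)]
    d = np.array([0.3, 0.7])
    print(mkl_mismatch(base_kernels, d, S))
```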

Implementation and Optimization

TFMKL Using Square Loss. For a concrete implementation, the square loss is applied in Eq. (6), but it is emphasized that other loss functions can also be employed, e.g., the hinge loss and the logistic loss. Substituting Eq. (7) into Eq. (6), the square-loss based formulation is:

\min_{f, d} \Big\{ \beta \|f\|_H^2 + \frac{\lambda_s}{n_s} \sum_{i=1}^{n_s} (K_{P_s} f(x^s_i) - y^s_i)^2 + \frac{\lambda_t}{n_l} \sum_{i=1}^{n_l} (K_{P_t} f(x^t_i) - y^t_i)^2 + \theta \, d^T p p^T d \Big\}   (8)

s.t.  d_m \ge 0,  \sum_{m=1}^{M} d_m = 1.

One standard approach to optimizing TFMKL is to alternately update f and d; however, this lacks convergence guarantees and may lead to numerical problems. We therefore propose a different solution to this challenge. Generally, the cost function in Eq. (8) can be rewritten as:

\min_{d} J(d)  with  d_m \ge 0, \sum_{m=1}^{M} d_m = 1,   (9)

where

J(d) = \min_{f \in H} \Big\{ \beta \|f\|_H^2 + \frac{\lambda_s}{n_s} \sum_{i=1}^{n_s} (K_{P_s} f(x^s_i) - y^s_i)^2 + \frac{\lambda_t}{n_l} \sum_{i=1}^{n_l} (K_{P_t} f(x^t_i) - y^t_i)^2 + \theta \, d^T p p^T d \Big\}.   (10)

In the following paragraphs, we first prove the differentiability of J(·), which is at the core of the optimization. Problem (9) can then be optimized by a reduced gradient method, ensuring convergence to a local minimum. It should be noticed that this procedure holds for TFMKL with other loss functions as well, as long as they yield a differentiable J(·). For instance, the procedure is applicable to TFMKL with the hinge loss, because we can prove that it has an SVM-like formulation and provides a differentiable J(·) in a way similar to the proof for TFMKL with the square loss.

The Differentiability of J(·). Although Eq. (10) is an optimization problem in a potentially infinite dimensional space H, the following proposition reduces Eq. (10) to a finite dimensional problem.

Proposition 1. The solution of Eq. (10) has the form

f^*(x) = \sum_{i=1}^{n_s+n_l+n_u} \sum_{m=1}^{M} d_m K_m(x, x_i) v_i,   (11)

for some v \in R^{n_s+n_l+n_u}.

The proof is similar to that of the Representer Theorem (Schölkopf and Smola 2001). Given K_{P_s} in Eq. (2) and K_{P_t} in Eq. (3), Eq. (10) is translated into a quadratic optimization with the following closed-form solution v^* based on Proposition 1:

v^* = \Big( \sum_{m=1}^{M} d_m \bar{K}^T \tilde{K} K_m + \beta I \Big)^{-1} \tilde{K}^T y,   (12)

where y = [y^s_1, ..., y^s_{n_s}, y^t_1, ..., y^t_{n_l}]^T and

\bar{K} = \begin{bmatrix} \frac{K^s}{n_s} & 0 \\ 0 & \frac{K^t}{n_l+n_u} \end{bmatrix}, \qquad \tilde{K} = \begin{bmatrix} \frac{\lambda_s K^s}{n_s^2} & 0 \\ 0 & \frac{\lambda_t K^t}{(n_l+n_u)\, n_l} \end{bmatrix}.

Note that (K^s)_{i,j} = k^s(x^s_i, x^s_j) for 1 \le i \le n_s, 1 \le j \le n_s, and (K^t)_{i,j} = k^t(x^t_i, x^t_j) for 1 \le i \le n_l, 1 \le j \le n_l + n_u.

Using Eq. (11), K_{P_s} f^*(x^s_i) and K_{P_t} f^*(x^t_i) can be vectorized as K_{P_s} f^*(x^s_i) = \sum_{m=1}^{M} d_m k^s_i K_m v^* and K_{P_t} f^*(x^t_i) = \sum_{m=1}^{M} d_m k^t_i K_m v^*, where k^s_i = \frac{1}{n_s} [(K^s)_{i,1}, ..., (K^s)_{i,n_s}] and k^t_i = \frac{1}{n_l+n_u} [(K^t)_{i,1}, ..., (K^t)_{i,n_l+n_u}]. As the optimal objective value of Eq. (10), the function J(d) is equal to the following expression:

J(d) = \beta \sum_{m=1}^{M} d_m (v^*)^T K_m v^* + \frac{\lambda_s}{n_s} \sum_{i=1}^{n_s} \Big( \sum_{m=1}^{M} d_m k^s_i K_m v^* - y^s_i \Big)^2 + \frac{\lambda_t}{n_l} \sum_{i=1}^{n_l} \Big( \sum_{m=1}^{M} d_m k^t_i K_m v^* - y^t_i \Big)^2 + \theta \, d^T p p^T d.   (13)

Proposition 2. J(·) is differentiable, and \partial J / \partial d_m can be calculated by differentiating Eq. (13) with respect to d_m.

Proof. The closed-form solution of f^* ensures its uniqueness for any admissible value of d. Following Theorem 4.1 in (Bonnans and Shapiro 1998), J(·) can be proven to be differentiable based on the uniqueness of f^*. Furthermore, the theorem enables us to calculate the derivative \partial J / \partial d_m by directly differentiating Eq. (13) with respect to d_m.

The Reduced Gradient Algorithm. The existence and the computation of the gradient of J(·) have now been established. To solve Problem (9), we propose an efficient and effective procedure which performs reduced gradient descent on the differentiable J(·). This procedure converges to a local minimum of J(·) (Luenberger 1984). Specifically, once the gradient is obtained, d is updated in the descent direction D computed using Eq. (12) in (Rakotomamonjy et al. 2008). Meanwhile, D ensures that the constraints {d | \sum_m d_m = 1, d_m > 0} remain satisfied. The procedure is summarized in Algorithm 1 with O(T_max × (n_s + n_l + n_u)^3) complexity, where T_max is the number of iterations. It has been shown that this update scheme leads to rapid convergence (Rakotomamonjy et al. 2008; Wang et al. 2015).

Algorithm 1: Transfer Fredholm Multiple Kernel Learning
Initialization: set {d_m}_{m=1}^M to random admissible values.
1: while not converged do
2:   Compute v^* and f^* with {d_m}_{m=1}^M.
3:   Compute \partial J / \partial d_m and D.
4:   d \leftarrow d + \gamma D, where \gamma is the step size.
5: end while
Output: the final prediction on D^u_t, K_{P_t} f^*(x^t_i) = \frac{1}{n_l+n_u} \sum_{j=1}^{n_l+n_u} k^t(x^t_i, x^t_j) f^*(x^t_j).
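The following NumPy sketch instantiates the square-loss training path described above: the closed-form v^* of Eq. (12), the value and gradient of J(d) in Eq. (13) with v^* held fixed (as Proposition 2 justifies), and a simplified version of Algorithm 1 in which the exact reduced-gradient direction and step-size search of (Rakotomamonjy et al. 2008) are replaced by a crude projected step onto the simplex. It is a sketch under these simplifications, not the released implementation, and all names are illustrative.

```python
import numpy as np

def tfmkl_solve_v(base_kernels, d, Ks, Kt, y, lam_s, lam_t, beta):
    """Closed-form v* of Eq. (12) for fixed kernel weights d (square loss).

    base_kernels: list of (n x n) matrices over all n = n_s + n_l + n_u points;
    Ks: (n_s x n_s) outside-kernel matrix on the source points;
    Kt: (n_l x (n_l + n_u)) outside-kernel matrix on the target points;
    y:  labels of the n_s source and n_l labeled target points.
    """
    n_s = Ks.shape[0]
    n_l, n_lu = Kt.shape
    n = n_s + n_lu
    K_bar = np.zeros((n_s + n_l, n))          # block matrix \bar{K}
    K_til = np.zeros((n_s + n_l, n))          # block matrix \tilde{K}
    K_bar[:n_s, :n_s] = Ks / n_s
    K_bar[n_s:, n_s:] = Kt / n_lu
    K_til[:n_s, :n_s] = lam_s * Ks / n_s ** 2
    K_til[n_s:, n_s:] = lam_t * Kt / (n_lu * n_l)
    K_d = sum(w * K_m for w, K_m in zip(d, base_kernels))
    v = np.linalg.solve(K_bar.T @ K_til @ K_d + beta * np.eye(n), K_til.T @ y)
    return v, K_bar, K_d

def tfmkl_value_grad(base_kernels, d, v, K_bar, K_d, y, n_s, n_l,
                     lam_s, lam_t, beta, theta, p):
    """J(d) of Eq. (13) and its gradient, differentiating with v* held fixed
    as in Proposition 2; p_m = tr(K_m S) as in Eq. (7)."""
    resid = K_bar @ (K_d @ v) - y             # prediction residuals
    w = np.concatenate([np.full(n_s, lam_s / n_s), np.full(n_l, lam_t / n_l)])
    J = beta * v @ (K_d @ v) + np.sum(w * resid ** 2) + theta * (d @ p) ** 2
    grad = np.array([beta * v @ (K_m @ v)
                     + 2.0 * np.sum(w * resid * (K_bar @ (K_m @ v)))
                     + 2.0 * theta * (d @ p) * p_m
                     for K_m, p_m in zip(base_kernels, p)])
    return J, grad

def tfmkl_fit(base_kernels, Ks, Kt, y, S, lam_s, lam_t, beta, theta,
              n_iter=50, step=0.1):
    """Simplified Algorithm 1: alternate the closed-form v* with a projected
    gradient step on the simplex of kernel weights (a crude stand-in for the
    exact SimpleMKL reduced-gradient direction and line search)."""
    M = len(base_kernels)
    d = np.full(M, 1.0 / M)
    p = np.array([np.trace(K_m @ S) for K_m in base_kernels])
    n_s, n_l = Ks.shape[0], Kt.shape[0]
    for _ in range(n_iter):
        v, K_bar, K_d = tfmkl_solve_v(base_kernels, d, Ks, Kt, y,
                                      lam_s, lam_t, beta)
        _, grad = tfmkl_value_grad(base_kernels, d, v, K_bar, K_d, y,
                                   n_s, n_l, lam_s, lam_t, beta, theta, p)
        d = np.clip(d - step * grad, 0.0, None)
        d = d / d.sum() if d.sum() > 0 else np.full(M, 1.0 / M)
    return d, v
```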

Experiments

Experiment Setup

TFMKL is systematically compared with: 1) SVM trained on the union of labeled source and labeled target data (SVM-st); 2) SVM trained on labeled target data (SVM-t); 3) the kernel prediction framework which simply combines (labeled) source data and (labeled and unlabeled) target data in the Fredholm integral operator while ignoring the distribution difference (Fred-st) (Que, Belkin, and Wan 2014); 4) the kernel prediction framework trained on (labeled and unlabeled) target data in the Fredholm integral operator (Fred-t); 5) unsupervised Kernel Mean Matching (KMM) (Huang et al. 2006); 6) unsupervised Geodesic Flow Kernel (GFK) (Gong et al. 2012) + 1NN; 7) supervised Maximum Margin Domain Transform (MMDT) (Hoffman et al. 2014); and 8) semi-supervised Domain Transfer Multiple Kernel Learning (DTMKL-f) (Duan, Tsang, and Xu 2012).

We evaluate all the methods by empirically searching the parameter space, and the best results are reported. For SVM-st, SVM-t, DTMKL-f and KMM, we choose C ∈ {0.001, 0.1, 0.2, 0.5, 1, 2, 5, 10, 20, 50, 100}. For Fred-st, Fred-t and TFMKL, β is searched in the range [10^{-7}, 10^{1}]. For DTMKL-f, we search θ, λ and ζ in the range [10^{-2}, 10^{2}]. The parameters λ_s, λ_t and θ in TFMKL are searched in the range [10^{-1}, 10^{1}]. Across the experiments, TFMKL is found to be robust to these parameters. Base kernels are predetermined as a linear kernel, polynomial kernels with six degrees (i.e., 1.5, 1.6, ..., 2.0) and Gaussian kernels with six bandwidths (i.e., 0.0001, 0.001, 0.01, 0.1, 1, 10); a construction sketch is given below. The comparative methods (except TFMKL and DTMKL-f) are evaluated with each of these 13 base kernels respectively, and the best results are reported.

Synthetic Example

Figure 1: Synthetic example.

We construct a synthetic example to illustrate the noise-suppression power of TFMKL in semi-supervised domain adaptation, shown in Figure 1. The two-dimensional synthetic data violates the cluster assumption, and multivariate Gaussian noise with variance σ = 0.01 is added. TFMKL is compared with the semi-supervised DTMKL-f, as well as the standard SVM-st and Fred-st. For each class, 3 labeled target samples are selected. We conduct 50 random selections, and the average classification results on the remaining unlabeled target data are reported in Table 1. As can be seen, DTMKL-f and TFMKL outperform SVM-st and Fred-st due to their exploration of the distribution inconsistency. Nevertheless, although DTMKL-f also uses unlabeled target points during classifier learning, it gains limited discriminative information from these noisy data. By contrast, TFMKL provides superior robustness and accuracy, demonstrating its effectiveness in noise resiliency and knowledge transfer.

Table 1: Classification accuracy and standard error.
SVM-st: 87.86±0.02   Fred-st: 90.95±0.01   DTMKL-f: 91.07±0.01   TFMKL: 96.02±0.11

Cross-Domain Object Recognition

Data Preparation. Amazon, DSLR, Webcam and Caltech-256 are four benchmark databases widely used for visual domain adaptation evaluation (Gong et al. 2012). The 10 classes common to these four image databases are extracted, yielding 2,533 images in total. Each database is regarded as a domain, and 12 cross-domain image data sets are constructed: A→D, A→W, A→C, D→A, D→W, D→C, ..., C→W. SURF features are extracted and quantized into 800-bin histograms with a codebook trained on a subset of Amazon images.

Results of Visual Object Recognition. Following (Hoffman et al. 2014): the source data contains 20 examples per class randomly selected when Amazon is the source domain (8 per class for the other source domains); the labeled target data contains 3 labeled examples per class from the target domain. A Gaussian kernel with bandwidth 0.001 is employed as the outside kernel in the Fredholm integral. The averaged classification results on the remaining unlabeled target data over 20 random splits are shown in Table 2, and several observations can be drawn.

First, SVM-st (or Fred-st) shows higher accuracies than SVM-t (or Fred-t) on A→C, D→W, D→C, W→D, W→C and C→A. The possible explanation is that, in the following two cases, directly involving the source data may be helpful for recognition in the target domain: a) the two domains have some similarities and the distributions may overlap with each other, e.g., DSLR and Webcam, or Amazon and Caltech-256; b) the target domain contains relatively rich images and has an extensive distribution in the feature space, e.g., Caltech-256. Second, KMM, GFK, MMDT and DTMKL-f generally outperform SVM-st and SVM-t, verifying the advantage of bridging the gap between domains in domain adaptation. Nevertheless, their results are slightly worse than those of Fred-st and Fred-t. The reason is that noise suppression is exactly what is required in these challenging visual recognition tasks, which contain a great number of noisy images, whereas SVM, 1NN and SVR, the customized classifiers of these transfer methods, show their limitation in noise suppression. Third, TFMKL performs impressively better than all other methods on most of the data sets (8 out of 12). The average accuracy of TFMKL on the 12 data sets is 57.0%, a 4.1% improvement over the most competitive baseline, Fred-st. Note that Fred-st yields higher accuracies than TFMKL on D→W and W→D, since these two domains are significantly similar. These results show the effectiveness and the robustness of TFMKL, indicating that it can successfully minimize the distribution mismatch and make confident predictions.
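For concreteness, here is one way the 13 base kernels from the experiment setup above could be constructed. The paper does not give the exact polynomial kernel form or the Gaussian parameterization, so both are stated as assumptions in this sketch.

```python
import numpy as np

def build_base_kernels(X):
    """The 13 base kernels listed in the experiment setup: one linear kernel,
    six polynomial kernels (degrees 1.5, 1.6, ..., 2.0) and six Gaussian
    kernels (bandwidths 1e-4, 1e-3, 1e-2, 0.1, 1, 10).

    Assumptions not specified in the paper: the polynomial kernel is taken as
    (x.z + 1)^degree, which requires non-negative features such as the SURF
    histograms, and the bandwidth s enters as exp(-||x - z||^2 / (2 s^2)).
    """
    lin = X @ X.T
    kernels = [lin]
    for degree in [1.5, 1.6, 1.7, 1.8, 1.9, 2.0]:
        kernels.append((lin + 1.0) ** degree)
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    for s in [1e-4, 1e-3, 1e-2, 0.1, 1.0, 10.0]:
        kernels.append(np.exp(-sq_dists / (2.0 * s ** 2)))
    return kernels
```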

Table 2: Classification accuracy and standard error on the 12 cross-domain object categorization data sets. (In the original, the best recognition rates are shown in red bold and the second best in blue italics. SVM-st, SVM-t, Fred-st and Fred-t are standard learning baselines; DTMKL-f, MMDT, KMM, GFK and TFMKL are transfer learning methods.)

Data Set  SVM-st    SVM-t     Fred-st   Fred-t    DTMKL-f   MMDT      KMM       GFK       TFMKL
A→D       45.6±0.7  55.9±0.8  46.4±0.8  58.2±0.9  46.4±0.7  56.7±1.3  49.0±0.7  50.7±0.8  62.1±0.9
A→W       46.3±0.7  62.4±0.9  50.3±0.9  69.2±1.1  48.6±0.8  64.6±1.2  47.4±0.9  58.6±1.0  69.8±0.9
A→C       39.8±0.3  32.0±0.8  39.0±0.4  35.3±0.9  40.5±0.4  36.4±0.8  40.8±0.3  36.0±0.5  43.6±0.6
D→A       41.6±0.4  45.7±0.9  51.5±0.5  51.7±0.8  42.6±0.4  46.9±1.0  42.6±0.4  45.7±0.6  52.9±0.7
D→W       77.4±0.6  62.1±0.8  82.4±0.3  65.4±0.8  76.1±0.5  74.1±0.8  78.5±0.6  76.5±0.5  77.8±0.6
D→C       35.9±0.4  31.7±0.6  37.7±0.4  35.2±0.8  37.5±0.5  34.1±0.8  36.9±0.4  32.9±0.5  37.9±0.4
W→A       43.4±0.3  45.6±0.7  49.5±0.5  52.3±0.7  45.3±0.4  47.7±0.9  44.4±0.3  44.1±0.4  54.4±0.4
W→D       69.5±0.8  55.1±0.8  73.2±0.6  58.2±0.8  69.9±1.1  67.0±1.1  70.4±0.8  70.5±0.7  67.7±0.9
W→C       36.4±0.4  30.4±0.7  37.5±0.3  34.6±0.9  37.8±0.4  32.2±0.8  37.6±0.4  31.1±0.6  36.1±0.8
C→A       46.9±0.6  45.3±0.9  52.1±0.6  49.9±1.0  49.5±0.9  49.4±0.8  48.0±0.6  44.7±0.8  54.2±0.9
C→D       52.0±1.0  55.8±0.9  55.7±0.9  57.2±1.1  53.1±0.9  56.5±0.9  53.0±1.0  57.7±1.1  59.2±1.1
C→W       53.6±0.9  60.3±1.0  60.6±1.2  62.0±1.1  55.4±1.1  63.8±1.1  54.6±0.9  63.7±0.8  68.2±0.9
Mean      49.0±0.6  48.5±0.8  52.9±0.6  52.3±0.8  50.2±0.7  52.5±1.0  50.3±0.6  51.0±0.7  57.0±0.8


Figure 2: Classification accuracy and standard error: (a) 20-Newsgroups with s = 5; (b) 20-Newsgroups with s = 10.

Cross-Domain Text Classification

Data Preparation. 20-Newsgroups is a benchmark text corpus organized in a hierarchical structure with different categories and subcategories (Pan et al. 2011). Data from different subcategories under the same category is related, making the corpus well-suited for constructing cross-domain data sets. As in (Duan, Tsang, and Xu 2012), the four largest main categories in 20-Newsgroups (i.e., comp, rec, sci and talk) are selected to construct the following six cross-domain data sets: comp vs rec, comp vs sci, comp vs talk, rec vs sci, rec vs talk and sci vs talk.

Results of Text Classification. Following the setup in (Duan, Tsang, and Xu 2012): the source data contains all the labeled samples in the source domain; the labeled target data contains 2s labeled examples (s positive and s negative samples) randomly selected from the target domain; and the unlabeled target data contains the remaining unlabeled examples, used for both training and testing. A linear kernel is used as the outside kernel. The classification results, averaged over 5 random splits with s = 5 or s = 10, are shown in Figure 2, and we have the following observations. (1) The supervised and the semi-supervised transfer methods (MMDT and DTMKL-f) are better than the unsupervised methods (KMM and GFK) in terms of accuracy, verifying the effectiveness of utilizing labeled target data. (2) MMDT and DTMKL-f generally outperform the standard SVM-st, SVM-t, Fred-st and Fred-t on all the data sets except sci vs talk, which validates the necessity of domain adaptation. As can be seen, SVM-t (or Fred-t) achieves higher accuracy than SVM-st (or Fred-st) on sci vs talk. This means that the two domains have significantly varied distributions, which challenges the existing transfer methods. (3) TFMKL performs consistently better than all other methods, especially when s = 5. These results illustrate the superiority of TFMKL in handling multiple difficult cases (e.g., significantly varied distributions and a small amount of labeled target data) for domain adaptation.

Conclusion

In this paper, we have proposed a novel Transfer Fredholm Multiple Kernel Learning (TFMKL) approach that introduces a new paradigm in semi-supervised domain adaptation. Specifically, TFMKL harnesses unlabeled target samples in both the alignment of distributions and the adaptation of predictive models. Such a paradigm is of great importance in tackling challenging cross-domain problems, since it possesses the following advantages: 1) noise resiliency resulting from a good approximation to the "true" data space; 2) the capacity to propagate complex knowledge by minimizing the distribution difference; 3) flexibility due to the exploration of a unified RKHS built from multiple base kernels. For the convergence guarantee, we propose a simple but effective optimization procedure based on the differentiability proof. Comprehensive experimental results validate the advantages of the proposed method.

Acknowledgements

This work is supported by the Natural Science Foundation of China (61303164, 61402447, 61502466, 61672501, 61602453

and 61503365). This work is also sponsored by the Development Plan of Outstanding Young Talent from the Institute of Software, Chinese Academy of Sciences (ISCAS2014-JQ02).

References

Aytar, Y., and Zisserman, A. 2011. Tabula Rasa: Model Transfer for Object Category Detection. In Proceedings of the 24th IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2252-2259.
Bach, F. R.; Lanckriet, G. R. G.; and Jordan, M. I. 2004. Multiple Kernel Learning, Conic Duality, and the SMO Algorithm. In Proceedings of the 21st International Conference on Machine Learning (ICML), 6-13.
Belkin, M.; Niyogi, P.; and Sindhwani, V. 2006. Manifold Regularization: A Geometric Framework for Learning from Labeled and Unlabeled Examples. Journal of Machine Learning Research 7:2399-2434.
Bonnans, J. F., and Shapiro, A. 1998. Optimization Problems with Perturbation: A Guided Tour. SIAM Review 40(2):202-227.
Borgwardt, K. M.; Gretton, A.; Rasch, M. J.; Kriegel, H.-P.; Schölkopf, B.; and Smola, A. J. 2006. Integrating Structured Biological Data by Kernel Maximum Mean Discrepancy. Bioinformatics 22(14):49-57.
Boyd, S., and Vandenberghe, L. 2004. Convex Optimization. Cambridge University Press.
Chapelle, O.; Weston, J.; and Schölkopf, B. 2003. Cluster Kernels for Semi-Supervised Learning. In Proceedings of the 17th Annual Conference on Advances in Neural Information Processing Systems (NIPS), 585-592.
Chen, M.; Weinberger, K. Q.; and Blitzer, J. C. 2011. Co-Training for Domain Adaptation. In Proceedings of the 25th Annual Conference on Advances in Neural Information Processing Systems (NIPS), 2456-2464.
Daumé III, H. 2007. Frustratingly Easy Domain Adaptation. In Proceedings of the 47th Annual Meeting of the Association for Computational Linguistics (ACL), 256-263.
Daumé III, H., and Marcu, D. 2006. Domain Adaptation for Statistical Classifiers. Journal of Artificial Intelligence Research 26:101-126.
Duan, L.; Tsang, I. W.; and Xu, D. 2012. Domain Transfer Multiple Kernel Learning. IEEE Transactions on Pattern Analysis and Machine Intelligence 34(3):465-479.
Evgeniou, T.; Micchelli, C. A.; and Pontil, M. 2005. Learning Multiple Tasks with Kernel Methods. Journal of Machine Learning Research 6:615-637.
Gong, B.; Shi, Y.; Sha, F.; and Grauman, K. 2012. Geodesic Flow Kernel for Unsupervised Domain Adaptation. In Proceedings of the 25th IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2066-2073.
Hoffman, J.; Rodner, E.; Donahue, J.; Kulis, B.; and Saenko, K. 2014. Asymmetric and Category Invariant Feature Transformations for Domain Adaptation. International Journal of Computer Vision 109(1):28-41.
Huang, J.; Smola, A. J.; Gretton, A.; Borgwardt, K. M.; and Schölkopf, B. 2006. Correcting Sample Selection Bias by Unlabeled Data. In Proceedings of the 20th Annual Conference on Advances in Neural Information Processing Systems (NIPS), 601-608.
Li, L.; Jin, X.; and Long, M. 2012. Topic Correlation Analysis for Cross-Domain Text Classification. In Proceedings of the 26th AAAI Conference on Artificial Intelligence (AAAI), 998-1004.
Liao, X.; Xue, Y.; and Carin, L. 2005. Logistic Regression with An Auxiliary Data Source. In Proceedings of the 22nd International Conference on Machine Learning (ICML), 505-512.
Luenberger, D. G. 1984. Linear and Nonlinear Programming. Addison-Wesley.
Pan, S. J., and Yang, Q. 2010. A Survey on Transfer Learning. IEEE Transactions on Knowledge and Data Engineering 22:1345-1359.
Pan, S. J.; Tsang, I. W.; Kwok, J. T.; and Yang, Q. 2011. Domain Adaptation via Transfer Component Analysis. IEEE Transactions on Neural Networks 22(2):199-210.
Que, Q.; Belkin, M.; and Wan, Y. 2014. Learning with Fredholm Kernels. In Proceedings of the 28th Annual Conference on Advances in Neural Information Processing Systems (NIPS), 2951-2959.
Rakotomamonjy, A.; Bach, F. R.; Canu, S.; and Grandvalet, Y. 2008. SimpleMKL. Journal of Machine Learning Research 9:2491-2521.
Rifkin, R.; Yeo, G.; and Poggio, T. 2003. Regularized Least Squares Classification. Advances in Learning Theory: Methods, Model and Applications. NATO Science Series III: Computer and Systems Sciences 190:131-153.
Schölkopf, B., and Smola, A. J. 2001. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press.
Shawe-Taylor, J., and Cristianini, N. 2004. Kernel Methods for Pattern Analysis. Cambridge University Press.
Sun, Q.; Chattopadhyay, R.; Panchanathan, S.; and Ye, J. 2011. A Two-Stage Weighting Framework for Multi-Source Domain Adaptation. In Proceedings of the 25th Annual Conference on Advances in Neural Information Processing Systems (NIPS), 505-513.
Wang, X.; Huang, T.; and Schneider, J. 2014. Active Transfer Learning under Model Shift. In Proceedings of the 31st International Conference on Machine Learning (ICML), 1305-1313.
Wang, W.; Wang, H.; Zhang, C.; and Xu, F. J. 2015. Transfer Feature Representation via Multiple Kernel Learning. In Proceedings of the 28th AAAI Conference on Artificial Intelligence (AAAI), 3073-3079.
Yen, E. H.; Lin, T. W.; Lin, S. D.; Ravikumar, P. K.; and Dhillon, I. S. 2014. Sparse Random Feature Algorithm as Coordinate Descent in Hilbert Space. In Proceedings of the 28th Annual Conference on Advances in Neural Information Processing Systems (NIPS), 2456-2464.
