Fredholm Multiple Kernel Learning for Semi-Supervised Domain Adaptation
Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (AAAI-17)

Fredholm Multiple Kernel Learning for Semi-Supervised Domain Adaptation

Wei Wang,1 Hao Wang,1,2 Chen Zhang,1 Yang Gao1
1. Science and Technology on Integrated Information System Laboratory
2. State Key Laboratory of Computer Science
Institute of Software, Chinese Academy of Sciences, Beijing 100190, China
[email protected]

Copyright © 2017, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Abstract

As a fundamental constituent of machine learning, domain adaptation generalizes a learning model from a source domain to a different (but related) target domain. In this paper, we focus on semi-supervised domain adaptation and explicitly extend the applied range of unlabeled target samples into the combination of distribution alignment and adaptive classifier learning. Specifically, our extension formulates the following aspects in a single optimization: 1) learning a cross-domain predictive model by developing the Fredholm integral based kernel prediction framework; 2) reducing the distribution difference between two domains; 3) exploring multiple kernels to induce an optimal learning space. Correspondingly, such an extension is distinguished by allowing for noise resiliency, facilitating knowledge transfer and analyzing diverse data characteristics. It is emphasized that we prove the differentiability of our formulation and present an effective optimization procedure based on the reduced gradient, guaranteeing rapid convergence. Comprehensive empirical studies verify the effectiveness of the proposed method.

Introduction

Conventional supervised learning aims at generalizing a model inferred from labeled training samples to test samples. For good generalization capability, it is required to collect and label plenty of training samples following the same distribution as the test samples, which, however, is extremely expensive in practical applications. To relieve the contradiction between generalization performance and label cost, domain adaptation (Pan and Yang 2010) has been proposed to transfer knowledge from a relevant but different source domain with sufficient labeled data to the target domain. It gains increased importance in many applied areas of machine learning (Daumé III 2007; Pan et al. 2011), including natural language processing, computer vision and WiFi localization.

Supervised domain adaptation takes advantage of both abundant labeled samples from the source domain and a relatively small number of labeled samples from the target domain. A line of recent work in this supervised setting (Aytar and Zisserman 2011; Daumé III and Marcu 2006; Liao, Xue, and Carin 2005; Hoffman et al. 2014) learns cross-domain classifiers by adapting traditional models, e.g., support vector machines, logistic regression and statistical classifiers, to the target domain. However, these methods do not explicitly explore the distribution inconsistency between the two domains, which essentially limits their capability of knowledge transfer. To this end, many semi-supervised domain adaptation methods have been proposed within the classifier adaptation framework (Chen, Weinberger, and Blitzer 2011; Duan, Tsang, and Xu 2012; Sun et al. 2011; Wang, Huang, and Schneider 2014). Compared with supervised ones, they take unlabeled target samples into account to cope with the inconsistency of data distributions, showing improved classification and generalization capability. Nevertheless, most work in this area incorporates unlabeled target samples only in distribution alignment, not in adaptive classifier learning. Large quantities of unlabeled target samples potentially contain rich information, and incorporating them is desirable for noise resiliency. Moreover, some work in this semi-supervised setting is customized for certain classifiers, and the extension to other predictive models remains unclear.

In conventional supervised learning, kernel methods provide a powerful and unified prediction framework for building non-linear predictive models (Rifkin, Yeo, and Poggio 2003; Shawe-Taylor and Cristianini 2004; Yen et al. 2014). To further incorporate unlabeled samples with labeled ones in model learning, the kernel prediction framework has been developed into the semi-supervised setting by formulating a regularized Fredholm integral equation (Que, Belkin, and Wan 2014). Although this development is proven theoretically and empirically to be effective in noise suppression, its performance heavily depends on the choice of a single predefined kernel function, because its solution is based on the Representer Theorem (Schölkopf and Smola 2001) in the Reproducing Kernel Hilbert Space (RKHS) induced by that kernel. More importantly, these kernel methods are not developed for domain adaptation, and the distribution difference will invalidate the predictive models across two domains.

In this paper, we focus on semi-supervised domain adaptation and extend the applied range of unlabeled target samples from distribution alignment into the whole learning process. This extension further explores the structure information in the target domain, enhancing robustness in complex knowledge propagation. The key challenge is to accomplish this extension through a single optimization combining the following three aspects: 1) learning a predictive model across two domains based on Fredholm integrals, utilizing (labeled) source data and (labeled and unlabeled) target data; 2) reducing the distribution difference between domains; 3) exploring a convex combination of multiple kernels to optimally induce the RKHS. In dealing with the above difficulties, the proposed algorithm, named Transfer Fredholm Multiple Kernel Learning (TFMKL), makes a three-fold contribution. First, compared with traditional semi-supervised domain adaptation, TFMKL learns a predictive model for noise resiliency and improved adaptation power; it is also highly extensible due to its compatibility with many classifiers. Second, from the view of kernel prediction, the challenge of reducing the distribution difference is suitably addressed. Moreover, multiple kernels are optimally combined in TFMKL, allowing for the analysis of useful data characteristics from various aspects and enhancing the interpretability of the predictive model. Third, instead of employing an alternating optimization technique, we prove the differentiability of our formulation and propose a simple but efficient procedure to perform reduced gradient descent, guaranteeing rapid convergence. Experimental results on a synthetic example and two real-world applications verify the effectiveness of TFMKL.

Related Work

Domain Adaptation

In supervised domain adaptation, cross-domain classifiers (Aytar and Zisserman 2011; Hoffman et al. 2014) are learnt by using labeled source samples and a small number of labeled target samples. Meanwhile, some semi-supervised methods (Chen, Weinberger, and Blitzer 2011; Duan, Tsang, and Xu 2012; Sun et al. 2011; Wang, Huang, and Schneider 2014) are proposed by combining the transfer of classifiers with the matching of distributions. This setting encourages domain-invariant characteristics, leading to improved adaptation results for classification. Specifically, (Chen, Weinberger, and Blitzer 2011) minimizes the conditional distribution difference and adapts logistic regression to target data simultaneously. Inspired by Maximum Mean Discrepancy (Borgwardt et al. 2006), the methods in (Sun et al. 2011; Wang, Huang, and Schneider 2014) match

Turning to the kernel prediction framework: in the supervised setting, kernel methods learn the predictive function $f$ from $n$ labeled training samples $\{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$ by solving the following optimization problem over $\mathcal{H}$:

$$f^{*} = \arg\min_{f\in\mathcal{H}} \frac{1}{n}\sum_{i=1}^{n} L\bigl(f(\mathbf{x}_i), y_i\bigr) + \beta\,\|f\|_{\mathcal{H}}^{2}, \qquad (1)$$

where $\beta$ is a tradeoff parameter and $L(z, y)$ is a risk function, e.g., the square loss $(z - y)^{2}$ or the hinge loss $\max(1 - zy, 0)$. In the semi-supervised setting, suppose the labeled and the unlabeled training data are drawn from the distribution $P(X, Y)$ and the marginal distribution $P(X)$, respectively. Associated with an outside kernel $k_P(\mathbf{x}, \mathbf{z})$, the regularized Fredholm integral $K_P$ of $f(\mathbf{x})$ (Que, Belkin, and Wan 2014) is introduced to exploit unlabeled data: $K_P f(\mathbf{x}) = \int_{\mathbf{z}} k_P(\mathbf{x}, \mathbf{z})\, f(\mathbf{z})\, P(\mathbf{z})\, d\mathbf{z}$. In these terms, Eq. (1) can be rewritten as $f^{*} = \arg\min_{f\in\mathcal{H}} \frac{1}{n}\sum_{i=1}^{n} L\bigl(K_P f(\mathbf{x}_i), y_i\bigr) + \beta\,\|f\|_{\mathcal{H}}^{2}$. It has been shown that the performance of supervised algorithms can be degraded under the “noise assumption”. By contrast, the $f$ obtained from the Fredholm integral is resilient to noise and closer to the optimum, because the integral provides a good approximation to the “true” data space. However, this framework is effective only when labeled and unlabeled data follow the same distribution. In addition, it relies on a single kernel $k_P$, and the choice of $k_P$ heavily influences the performance.
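To make Eq. (1) and its Fredholm-integral variant concrete, the sketch below instantiates the square-loss case: the integral $K_P f(\mathbf{x})$ is approximated by an empirical average over unlabeled samples, $f$ is expanded over those samples, and the resulting ridge-style linear system is solved in closed form. This is a minimal illustration under assumed Gaussian kernels and toy data; the function names and the closed-form solve are our own simplifications, not the TFMKL formulation or a faithful reproduction of (Que, Belkin, and Wan 2014).

```python
import numpy as np

def gaussian_kernel(A, B, sigma):
    """Pairwise Gaussian kernel matrix between the rows of A and the rows of B."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * sigma**2))

def fit_fredholm_square_loss(X, y, U, beta=1e-2, sigma_p=1.0, sigma_h=1.0):
    """Square-loss instance of Eq. (1) with K_P f(x) replaced by the empirical
    average (1/m) * sum_j k_P(x, u_j) f(u_j) over unlabeled points U."""
    n, m = X.shape[0], U.shape[0]
    A = gaussian_kernel(X, U, sigma_p) / m      # discretized Fredholm operator at labeled points
    K_h = gaussian_kernel(U, U, sigma_h)        # kernel of the hypothesis space H
    # Expand f(.) = sum_j alpha_j k_H(u_j, .) and solve the regularized normal equations:
    # ((1/n) A^T A K_h + beta I) alpha = (1/n) A^T y
    lhs = (A.T @ A) @ K_h / n + beta * np.eye(m)
    rhs = A.T @ y / n
    alpha = np.linalg.solve(lhs, rhs)
    return alpha, K_h

def predict_fredholm(X_new, U, alpha, K_h, sigma_p=1.0):
    """Apply the discretized Fredholm operator to f at new inputs."""
    A_new = gaussian_kernel(X_new, U, sigma_p) / U.shape[0]
    return A_new @ (K_h @ alpha)

# Toy usage: noisy 1-D regression with additional unlabeled points drawn from P(X).
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, (30, 1))
y = np.sin(X[:, 0]) + 0.3 * rng.standard_normal(30)
U = rng.uniform(-3, 3, (200, 1))
alpha, K_h = fit_fredholm_square_loss(X, y, U)
print(predict_fredholm(np.array([[0.5]]), U, alpha, K_h))
```

Averaging $k_P$ over the unlabeled cloud is what yields the noise-suppression behaviour discussed above: the prediction at a labeled point is smoothed over its neighbourhood in the (approximated) data distribution rather than depending on that point alone.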
Method

In this section, the problem addressed by TFMKL is first defined. Second, we improve the kernel prediction framework and extend it to semi-supervised domain adaptation. Third, the distribution difference is minimized, which is a crucial element to gain more support from the source to the target domain. Fourth, the above two goals are unified in TFMKL. Furthermore, Multiple Kernel Learning (MKL) (Bach, Lanckriet, and Jordan 2004) is exploited within the framework to improve flexibility. Finally, we propose a reduced gradient descent procedure to update the kernel function and the predictive model simultaneously, explicitly enabling rapid convergence.
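As a rough illustration of two ingredients named in this overview, reducing the distribution difference and combining base kernels convexly, the snippet below evaluates the biased empirical Maximum Mean Discrepancy (Borgwardt et al. 2006) between source and target samples under a fixed convex combination of Gaussian kernels. The kernel widths, weights, and toy data are placeholder assumptions; in TFMKL the combination weights are learned (via the reduced gradient procedure mentioned above) rather than fixed.

```python
import numpy as np

def gaussian_kernel(A, B, sigma):
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * sigma**2))

def combined_kernel(A, B, sigmas, d):
    """Convex combination k = sum_m d_m k_m of Gaussian base kernels (MKL-style)."""
    d = np.asarray(d, dtype=float)
    d = d / d.sum()                              # enforce d_m >= 0, sum_m d_m = 1
    return sum(w * gaussian_kernel(A, B, s) for w, s in zip(d, sigmas))

def mmd2(Xs, Xt, sigmas, d):
    """Biased empirical MMD^2 between source Xs and target Xt under the combined kernel."""
    K_ss = combined_kernel(Xs, Xs, sigmas, d)
    K_tt = combined_kernel(Xt, Xt, sigmas, d)
    K_st = combined_kernel(Xs, Xt, sigmas, d)
    return K_ss.mean() + K_tt.mean() - 2.0 * K_st.mean()

# Toy usage: two slightly shifted Gaussian clouds standing in for source/target data.
rng = np.random.default_rng(1)
Xs = rng.standard_normal((100, 2))
Xt = rng.standard_normal((120, 2)) + 0.5
print(mmd2(Xs, Xt, sigmas=[0.5, 1.0, 2.0], d=[1.0, 1.0, 1.0]))
```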
Notations and Settings

Suppose the data originates from two domains, i.e., source