Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence (IJCAI 2015)

Supervised Representation Learning: Transfer Learning with Deep Autoencoders

Fuzhen Zhuang1, Xiaohu Cheng1,2, Ping Luo1, Sinno Jialin Pan3, Qing He1
1 Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China. {zhuangfz, heq}@ics.ict.ac.cn, [email protected]
2 University of Chinese Academy of Sciences, Beijing, China. [email protected]
3 Nanyang Technological University, Singapore 639798. [email protected]

Abstract

Transfer learning has attracted a lot of attention in the past decade. One crucial research issue in transfer learning is how to find a good representation for instances of different domains such that the divergence between domains can be reduced with the new representation. Recently, deep learning has been proposed to learn more robust or higher-level features for transfer learning. However, to the best of our knowledge, most of the previous approaches neither minimize the difference between domains explicitly nor encode label information in learning the representation. In this paper, we propose a supervised representation learning method based on deep autoencoders for transfer learning. The proposed deep autoencoder consists of two encoding layers: an embedding layer and a label encoding layer. In the embedding layer, the distance in distributions of the embedded instances between the source and target domains is minimized in terms of the KL divergence. In the label encoding layer, label information of the source domain is encoded using a softmax regression model. Extensive experiments conducted on three real-world image datasets demonstrate the effectiveness of our proposed method compared with several state-of-the-art baseline methods.

1 Introduction

Transfer learning focuses on adapting knowledge from an auxiliary source domain to a target domain with little or no label information, in order to build a target prediction model with good generalization performance. In the past decade, a lot of attention has been paid to developing methods that transfer knowledge effectively across domains [Pan and Yang, 2010]. A crucial research issue in transfer learning is how to reduce the difference between the source and target domains while preserving the original data properties. Among the different approaches to transfer learning, feature-based transfer learning methods have proven to be superior in scenarios where the original raw data of the two domains are very different while the divergence between domains can still be reduced. A common objective of feature-based transfer learning methods is to learn a transformation that projects instances from different domains into a common latent space where the difference between the projected instances of the two domains can be reduced [Blitzer et al., 2006; Dai et al., 2007a; Pan et al., 2008; 2011; Zhuang et al., 2014].

Recently, because of its power in learning high-level features, deep learning has been applied to transfer learning [Xavier and Bengio, 2011; Chen et al., 2012; Joey Tianyi Zhou and Yan, 2014]. Xavier and Bengio [2011] proposed to learn robust features with stacked denoising autoencoders (SDA) [Vincent et al., 2010] on the union of the data of a number of domains. The learned new features are considered as high-level features and are used to represent both the source- and target-domain data. Standard classifiers, e.g., support vector machines (SVMs), are then trained on the source domain with the new representations and make predictions on the target-domain data with the same representations. Chen et al. [2012] extended SDA and proposed the marginalized SDA (mSDA) for transfer learning. mSDA addresses two limitations of SDA: its high computational cost and its lack of scalability to high-dimensional features. More recently, Joey Tianyi Zhou and Yan [2014] proposed a deep learning approach to heterogeneous transfer learning based on an extension of mSDA, where instances in the source and target domains are represented by heterogeneous features. In their method, the bridge between the source and target domains with heterogeneous features is built from correspondence information between source and target instances, which is assumed to be given in advance.

Though the goal of previous deep-learning-based methods for transfer learning is to learn a more powerful feature representation that reduces the difference between domains, most of them do not explicitly minimize the distance between domains when learning the representation. Therefore, the reduction in the difference between domains is not guaranteed with the learned feature representation. Furthermore, most previous methods are unsupervised and thus fail to encode discriminative information into the representation learning.

In this paper, we propose a supervised representation learning method for transfer learning based on deep autoencoders. Specifically, the proposed method, named Transfer Learning with Deep Autoencoders (TLDA), is shown in Figure 1.

[Figure 1: The framework of TLDA. The source and target domains share the same encoding and decoding weights.]

Table 1: The notation and denotation
 $D_s$, $D_t$                         The source and target domains
 $n_s$                                The number of instances in the source domain
 $n_t$                                The number of instances in the target domain
 $m$                                  The number of original features
 $k$                                  The number of nodes in the embedding layer
 $c$                                  The number of nodes in the label layer
 $x_i^{(s)}$, $x_i^{(t)}$             The $i$-th instances of the source and target domains
 $\hat{x}_i^{(s)}$, $\hat{x}_i^{(t)}$ The reconstructions of $x_i^{(s)}$ and $x_i^{(t)}$
 $y_i^{(s)}$                          The label of instance $x_i^{(s)}$
 $\xi_i^{(s)}$, $\xi_i^{(t)}$         The hidden representations of $x_i^{(s)}$ and $x_i^{(t)}$
 $\hat{\xi}_i^{(s)}$, $\hat{\xi}_i^{(t)}$  The reconstructions of $\xi_i^{(s)}$ and $\xi_i^{(t)}$
 $z_i^{(s)}$, $z_i^{(t)}$             The hidden representations of $\xi_i^{(s)}$ and $\xi_i^{(t)}$
 $W_i$, $b_i$                         The encoding weight matrix and bias vector for layer $i$
 $W_i'$, $b_i'$                       The decoding weight matrix and bias vector for layer $i$
 $\top$                               The transposition of a matrix
 $\circ$                              The element-wise (Hadamard) product of vectors or matrices

In TLDA, there are two encoding layers and two decoding layers, where the encoding and decoding weights are shared by both the source and target domains. The first encoding layer is referred to as the embedding layer, where the distributions of the source- and target-domain data are enforced to be similar by minimizing the KL divergence [Kullback, 1987] of the embedded instances between domains. The second encoding layer is referred to as the label encoding layer, where the source-domain label information is encoded using a softmax regression model [Friedman and Rob, 2010], which can naturally handle multiple classes. Note that, in the second encoding layer, the encoding weights are also used for the final classification model. In summary, there are three key features in our proposed TLDA:
1. The encoding and decoding weights are shared across different domains for knowledge transfer.
2. The distributions of the two domains are enforced to be similar in the embedding space.
3. The label information is encoded.

2 Preliminary Knowledge

In this section, we first review some preliminary knowledge that is used in our proposed framework. Note that frequently used notations are listed in Table 1, and unless otherwise specified, all vectors are column vectors.

2.1 Autoencoders

The basic framework of an autoencoder [Bengio, 2009] is a feed-forward neural network with an input layer, an output layer, and one or more hidden layers between them. An autoencoder usually includes an encoding process and a decoding process. Given an input $x$, the autoencoder first encodes it to one or more hidden layers through several encoding processes, and then decodes the hidden layers to obtain an output $\hat{x}$. The autoencoder tries to minimize the deviation of $\hat{x}$ from the input $x$, and the process of an autoencoder with one hidden layer can be summarized as:

  Encoding:  $\xi = f(W_1 x + b_1)$   (1)
  Decoding:  $\hat{x} = f(W_1' \xi + b_1')$   (2)

where $f$ is a nonlinear activation function (the sigmoid function is adopted in this paper), $W_1 \in \mathbb{R}^{k \times m}$ and $W_1' \in \mathbb{R}^{m \times k}$ are weight matrices, $b_1 \in \mathbb{R}^{k \times 1}$ and $b_1' \in \mathbb{R}^{m \times 1}$ are bias vectors, and $\xi \in \mathbb{R}^{k \times 1}$ is the output of the hidden layer. Given a set of inputs $\{x_i\}_{i=1}^{n}$, the reconstruction error can be computed as $\sum_{i=1}^{n} \|\hat{x}_i - x_i\|^2$. The goal of the autoencoder is to learn the weight matrices $W_1$ and $W_1'$ and the bias vectors $b_1$ and $b_1'$ by minimizing the reconstruction error:

  $\min_{W_1, b_1, W_1', b_1'} \sum_{i=1}^{n} \|\hat{x}_i - x_i\|^2$   (3)
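To make Eqs. (1)-(3) concrete, here is a minimal NumPy sketch of a one-hidden-layer autoencoder with a sigmoid activation. The function and variable names (autoencode, reconstruction_error, W1p standing for $W_1'$, etc.) are ours, not part of the paper.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def autoencode(X, W1, b1, W1p, b1p):
    """One-hidden-layer autoencoder of Eqs. (1)-(2); columns of X are inputs."""
    xi   = sigmoid(W1 @ X + b1)      # encoding, Eq. (1)
    xhat = sigmoid(W1p @ xi + b1p)   # decoding, Eq. (2)
    return xi, xhat

def reconstruction_error(X, W1, b1, W1p, b1p):
    """Objective of Eq. (3) over a set of column-vector inputs X (m x n)."""
    _, Xhat = autoencode(X, W1, b1, W1p, b1p)
    return float(np.sum((Xhat - X) ** 2))

# toy usage: m = 8 input features, k = 3 hidden units, n = 10 instances
rng = np.random.default_rng(0)
m, k, n = 8, 3, 10
X = rng.random((m, n))
W1,  b1  = rng.normal(0, 0.1, (k, m)), np.zeros((k, 1))
W1p, b1p = rng.normal(0, 0.1, (m, k)), np.zeros((m, 1))
print(reconstruction_error(X, W1, b1, W1p, b1p))
```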

2.2 Softmax Regression

The softmax regression model [Friedman and Rob, 2010] is a generalization of the logistic regression model to multi-class classification problems, where the class label $y$ can take more than two values, i.e., $y \in \{1, 2, \ldots, c\}$, with $c \ge 2$ the number of class labels. For a test instance $x$, we can estimate the probability of each class that $x$ belongs to as follows,

  $h_\theta(x) = \begin{bmatrix} p(y=1 \mid x;\theta) \\ p(y=2 \mid x;\theta) \\ \vdots \\ p(y=c \mid x;\theta) \end{bmatrix} = \frac{1}{\sum_{j=1}^{c} e^{\theta_j^\top x}} \begin{bmatrix} e^{\theta_1^\top x} \\ e^{\theta_2^\top x} \\ \vdots \\ e^{\theta_c^\top x} \end{bmatrix}$   (4)

where $\sum_{j=1}^{c} e^{\theta_j^\top x}$ is a normalization term, and $\theta_1, \ldots, \theta_c$ are the model parameters.

Given the training set $\{x_i, y_i\}_{i=1}^{n}$ with $y_i \in \{1, 2, \ldots, c\}$, the solution of softmax regression can be derived by minimizing the following optimization problem,

  $\min_\theta \; -\frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{c} 1\{y_i = j\} \log \frac{e^{\theta_j^\top x_i}}{\sum_{l=1}^{c} e^{\theta_l^\top x_i}}$,   (5)

where $1\{\cdot\}$ is an indicator function whose value is 1 if the expression is true and 0 otherwise. Once the model is trained, one can compute the probability of an instance $x$ belonging to label $j$ using Eq. (4), and assign its class label as

  $y = \arg\max_j \frac{e^{\theta_j^\top x}}{\sum_{l=1}^{c} e^{\theta_l^\top x}}$.   (6)

2.3 Kullback-Leibler Divergence

The Kullback-Leibler (KL) divergence [Kullback, 1987], also known as the relative entropy, is a non-symmetric measure of the divergence between two probability distributions. Given two probability distributions $P \in \mathbb{R}^{k \times 1}$ and $Q \in \mathbb{R}^{k \times 1}$, the KL divergence of $Q$ from $P$ is the information lost when $Q$ is used to approximate $P$ [Liddle et al., 2010], defined as $D_{KL}(P\|Q) = \sum_{i=1}^{k} P(i) \ln\frac{P(i)}{Q(i)}$. In this paper, we adopt the symmetrized version of the KL divergence, $KL(P, Q) = D_{KL}(P\|Q) + D_{KL}(Q\|P)$, to measure the divergence for classification problems. A smaller KL divergence indicates that the two distributions are more similar. Thus, we use the KL divergence to measure the difference between two data domains when they are embedded into the same latent space.
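The following small NumPy helpers sketch Eqs. (4)-(6) and the symmetrized KL divergence of Section 2.3. The names are ours, and class labels are assumed to be encoded as {0, ..., c-1} rather than {1, ..., c}.

```python
import numpy as np

def softmax_probs(theta, x):
    """h_theta(x) of Eq. (4): class probabilities for one instance x.
    theta has one row per class (c x m); x is an m-vector."""
    logits = theta @ x
    logits = logits - logits.max()        # shift for numerical stability
    e = np.exp(logits)
    return e / e.sum()

def softmax_nll(theta, X, y):
    """Objective of Eq. (5): averaged negative log-likelihood over (X, y)."""
    return -np.mean([np.log(softmax_probs(theta, x)[label])
                     for x, label in zip(X, y)])

def predict(theta, x):
    """Eq. (6): pick the class with the largest probability."""
    return int(np.argmax(softmax_probs(theta, x)))

def sym_kl(P, Q):
    """Symmetrized KL divergence of Section 2.3 for probability vectors P, Q."""
    return float(np.sum(P * np.log(P / Q)) + np.sum(Q * np.log(Q / P)))
```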
3 Transfer Learning with Deep Autoencoders

3.1 Problem Formalization

We are given two domains $D_s$ and $D_t$, where $D_s = \{x_i^{(s)}, y_i^{(s)}\}_{i=1}^{n_s}$ is the source-domain labeled data with $x_i^{(s)} \in \mathbb{R}^{m \times 1}$ and $y_i^{(s)} \in \{1, \ldots, c\}$, and $D_t = \{x_i^{(t)}\}_{i=1}^{n_t}$ is the target-domain unlabeled data. Here, $n_s$ and $n_t$ are the numbers of instances in $D_s$ and $D_t$, respectively.

As shown in Figure 1, there are three factors to be taken into consideration for representation learning. Therefore, the objective to be minimized in our proposed learning framework for transfer learning can be formalized as follows,

  $J = J_r(x, \hat{x}) + \alpha\, \Gamma(\xi^{(s)}, \xi^{(t)}) + \beta\, L(\theta, \xi^{(s)}) + \gamma\, \Omega(W, b, W', b')$.   (7)

The first term of the objective is the reconstruction error for both the source- and target-domain data, which can be defined as

  $J_r(x, \hat{x}) = \sum_{r \in \{s,t\}} \sum_{i=1}^{n_r} \|x_i^{(r)} - \hat{x}_i^{(r)}\|^2$,   (8)

where

  $\xi_i^{(r)} = f(W_1 x_i^{(r)} + b_1), \quad z_i^{(r)} = f(W_2 \xi_i^{(r)} + b_2)$,   (9)
  $\hat{\xi}_i^{(r)} = f(W_2' z_i^{(r)} + b_2'), \quad \hat{x}_i^{(r)} = f(W_1' \hat{\xi}_i^{(r)} + b_1')$.   (10)

The first hidden layer is called the embedding layer, with an output $\xi \in \mathbb{R}^{k \times 1}$ of $k$ nodes ($k \le m$), a weight matrix $W_1 \in \mathbb{R}^{k \times m}$, and a bias vector $b_1 \in \mathbb{R}^{k \times 1}$. The output of the first layer is the input of the second hidden layer. The second hidden layer is called the label layer, with an output $z \in \mathbb{R}^{c \times 1}$ of $c$ nodes (equal to the number of class labels), a weight matrix $W_2 \in \mathbb{R}^{c \times k}$, and a bias vector $b_2 \in \mathbb{R}^{c \times 1}$. Here, softmax regression is used as a regularization term on the source domain to incorporate label information. In addition, the output of the second layer is used as the prediction result for the target domain. The third hidden layer $\hat{\xi}$ is the reconstruction of the embedding layer, with $\hat{\xi} \in \mathbb{R}^{k \times 1}$ and the corresponding weight matrix and bias vector $W_2' \in \mathbb{R}^{k \times c}$ and $b_2' \in \mathbb{R}^{k \times 1}$. Finally, $\hat{x}$ is the reconstruction of $x$, with $\hat{x} \in \mathbb{R}^{m \times 1}$, $W_1' \in \mathbb{R}^{m \times k}$, and $b_1' \in \mathbb{R}^{m \times 1}$.

The second term in the objective Eq. (7) is the KL divergence of the embedded instances between the source and target domains, which can be written as

  $\Gamma(\xi^{(s)}, \xi^{(t)}) = D_{KL}(P_s \| P_t) + D_{KL}(P_t \| P_s)$,   (11)

where

  $P_s' = \frac{1}{n_s} \sum_{i=1}^{n_s} \xi_i^{(s)}, \quad P_s = \frac{P_s'}{\sum P_s'}$,   (12)
  $P_t' = \frac{1}{n_t} \sum_{i=1}^{n_t} \xi_i^{(t)}, \quad P_t = \frac{P_t'}{\sum P_t'}$.   (13)

The goal of minimizing the KL divergence of the embedded instances between the source and target domains is to ensure that the source and target data distributions are similar in the embedding space.

The third term in the objective Eq. (7) is the loss of softmax regression, which incorporates the label information of the source domain into the embedding space. Specifically, this term can be formalized as follows,

  $L(\theta, \xi^{(s)}) = -\frac{1}{n_s} \sum_{i=1}^{n_s} \sum_{j=1}^{c} 1\{y_i^{(s)} = j\} \log \frac{e^{\theta_j^\top \xi_i^{(s)}}}{\sum_{l=1}^{c} e^{\theta_l^\top \xi_i^{(s)}}}$,

where $\theta_j$ ($j \in \{1, \ldots, c\}$) is the $j$-th row of $W_2$.

Finally, the last term in the objective Eq. (7) is a regularization on the model parameters, which is defined as follows,

  $\Omega(W, b, W', b') = \|W_1\|^2 + \|b_1\|^2 + \|W_2\|^2 + \|b_2\|^2 + \|W_1'\|^2 + \|b_1'\|^2 + \|W_2'\|^2 + \|b_2'\|^2$.

The trade-off parameters $\alpha$, $\beta$, and $\gamma$ are positive constants that balance the effect of the different terms on the overall objective.

3.2 Model Learning

The minimization problem of Eq. (7) with respect to $W_1$, $b_1$, $W_2$, $b_2$, $W_2'$, $b_2'$, $W_1'$, and $b_1'$ is an unconstrained optimization problem. To solve this problem, we adopt gradient descent methods. For succinctness, we first introduce some intermediate variables:

  $A_i^{(r)} = (\hat{x}_i^{(r)} - x_i^{(r)}) \circ \hat{x}_i^{(r)} \circ (1 - \hat{x}_i^{(r)})$,
  $B_i^{(r)} = \hat{\xi}_i^{(r)} \circ (1 - \hat{\xi}_i^{(r)})$,
  $C_i^{(r)} = z_i^{(r)} \circ (1 - z_i^{(r)})$,
  $D_i^{(r)} = \xi_i^{(r)} \circ (1 - \xi_i^{(r)})$.
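As a reading aid, the sketch below computes the value of the objective of Eq. (7) from the definitions in Eqs. (8)-(13), assuming the sigmoid activation and taking the logits of the label term to be $W_2 \xi^{(s)}$ (without the bias), as the definition of $L(\theta, \xi^{(s)})$ suggests. It is a NumPy sketch with our own function and variable names, not the authors' implementation; the gradients derived next in Eqs. (14)-(17) are the derivatives of exactly this quantity.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def softmax(a, axis=0):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def tlda_objective(Xs, ys, Xt, params, alpha, beta, gamma):
    """Value of J in Eq. (7). Rows of Xs/Xt are instances; ys holds
    source labels in {0, ..., c-1}. params = (W1, b1, W2, b2, W2p, b2p, W1p, b1p)."""
    W1, b1, W2, b2, W2p, b2p, W1p, b1p = params
    J_r, Xi = {}, {}
    for name, X in (("s", Xs), ("t", Xt)):
        xi   = sigmoid(W1 @ X.T + b1)          # embedding layer,      Eq. (9)
        z    = sigmoid(W2 @ xi + b2)           # label layer,          Eq. (9)
        xih  = sigmoid(W2p @ z + b2p)          # reconstruction of xi, Eq. (10)
        xhat = sigmoid(W1p @ xih + b1p)        # reconstruction of x,  Eq. (10)
        J_r[name] = np.sum((xhat - X.T) ** 2)  # Eq. (8)
        Xi[name]  = xi

    # symmetrized KL between normalized mean embeddings, Eqs. (11)-(13)
    Ps = Xi["s"].mean(axis=1); Ps = Ps / Ps.sum()
    Pt = Xi["t"].mean(axis=1); Pt = Pt / Pt.sum()
    Gamma = np.sum(Ps * np.log(Ps / Pt)) + np.sum(Pt * np.log(Pt / Ps))

    # softmax (label) loss on the source embedding; theta_j = j-th row of W2
    probs = softmax(W2 @ Xi["s"], axis=0)
    L = -np.mean(np.log(probs[ys, np.arange(Xs.shape[0])]))

    # regularizer Omega over all weights and biases
    Omega = sum(np.sum(p ** 2) for p in params)

    return J_r["s"] + J_r["t"] + alpha * Gamma + beta * L + gamma * Omega
```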

The partial derivatives of the objective Eq. (7) w.r.t. $W_1$, $b_1$, $W_2$, $b_2$, $W_2'$, $b_2'$, $W_1'$, and $b_1'$ can be computed as follows, respectively,

  $\frac{\partial J}{\partial W_1} = \sum_{i=1}^{n_s} 2 W_1'^\top A_i^{(s)} \circ \big(W_2^\top (W_2'^\top B_i^{(s)} \circ C_i^{(s)})\big) \circ D_i^{(s)} x_i^{(s)\top} + \sum_{i=1}^{n_t} 2 W_1'^\top A_i^{(t)} \circ \big(W_2^\top (W_2'^\top B_i^{(t)} \circ C_i^{(t)})\big) \circ D_i^{(t)} x_i^{(t)\top}$
  $\quad + \frac{\alpha}{n_s} \sum_{i=1}^{n_s} D_i^{(s)} \circ \Big(1 - \frac{P_t}{P_s} + \ln\frac{P_s}{P_t}\Big) x_i^{(s)\top} + \frac{\alpha}{n_t} \sum_{i=1}^{n_t} D_i^{(t)} \circ \Big(1 - \frac{P_s}{P_t} + \ln\frac{P_t}{P_s}\Big) x_i^{(t)\top} + 2\gamma W_1$
  $\quad - \frac{\beta}{n_s} \sum_{i=1}^{n_s} \sum_{j=1}^{c} 1\{y_i^{(s)} = j\} \Big(W_{2j}^\top - \frac{W_2^\top e^{W_2 \xi_i^{(s)}}}{\sum_l e^{W_{2l} \xi_i^{(s)}}}\Big) \circ D_i^{(s)} x_i^{(s)\top}$,   (14)

  $\frac{\partial J}{\partial W_{2j}} = \sum_{i=1}^{n_s} 2 W_{2j}'^\top \big(W_1'^\top A_i^{(s)} \circ B_i^{(s)}\big) \circ C_{ij}^{(s)} \xi_i^{(s)\top} + \sum_{i=1}^{n_t} 2 W_{2j}'^\top \big(W_1'^\top A_i^{(t)} \circ B_i^{(t)}\big) \circ C_{ij}^{(t)} \xi_i^{(t)\top}$
  $\quad - \frac{\beta}{n_{sj}} \Big(\sum_{i=1}^{n_{sj}} \xi_i^{(s)\top} - \sum_{i=1}^{n_s} \frac{e^{W_{2j} \xi_i^{(s)}}}{\sum_l e^{W_{2l} \xi_i^{(s)}}} \xi_i^{(s)\top}\Big) + 2\gamma W_{2j}$,   (15)

  $\frac{\partial J}{\partial W_2'} = \sum_{i=1}^{n_s} 2 W_1'^\top A_i^{(s)} \circ B_i^{(s)} z_i^{(s)\top} + \sum_{i=1}^{n_t} 2 W_1'^\top A_i^{(t)} \circ B_i^{(t)} z_i^{(t)\top} + 2\gamma W_2'$,   (16)

  $\frac{\partial J}{\partial W_1'} = \sum_{i=1}^{n_s} 2 A_i^{(s)} \hat{\xi}_i^{(s)\top} + \sum_{i=1}^{n_t} 2 A_i^{(t)} \hat{\xi}_i^{(t)\top} + 2\gamma W_1'$,   (17)

where $W_{2j}$ is the $j$-th row of $W_2$, and $n_{sj}$ is the number of instances with label $j$ in the source domain.
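Because hand-derived gradients are easy to get wrong, a finite-difference check is useful. The toy sketch below verifies the simplest case, Eq. (17): since only the reconstruction term and the regularizer involve $W_1'$, the check restricts the objective to those terms and to a single domain batch (combining both domains simply adds a second such sum). The toy sizes, names, and random initialization are ours.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
m, k, c, n = 6, 4, 3, 5                       # small toy sizes
X = rng.random((n, m))                        # one batch of instances (rows)
W1,  b1  = rng.normal(0, .1, (k, m)), np.zeros((k, 1))
W2,  b2  = rng.normal(0, .1, (c, k)), np.zeros((c, 1))
W2p, b2p = rng.normal(0, .1, (k, c)), np.zeros((k, 1))
W1p, b1p = rng.normal(0, .1, (m, k)), np.zeros((m, 1))
gamma = 1e-3

def forward(W1p_):
    Xi   = sigmoid(W1 @ X.T + b1)             # xi,    Eq. (9)
    Z    = sigmoid(W2 @ Xi + b2)              # z,     Eq. (9)
    Xih  = sigmoid(W2p @ Z + b2p)             # xi-hat, Eq. (10)
    Xhat = sigmoid(W1p_ @ Xih + b1p)          # x-hat,  Eq. (10)
    return Xih, Xhat

def J(W1p_):                                  # only the terms that touch W1'
    _, Xhat = forward(W1p_)
    return np.sum((Xhat - X.T) ** 2) + gamma * np.sum(W1p_ ** 2)

# analytic gradient, Eq. (17): sum_i 2 A_i xi-hat_i^T + 2 gamma W1'
Xih, Xhat = forward(W1p)
A = (Xhat - X.T) * Xhat * (1 - Xhat)          # columns are the A_i
grad_analytic = 2 * A @ Xih.T + 2 * gamma * W1p

# numerical gradient by central finite differences
grad_num, eps = np.zeros_like(W1p), 1e-6
for i in range(W1p.shape[0]):
    for j in range(W1p.shape[1]):
        E = np.zeros_like(W1p); E[i, j] = eps
        grad_num[i, j] = (J(W1p + E) - J(W1p - E)) / (2 * eps)

print(np.max(np.abs(grad_analytic - grad_num)))   # should be tiny (~1e-8)
```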

As the partial derivatives of the objective Eq. (7) w.r.t. $b_1$, $b_2$, $b_2'$, and $b_1'$ are very similar to those of $W_1$, $W_2$, $W_2'$, and $W_1'$, respectively, we omit the details due to the limit of space. Based on the above partial derivatives, we develop an alternately iterating algorithm that derives the solutions using the following update rules,

  $W_1 \leftarrow W_1 - \eta \frac{\partial J}{\partial W_1}, \quad b_1 \leftarrow b_1 - \eta \frac{\partial J}{\partial b_1}$,
  $W_1' \leftarrow W_1' - \eta \frac{\partial J}{\partial W_1'}, \quad b_1' \leftarrow b_1' - \eta \frac{\partial J}{\partial b_1'}$,
  $W_2 \leftarrow W_2 - \eta \frac{\partial J}{\partial W_2}, \quad b_2 \leftarrow b_2 - \eta \frac{\partial J}{\partial b_2}$,   (18)
  $W_2' \leftarrow W_2' - \eta \frac{\partial J}{\partial W_2'}, \quad b_2' \leftarrow b_2' - \eta \frac{\partial J}{\partial b_2'}$,

where $\eta$ is the step length, which determines the speed of convergence. The details of the proposed algorithm are summarized in Algorithm 1. Note that the proposed optimization problem is not convex, and thus there is no guarantee of obtaining a globally optimal solution. To achieve a better locally optimal solution, we first run stacked autoencoders (SAE) on all source- and target-domain data, and then use the output of the SAE to initialize the encoding and decoding weights.

Algorithm 1 Transfer Learning with Deep Autoencoders (TLDA)
Input: One source domain $D_s = \{x_i^{(s)}, y_i^{(s)}\}_{i=1}^{n_s}$ and one target domain $D_t = \{x_i^{(t)}\}_{i=1}^{n_t}$; trade-off parameters $\alpha$, $\beta$, $\gamma$; the numbers of nodes in the embedding layer and the label layer, $k$ and $c$.
Output: Results of the label layer $z$ and the embedding layer $\xi$.
1. Initialize $W_1$, $W_2$, $W_2'$, $W_1'$ and $b_1$, $b_2$, $b_2'$, $b_1'$ by stacked autoencoders performed on both the source and target domains;
2. Compute the partial derivatives of all variables according to Eqs. (14), (15), (16) and (17);
3. Iteratively update the variables using Eq. (18);
4. Repeat Steps 2 and 3 until the algorithm converges;
5. Compute the embedding layer $\xi$ and the label layer $z$ using Eq. (9), and then construct target classifiers as described in Section 3.3.
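Algorithm 1 can also be prototyped with automatic differentiation standing in for the hand-derived gradients of Eqs. (14)-(17), with plain SGD implementing the updates of Eq. (18). The PyTorch sketch below does exactly that; for brevity, random initialization replaces the stacked-autoencoder pre-training of step 1, the cross-entropy call plays the role of the label term $L(\theta, \xi^{(s)})$ with logits $W_2 \xi^{(s)}$, and all names and toy sizes are ours.

```python
import torch

def tlda_loss(Xs, ys, Xt, params, alpha, beta, gamma):
    """Differentiable version of Eq. (7); autograd supplies the gradients."""
    W1, b1, W2, b2, W2p, b2p, W1p, b1p = params
    def branch(X):
        xi   = torch.sigmoid(W1 @ X.T + b1)
        z    = torch.sigmoid(W2 @ xi + b2)
        xih  = torch.sigmoid(W2p @ z + b2p)
        xhat = torch.sigmoid(W1p @ xih + b1p)
        return xi, ((xhat - X.T) ** 2).sum()
    xi_s, rec_s = branch(Xs)
    xi_t, rec_t = branch(Xt)
    Ps = xi_s.mean(dim=1); Ps = Ps / Ps.sum()          # Eq. (12)
    Pt = xi_t.mean(dim=1); Pt = Pt / Pt.sum()          # Eq. (13)
    kl = (Ps * (Ps / Pt).log()).sum() + (Pt * (Pt / Ps).log()).sum()
    label = torch.nn.functional.cross_entropy((W2 @ xi_s).T, ys)
    reg = sum((p ** 2).sum() for p in params)
    return rec_s + rec_t + alpha * kl + beta * label + gamma * reg

# toy setup; the paper initializes the weights by stacked autoencoders
# (Algorithm 1, step 1) -- random initialization is used here for brevity
torch.manual_seed(0)
m, k, c, ns, nt = 20, 10, 3, 50, 60
Xs, Xt = torch.rand(ns, m), torch.rand(nt, m)
ys = torch.randint(0, c, (ns,))
shapes = [(k, m), (k, 1), (c, k), (c, 1), (k, c), (k, 1), (m, k), (m, 1)]
params = [torch.randn(s) * 0.1 for s in shapes]
for p in params:
    p.requires_grad_()

opt = torch.optim.SGD(params, lr=0.01)        # Eq. (18) with step size eta
for step in range(200):                       # Algorithm 1, steps 2-4
    opt.zero_grad()
    loss = tlda_loss(Xs, ys, Xt, params, alpha=0.5, beta=0.5, gamma=1e-5)
    loss.backward()
    opt.step()
```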
3.3 Classifier Construction

After all the parameters are learned, we can construct classifiers for the target domain in two ways. The first way is to directly use the output of the second hidden layer. That is, for any instance $x^{(t)}$ in the target domain, the output of the label layer $z^{(t)} = f(W_2 \xi^{(t)} + b_2)$ indicates the probabilities of the classes that $x^{(t)}$ belongs to, and we choose the label with the maximum probability as the prediction. The second way is to apply a standard classification algorithm, e.g., logistic regression (LR) [Snyman, 2005; Friedman and Rob, 2010], to train a classifier on the source domain in the embedding space; the classifier is then applied to predict class labels for the target-domain data. These two methods are denoted as TLDA1 and TLDA2, respectively.
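A sketch of the two prediction strategies follows. scikit-learn's LogisticRegression is used here as a stand-in for the LR classifier mentioned in the text (the paper does not specify an implementation), labels are assumed to be encoded as {0, ..., c-1}, and the function names are ours.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def tlda1_predict(Xt, W1, b1, W2, b2):
    """TLDA1: read labels directly off the label layer z = f(W2 xi + b2)."""
    xi = sigmoid(W1 @ Xt.T + b1)
    z  = sigmoid(W2 @ xi + b2)
    return np.argmax(z, axis=0)              # most probable class per instance

def tlda2_predict(Xs, ys, Xt, W1, b1):
    """TLDA2: train LR on the source embedding, predict on the target one."""
    embed = lambda X: sigmoid(W1 @ X.T + b1).T
    clf = LogisticRegression(max_iter=1000).fit(embed(Xs), ys)
    return clf.predict(embed(Xt))
```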

4 Experimental Evaluation

In this section, we conduct extensive experiments on three real-world image data sets to show the effectiveness of the proposed framework. Two of the three datasets involve binary classification, and the remaining one involves multi-class classification.

4.1 Datasets and Preprocessing

ImageNet Data Set1 contains five domains, i.e., D1 (ambulance+scooter), D2 (taxi+scooter), D3 (jeep+scooter), D4 (minivan+scooter) and D5 (passenger car+scooter). Data from different domains come from different categories, e.g., taxi from D2 and jeep from D3; therefore this dataset is proper for the study of transfer learning. To construct classification problems, we randomly choose two of the five domains, where one is considered as the source domain and the other as the target domain. Therefore, we construct 20 ($P_5^2$) transfer learning classification problems. Statistics of this dataset are shown in Table 2.

Table 2: Description of the ImageNet dataset
                       D1     D2     D3     D4     D5
 #positive instances   1510   1326   1415   1555   986
 #negative instances   1427   1427   1427   1427   1427
 #features             1000   1000   1000   1000   1000

Corel Data Set2 [Zhuang et al., 2010] includes two different top categories, flower and traffic. Each top category further consists of four subcategories. We use flower as positive instances and traffic as negative ones. To construct the transfer learning classification problems, we randomly select one subcategory from flower and one from traffic as the source domain, and then choose another subcategory of flower and another one of traffic from the remaining subcategories to construct the target domain. In this way, we can construct 144 ($P_4^2 \cdot P_4^2$) transfer learning classification problems.

Leaves Data Set [Mallah and Orwell, 2013] includes 100 plant species that are divided into 32 different genera, and each species has 16 instances. We choose four genera with more than four plant species to construct 4-class classification problems, and use 64 shape-descriptor features to represent an instance. Each genus is regarded as a domain. Similar to the construction of the ImageNet dataset, we can construct 12 ($P_4^2$) 4-class classification problems.

4.2 Baseline Methods

We compare our methods with the following baselines:
• The supervised learning algorithm Logistic Regression (LR) [Friedman and Rob, 2010] without transfer learning.
• Transfer component analysis (TCA) [Pan et al., 2011], which aims at learning a low-dimensional representation for transfer learning. Here we also use logistic regression as the base classifier.
• Transfer learning based on stacked autoencoders, namely the marginalized Stacked Denoising Autoencoders (mSDA) method [Chen et al., 2012].

Implementation Details: After some preliminary experiments, we set α = 0.5, β = 0.5, γ = 0.00001 and k = 10 for the ImageNet and Corel datasets, and β = 0.05, k = 5 and γ = 0.0001 for the Leaves dataset. For mSDA, we use the authors' source code3 and adopt the default parameters as reported in [Chen et al., 2012]. For TCA, the number of latent dimensions is carefully tuned, e.g., for the Corel dataset, the number is sampled from [10, 80] with an interval of 10, and its best results are reported.

[Figure 2: Classification accuracy on the ImageNet dataset. Accuracy (%) of LR, TCA, mSDA, TLDA1 and TLDA2 over the 20 problem instances.]

Table 3: Average results (%) on the three data sets
                  LR     TCA    mSDA   TLDA1   TLDA2
 ImageNet Data Set
  Left            67.0   64.3   67.6   83.4    87.4
  Right           81.2   76.3   84.1   89.0    90.2
  Total           80.5   75.7   83.3   88.7    90.1
 Corel Data Set
  Left            61.7   65.4   70.5   71.1    74.0
  Right           80.1   82.0   75.4   83.2    83.0
  Total           74.8   76.5   74.0   79.6    80.4
 Leaves Data Set
  Left            51.9   65.9   47.2   64.1    57.8
  Right           75.0   89.8   59.4   91.4    89.8
  Total           55.7   69.9   49.2   68.6    63.2

4.3 Experimental Results

All the results on the three data sets are shown in Figure 2 and Table 3. Figure 2 shows the results over the 20 classification problems on the ImageNet dataset, in which the x-axis represents the index of the problems and the y-axis represents the corresponding accuracy. From the figure, we have the following observations:
• TLDA is significantly better than LR on every problem, which indicates the effectiveness of our proposed transfer learning framework.
• TLDA performs better than TCA, which shows the superiority of applying deep autoencoders to learn a good representation for transfer learning. TLDA also outperforms mSDA, which indicates the effectiveness of incorporating label information from the source domain.
• LR performs only slightly worse than mSDA, and even better than TCA. This may be because, on the constructed cross-domain classification problems, it is not easy to make knowledge transfer succeed. This observation again validates the effectiveness of our methods.

1 http://www.image-net.org/download-features
2 http://archive.ics.uci.edu/ml/datasets/Corel+Image+Features
3 http://www.cse.wustl.edu/~mchen/

[Figure 3: The study of parameter influence on TLDA1: (a) the parameter influence of α, (b) the parameter influence of β, (c) the parameter influence of k. Each panel plots accuracy (%) against the parameter value for 10 problem instances from the Corel dataset.]

We also divide the constructed problems into two groups: the first group consists of problems on which the classification accuracy of LR is lower than 70%, and the remaining problems are considered as the second group. A lower classification accuracy of LR on a problem indicates a higher degree of difficulty in knowledge transfer. The averaged accuracies of these two groups, as well as the averaged accuracy over all problems, on the three datasets are reported in Table 3. We can find that the proposed methods perform better than all the compared algorithms on both groups of problems, except that on the Leaves dataset the performance of TLDA1 is only comparable with that of TCA.

4.4 Parameter Sensitivity

In this section, we investigate the influence of the parameters α, β and k in the objective Eq. (7). In this experiment, when tuning one parameter, the values of the other two are fixed. Specifically, α and β are sampled from {0.01, 0.05, 0.1, 0.5, 1, 5, 10, 50, 100}, and k is selected from {10, 20, ..., 80}. We select 10 out of the 144 problems on the Corel dataset for this experiment, and report the results in Figure 3. From the figure, we can observe that the performance of TLDA1 is relatively stable with respect to the selection of α and β, while it decreases dramatically when the value of k is large. Thus we set α = 0.5, β = 0.5 and k = 10 to achieve good and stable results for the ImageNet and Corel datasets.
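The one-parameter-at-a-time protocol described above can be written as a small sweep loop. In the sketch below, evaluate and selected_problems are hypothetical placeholders for a routine that trains and evaluates TLDA on one transfer problem and for the 10 chosen Corel problems, respectively; the grids and defaults are the ones stated in the text.

```python
# One-at-a-time sensitivity protocol of Section 4.4: vary a single
# hyper-parameter over its grid while the other two keep their defaults.
ALPHA_GRID = [0.01, 0.05, 0.1, 0.5, 1, 5, 10, 50, 100]
BETA_GRID  = ALPHA_GRID
K_GRID     = list(range(10, 81, 10))
DEFAULTS   = {"alpha": 0.5, "beta": 0.5, "k": 10}

def sweep(evaluate, problems, name, grid, defaults=DEFAULTS):
    """`evaluate(problem, alpha=..., beta=..., k=...)` is assumed to train
    TLDA on one transfer problem and return target-domain accuracy."""
    curves = {}
    for value in grid:
        cfg = dict(defaults, **{name: value})
        curves[value] = [evaluate(p, **cfg) for p in problems]
    return curves  # parameter value -> accuracies over the selected problems

# e.g. curves_alpha = sweep(evaluate, selected_problems, "alpha", ALPHA_GRID)
```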
5 Related Work

Poultney et al. [2006] proposed an unsupervised method with an energy-based model for learning sparse and overcomplete features. In their method, the decoder produces accurate reconstructions of the patches, while the encoder provides a fast prediction of the code without the need for any particular preprocessing of the inputs. Vincent and Manzagol [2008] proposed denoising autoencoders to learn a more robust representation from an artificially corrupted input, and further proposed stacked denoising autoencoders [Vincent et al., 2010] to learn useful representations through a deep network.

Transfer learning has attracted much attention in the past decade. To reduce the difference between domains, two categories of transfer learning approaches have been proposed. One is based on the instance level, which aims to learn weights for the source-domain labeled data such that the re-weighted source-domain instances look similar to the target-domain instances [Dai et al., 2007b; Gao et al., 2008; Xing et al., 2007; Jiang and Zhai, 2007; Zhuang et al., 2010; Crammer et al., 2012]. The other is based on the feature representation level, which aims to learn a new feature representation for both the source- and target-domain data, such that with the new feature representation the difference between domains can be reduced [Blitzer et al., 2006; Dai et al., 2007a; Pan et al., 2008; Si et al., 2010; Pan et al., 2011; Xavier and Bengio, 2011; Chen et al., 2012; Zhuang et al., 2014].

Among feature-based transfer learning methods, only a few aim to minimize the difference between domains explicitly when learning the new feature representation. For instance, maximum mean discrepancy embedding (MMDE) [Pan et al., 2008] and transfer component analysis (TCA) [Pan et al., 2011] try to minimize the distance between the domain distributions in a kernel Hilbert space. The transfer subspace learning framework proposed by [Si et al., 2010] tries to find a subspace in which the distributions of the source- and target-domain data are similar, through a minimization of the KL divergence of the projected instances between domains. However, these methods are either based on kernel methods or on regularization frameworks, rather than exploring a deep architecture to learn feature representations for transfer learning. Different from previous works, our proposed TLDA is a supervised representation learning method based on deep learning, which takes both distance minimization between domains and label encoding of the source domain into consideration.

6 Conclusion

In this paper, we proposed a supervised representation learning framework for transfer learning with deep autoencoders. In this framework, the well-known representation learning model, the autoencoder, is considered, and we extend it to a deeper architecture. There are two layers for encoding: one is the embedding layer, where we impose a KL-divergence constraint to draw the distributions of the source and target domains closer; the other is the label layer, through which we can easily incorporate the label information of the source domain. Finally, we conducted a series of experiments on three real-world image data sets, and all the results demonstrate the effectiveness of the proposed methods.

Acknowledgments
This work is supported by the National Natural Science Foundation of China (No. 61473273, 61473274, 61175052, 61203297) and the National High-tech R&D Program of China (863 Program) (No. 2014AA015105, 2013AA01A606). Sinno J. Pan is supported by the NTU Singapore Nanyang Assistant Professorship (NAP) grant M4081532.020.

References
[Bengio, 2009] Yoshua Bengio. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1):1–127, 2009.
[Blitzer et al., 2006] John Blitzer, Ryan McDonald, and Fernando Pereira. Domain adaptation with structural correspondence learning. In Proceedings of the 2006 Conference on EMNLP, pages 120–128, 2006.
[Chen et al., 2012] Minmin Chen, Zhixiang Eddie Xu, Kilian Q. Weinberger, and Fei Sha. Marginalized denoising autoencoders for domain adaptation. In Proceedings of the 29th ICML, 2012.
[Crammer et al., 2012] Koby Crammer, Mark Dredze, and Fernando Pereira. Confidence-weighted linear classification for text categorization. The Journal of Machine Learning Research, 13(1):1891–1926, 2012.
[Dai et al., 2007a] W. Y. Dai, G. R. Xue, Q. Yang, and Y. Yu. Co-clustering based classification for out-of-domain documents. In Proceedings of the 13th ACM SIGKDD, 2007.
[Dai et al., 2007b] W. Y. Dai, Q. Yang, G. R. Xue, and Y. Yu. Boosting for transfer learning. In Proceedings of the 24th ICML, 2007.
[Friedman and Rob, 2010] Jerome Friedman, Trevor Hastie, and Rob Tibshirani. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1):1, 2010.
[Gao et al., 2008] J. Gao, W. Fan, J. Jiang, and J. W. Han. Knowledge transfer via multiple model local structure mapping. In Proceedings of the 14th ACM SIGKDD, 2008.
[Jiang and Zhai, 2007] Jing Jiang and Chengxiang Zhai. Instance weighting for domain adaptation in NLP. In ACL, pages 264–271, 2007.
[Joey Tianyi Zhou and Yan, 2014] Joey Tianyi Zhou, Sinno Jialin Pan, Ivor W. Tsang, and Yan Yan. Hybrid heterogeneous transfer learning through deep learning. In Proceedings of the 28th AAAI, pages 2213–2220, 2014.
[Kullback, 1987] Solomon Kullback. Letter to the editor: The Kullback-Leibler distance. 1987.
[Liddle et al., 2010] Andrew R. Liddle, Pia Mukherjee, and David Parkinson. Model selection and multi-model inference. Bayesian Methods in Cosmology, 1:79, 2010.
[Mallah and Orwell, 2013] Charles Mallah, James Cope, and James Orwell. Plant leaf classification using probabilistic integration of shape, texture and margin features. In Signal Processing, Pattern Recognition and Applications, 2013.
[Pan and Yang, 2010] Sinno Jialin Pan and Qiang Yang. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10):1345–1359, 2010.
[Pan et al., 2008] S. J. Pan, J. T. Kwok, and Q. Yang. Transfer learning via dimensionality reduction. In Proceedings of the 23rd AAAI, 2008.
[Pan et al., 2011] Sinno Jialin Pan, Ivor W. Tsang, James T. Kwok, and Qiang Yang. Domain adaptation via transfer component analysis. IEEE Transactions on Neural Networks, 22(2):199–210, 2011.
[Poultney et al., 2006] Christopher Poultney, Sumit Chopra, Yann L. Cun, et al. Efficient learning of sparse representations with an energy-based model. In Advances in Neural Information Processing Systems, pages 1137–1144, 2006.
[Si et al., 2010] Si Si, Dacheng Tao, and Bo Geng. Bregman divergence-based regularization for transfer subspace learning. IEEE Transactions on Knowledge and Data Engineering, 22(7):929–942, 2010.
[Snyman, 2005] Jan Snyman. Practical Mathematical Optimization: An Introduction to Basic Optimization Theory and Classical and New Gradient-Based Algorithms, volume 97. Springer Science & Business Media, 2005.
[Vincent and Manzagol, 2008] Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th International Conference on Machine Learning, pages 1096–1103. ACM, 2008.
[Vincent et al., 2010] Pascal Vincent, Hugo Larochelle, Isabelle Lajoie, Yoshua Bengio, and Pierre-Antoine Manzagol. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. The Journal of Machine Learning Research, 11:3371–3408, 2010.
[Xavier and Bengio, 2011] Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Domain adaptation for large-scale sentiment classification: A deep learning approach. In Proceedings of the 28th ICML, pages 513–520, 2011.
[Xing et al., 2007] D. K. Xing, W. Y. Dai, G. R. Xue, and Y. Yu. Bridged refinement for transfer learning. In Proceedings of the 10th PAKDD, 2007.
[Zhuang et al., 2010] Fuzhen Zhuang, Ping Luo, Hui Xiong, Yuhong Xiong, Qing He, and Zhongzhi Shi. Cross-domain learning from multiple sources: A consensus regularization perspective. IEEE Transactions on Knowledge and Data Engineering, 22(12):1664–1678, 2010.
[Zhuang et al., 2014] Fuzhen Zhuang, Xiaohu Cheng, Sinno Jialin Pan, Wenchao Yu, Qing He, and Zhongzhi Shi. Transfer learning with multiple sources via consensus regularized autoencoders. In Machine Learning and Knowledge Discovery in Databases, pages 417–431. Springer, 2014.
