
Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (AAAI-17)

A Hybrid Collaborative Filtering Model with Deep Structure for Recommender Systems

Xin Dong, Lei Yu, Zhonghuo Wu, Yuxia Sun, Lingfeng Yuan, Fangxi Zhang
Ctrip Travel Network Technology (Shanghai) Co., Limited, Shanghai, P.R. China
{dongxin, yu lei, zh wu, yx sun, lfyuan, fxzhang}@ctrip.com

Abstract

Collaborative filtering (CF) is a widely used approach in recommender systems to solve many real-world problems. Traditional CF-based methods employ the user-item matrix, which encodes the individual preferences of users for items, to learn to make recommendations. In real applications, the rating matrix is usually very sparse, causing CF-based methods to degrade significantly in recommendation performance. In this case, some improved CF methods utilize the increasing amount of side information to address the data sparsity problem as well as the cold start problem. However, the learned latent factors may not be effective due to the sparse nature of the user-item matrix and the side information. To address this problem, we utilize advances in learning effective representations from deep learning, and propose a hybrid model which jointly performs deep learning of users' and items' latent factors from side information and collaborative filtering from the rating matrix. Extensive experimental results on three real-world datasets show that our hybrid model outperforms other methods in effectively utilizing side information and achieves performance improvements.

Introduction

In recent years, with the growing number of choices available online, recommender systems are becoming more and more indispensable. The goal of recommender systems is to help users identify the items that best fit their personal tastes from a large repository of items. Many commerce companies also use recommender systems to target their customers by recommending items. Over the years, various algorithms for recommender systems have been developed. Such algorithms can roughly be categorized into two groups (Shi, Larson, and Hanjalic 2014): content-based and collaborative filtering (CF) based methods. Content-based methods (Lang 1995) utilize user profile or item content information for recommendation. CF-based methods (Salakhutdinov and Mnih 2011), on the other hand, ignore user or item content information and use past activities or preferences, such as a user's buying/viewing history or ratings on items, to make recommendations. Nevertheless, CF-based methods are often preferred to content-based methods because of their impressive performance (Su and Khoshgoftaar 2009).

The most successful approach among CF-based methods is to learn effective latent factors directly from the user-item rating matrix by matrix factorization techniques (Koren et al. 2009). However, the rating matrix is often very sparse in the real world, causing CF-based methods to degrade significantly in learning the appropriate latent factors. This phenomenon is particularly serious on online travel agent (OTA) websites such as Ctrip.com, since users access these websites with lower frequency. Moreover, another limitation of CF-based methods is how to provide recommendations when a new item arrives in the system, which is also known as the cold start problem: the system cannot recommend new items that have not yet received any rating information from users.

In order to overcome the cold start and data sparsity problems, it is inevitable for CF-based methods to exploit additional sources of information about the users or items, also known as side information, and hence hybrid CF methods have gained popularity in recent years (Shi, Larson, and Hanjalic 2014). The side information can be obtained from user profiles and item content, such as demographics of users, properties of items, etc. Some hybrid CF-based methods (Singh and Gordon 2008; Nickel, Tresp, and Kriegel 2011; Wang and Blei 2011) have integrated side information into matrix factorization to learn effective latent factors. However, these methods employ the side information only as regularization, and the learned latent factors are often not effective, especially when both the rating matrix and the side information are very sparse (Agarwal, Chen, and Long 2011). Therefore, it is highly desirable to solve this latent factor learning problem on such datasets.

Recently, one of the most powerful methods to learn effective representations is deep learning (Hinton and Salakhutdinov 2006; Hinton, Osindero, and Teh 2006). Thus, with large-scale ratings and rich additional side information, it is natural to integrate deep learning into recommender systems to learn latent factors. Several studies have applied deep learning directly to the task of collaborative filtering. Salakhutdinov, Mnih, and Hinton (2007) employ restricted Boltzmann machines to perform CF. Although this method combines deep learning and CF, it does not incorporate side information, which is crucial for accurate recommendation. Other work (Van den Oord, Dieleman, and Schrauwen 2013; Wang and Wang 2014) directly uses convolutional neural networks (CNN) or deep belief networks (DBN) to obtain latent factors for content information, but these are content-based methods which only infer latent factors for items, and they are especially suited to music datasets. Furthermore, work that applies Bayesian stacked denoising autoencoders (SDAE) or marginalized SDAE to CF (Wang, Wang, and Yeung 2015; Li, Kawale, and Fu 2015) requires learning a large number of manually adjusted hyper-parameters.

In this paper, to address the challenges above, we propose a hybrid collaborative filtering model with deep structure for recommender systems. We first present a novel deep learning model called the additional stacked denoising autoencoder (aSDAE), which extends the stacked denoising autoencoder to integrate additional side information into the inputs, thereby alleviating the cold start and data sparsity problems. With this, we then present our hybrid model, which tightly couples deep representation learning for the additional side information with collaborative filtering for the rating matrix. Experiments show that our hybrid model significantly outperforms the state of the art. Specifically, the main contributions of this paper can be summarized in the following three aspects:

• We propose a hybrid collaborative filtering model which integrates deep representation learning and matrix factorization. It simultaneously extracts effective latent factors from side information and captures the implicit relationship between users and items.

• We present a novel deep learning model, aSDAE, which is a variant of SDAE and can efficiently integrate side information into the learned latent factors.

• We conduct experiments on three real-world datasets to evaluate the effectiveness of our hybrid model. Experimental results show that our hybrid model outperforms four state-of-the-art methods in terms of root mean squared error (RMSE) and recall metrics.
Preliminaries

In this section, we first formulate the problem discussed in this paper, and then briefly review matrix factorization.

Problem Definition

Similar to some existing works (Hu, Koren, and Volinsky 2008), this paper takes implicit feedback as training and testing data to complete the recommendation task. In a standard recommendation setting, we have m users, n items, and an extremely sparse rating matrix R ∈ R^{m×n}. Each entry R_ij of R corresponds to user i's rating on item j. If R_ij ≠ 0, the rating of user i on item j is observed; otherwise it is unobserved. Each user i can be represented by a partially observed vector s_i^{(u)} = (R_{i1}, ..., R_{in}) ∈ R^n. Identically, each item j can be represented by a partially observed vector s_j^{(i)} = (R_{1j}, ..., R_{mj}) ∈ R^m. Moreover, the additional side information matrices of users and items are denoted by X ∈ R^{m×p} and Y ∈ R^{n×q}, respectively.

Let u_i, v_j ∈ R^k be user i's latent factor vector and item j's latent factor vector respectively, where k is the dimensionality of the latent space. The corresponding matrix forms of the latent factors for users and items are U = u_{1:m} and V = v_{1:n}, respectively. Given the sparse rating matrix R and the side information matrices X and Y, the goal is to learn the user latent factors U and the item latent factors V, and hence to predict the missing ratings in R.

Matrix Factorization

An effective collaborative filtering approach is matrix factorization (Koren et al. 2009). By factorizing the user-item interaction matrix, matrix factorization maps both users and items to a joint latent factor space, so that user-item interactions are modeled as inner products in that space. Formally, matrix factorization decomposes the original rating matrix R into two low-rank matrices U and V consisting of the user and item latent factor vectors respectively, such that R ≈ UV^T. Given the latent factor vectors of a user and an item, the user's rating for the item is predicted by the inner product of those vectors.

The objective function of matrix factorization can be written as:

arg min_{U,V} L(R, UV^T) + λ(||U||_F^2 + ||V||_F^2),

where L(·,·) is a loss function that measures the distance between two matrices of the same size, the last two terms are regularizations used to avoid overfitting, and ||·||_F denotes the Frobenius norm. By specifying different L(·,·), many matrix factorization models have been proposed, for example, non-negative matrix factorization (Lee and Seung 2001), probabilistic matrix factorization (Salakhutdinov and Mnih 2011), Bayesian probabilistic matrix factorization (Salakhutdinov and Mnih 2008), max-margin matrix factorization (Srebro, Rennie, and Jaakkola 2004), etc.

When side information is available, some matrix factorization models generate a rating from the product of latent factor vectors which contain additional information about users or items. Various models show that additional side information can act as a useful informative prior that can significantly improve results (Porteous, Asuncion, and Welling 2010; Singh and Gordon 2008).
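As a concrete illustration of the objective above, the following is a minimal sketch (ours, not the authors' code) of matrix factorization with a squared loss, trained by stochastic gradient descent over the observed entries only; all names and defaults are illustrative.

```python
import numpy as np

def mf_sgd(R, k=64, lam=0.01, eta=0.004, epochs=50):
    """Sketch: minimize sum over observed (i,j) of (R_ij - u_i . v_j)^2
    plus lam * (||U||_F^2 + ||V||_F^2) by SGD, assuming a dense R with
    zeros marking unobserved entries."""
    m, n = R.shape
    U = 0.01 * np.random.randn(m, k)
    V = 0.01 * np.random.randn(n, k)
    rows, cols = R.nonzero()                 # indices of observed ratings
    for _ in range(epochs):
        for i, j in zip(rows, cols):
            err = R[i, j] - U[i] @ V[j]      # prediction error on one entry
            ui = U[i].copy()                 # keep old u_i for the v_j update
            U[i] += eta * (err * V[j] - lam * U[i])
            V[j] += eta * (err * ui - lam * V[j])
    return U, V
```

In practice one would shuffle the observed entries each epoch and stop when the validation error no longer improves.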
Experi- nov and Mnih 2011), Bayesian probabilistic matrix factor- mental results show that our hybrid model outperforms ization (Salakhutdinov and Mnih 2008), max-margin matrix four state-of-art methods in terms of root mean squared factor (Srebro, Rennie, and Jaakkola 2004), etc. error(RMSE) and recall metrics. When side information are available, some matrix factor- ization models generate a rating from the product of latent Preliminaries factor vectors which contain additional information about users or items. Various models show that additional side in- In this section, we start with formulating the problem dis- formation can act as a useful informative prior that can sig- cussed in this paper, and then have a brief view on matrix nificantly improve results (Porteous, Asuncion, and Welling factorization. 2010; Singh and Gordon 2008). Problem Definition Additional Stacked Denoising Similar to some existing works(Hu, Koren, and Volinsky 2008), this paper also takes implicit feedback as training and In this section we first provide a introduction of additional testing data to complete the recommendation task. In a stan- denoising autoencoder and then give a detailed description dard recommendation setting, we have m users, n items, and of additional stacked denoising autoencoder(aSDAE). an extremely sparse rating matrix R ∈ Rm×n. Each entry Rij of R corresponds to user i’s rating on item j.IfRij =0 , Additional Denoising Autoencoder it means the rating about user i on item j is observed, oth- An autoencoder is a specific form of neural network, which erwise unobserved. Each user i can be represented by a par- consists of an encoder and a decoder component. The en- (u) ∈ Rn tially observed vector si =(Ri1, ..., Rin) . Identi- coder g(·) takes a given input s and maps it to a hidden rep- cally, each item j can be represented by a partially observed resentation g(s), while the decoder f(·) maps this hidden (i) ∈ Rm vector sj =(R1j , ..., Rmj) . Moreover, the addi- representation back to a reconstructed version of s, such that tional side information matrix of user and item are denoted f(g(s)) ≈ s. The parameters of the autoencoder are learned by X ∈ Rm×p and Y ∈ Rn×q, respectively. to minimize the reconstruction error, measured by some loss

A denoising autoencoder (DAE) incorporates a slight modification to this setup: it reconstructs the input from a corrupted version of it, with the motivation of learning a more effective representation (Vincent et al. 2008). A denoising autoencoder is trained to reconstruct the original input s from its corrupted version s̃ by minimizing L(s, f(g(s̃))). Usual choices of corruption include additive isotropic Gaussian noise or binary masking noise (Vincent et al. 2008). Moreover, various types of autoencoders have been developed and have shown promising results in several domains (Chen et al. 2012; Lee et al. 2009).

In this paper, we extend the denoising autoencoder to integrate additional side information into the inputs, as shown in Figure 1(a). Given a sample set S = [s_1, ..., s_n] and the corresponding side information set X = [x_1, ..., x_n], the additional denoising autoencoder (aDAE) considers random corruptions over S and X to obtain S̃ and X̃. It then encodes and decodes the inputs as follows:

h = g(W_1 s̃ + V_1 x̃ + b),
ŝ = f(W_2 h + b'),
x̂ = f(V_2 h + b'),

where h represents the hidden latent representation of the inputs, s̃ and x̃ represent the corrupted versions of the original inputs s and x, {W, V} are weight matrices, {b, b'} are bias vectors, and g(·) and f(·) are activation functions such as tanh(·). Ŝ and X̂ represent the reconstructions of S and X.

The objective function considers the losses between all the inputs and their reconstructions. An aDAE therefore solves the following optimization problem:

arg min_{W,V,b} α ||S − Ŝ||_F^2 + (1 − α) ||X − X̂||_F^2 + λ (||W||_F^2 + ||V||_F^2),   (1)

where α is a trade-off parameter which balances the two reconstruction losses and λ is a regularization parameter.

Figure 1: The models of aDAE and aSDAE. (a) Additional Denoising Autoencoder (aDAE); (b) Additional Stacked Denoising Autoencoder (aSDAE).

Additional Stacked Denoising Autoencoder

Existing literature has shown that multiple stacked layers can generate rich representations in the hidden layers, and therefore lead to better performance on various tasks (Rifai et al. 2011; Glorot, Bordes, and Bengio 2011; Chen et al. 2012; Kavukcuoglu et al. 2009). The stacked denoising autoencoder (SDAE) stacks several DAEs together to create higher-level representations (Vincent et al. 2010). Inspired by the stacked denoising autoencoder, we stack multiple aDAEs together to form the additional stacked denoising autoencoder (aSDAE). The model of aSDAE is shown in Figure 1(b), and its generative process is as follows:

• For each hidden layer l ∈ {1, ..., L − 1} of the aSDAE model, the hidden representation h_l is computed as

  h_l = g(W_l h_{l−1} + V_l x̃ + b_l),

  where h_0 = s̃ is one of the corrupted inputs.

• For the output layer L, the outputs are computed as

  ŝ = f(W_L h_{L−1} + b_L),
  x̂ = f(V_L h_{L−1} + b_L).

Note that the first L/2 layers of the model act as an encoder and the last L/2 layers act as a decoder. The aSDAE employs a deep network to reconstruct the inputs and minimizes the squared loss between the inputs and their reconstructions; the objective function of aSDAE is similar to Equation (1). Accordingly, we can learn {W_l, V_l, b_l} for each layer using the back-propagation algorithm. In our aSDAE model, we assume that only one hidden layer should be close to the latent factor, and the latent factor vector is generated from the middle (L/2-th) layer, given the total number of layers L.

A Hybrid Collaborative Filtering Model

In some CF-based methods, the main challenges are to infer effective and high-level latent factor vectors for users and items from raw inputs. MF-based methods are able to meet these requirements so as to capture the implicit relationship between users and items; however, they suffer from the cold start and data sparsity problems. Moreover, deep learning models have been shown to be highly effective in discovering high-level hidden representations from raw input data for a variety of tasks (Shen et al. 2014; Li, Kawale, and Fu 2015; Wang, Wang, and Yeung 2015). Therefore, it is straightforward to borrow the expressive ability of deep learning to improve collaborative filtering algorithms. In this section, we propose a hybrid collaborative filtering model which unifies our aSDAE model with matrix factorization for recommender systems.

Overview

The proposed model is a hybrid model which makes use of both the rating matrix and the side information, and combines aSDAE and matrix factorization together. Matrix factorization is a widely used model-based CF method with excellent scalability and accuracy, and aSDAE is a powerful way to extract high-level representations from raw inputs. The combination of these two learning models leverages their benefits to build a more expressive model.

Given the user-item rating matrix R, we first transform R into the set S^{(u)} containing m instances {s_1^{(u)}, ..., s_m^{(u)}}, where s_i^{(u)} = (R_{i1}, ..., R_{in}) is the n-dimensional feedback vector of user i on all the items. Similarly, we obtain the set S^{(i)} with n instances {s_1^{(i)}, ..., s_n^{(i)}}, where s_j^{(i)} = (R_{1j}, ..., R_{mj}) is the m-dimensional feedback vector of item j rated by all the users. Our hybrid model learns the user and item latent factors (i.e., U and V) from R, S^{(u)}, S^{(i)} and the additional side information (i.e., X and Y) through the following optimization objective:

arg min_{U,V} L_R(R, UV^T) + λ(||U||_F^2 + ||V||_F^2) + β L(S^{(u)}, X, U) + δ L(S^{(i)}, Y, V),   (2)

where L_R(·,·) is the loss function for decomposing the rating matrix R into the two latent factor matrices U and V, L(·,·,·) is the function that connects the user or item side information with the latent factors, β and δ are trade-off parameters, and λ is a regularization parameter. Note that the last two terms in Equation (2) are devised using our aSDAE model, which extracts a latent factor matrix from the rating matrix and the additional side information.
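To make the layer-wise computation above concrete, here is a minimal numpy sketch (ours) of one aSDAE forward pass under the generative process just described. We assume sigmoid activations and, since ŝ and x̂ have different dimensions, a separate output bias for the side-information reconstruction (the equations above write a single b_L); all shapes and names are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def asdae_forward(s_tilde, x_tilde, W, V, b, b_x, L):
    """One aSDAE forward pass: every hidden layer l receives the corrupted
    side information x_tilde in addition to the previous hidden state.
    W, V, b are length-L lists of appropriately shaped weights/biases;
    b_x is the output bias for x_hat (an assumption, see lead-in)."""
    h = s_tilde                                   # h_0 = corrupted rating vector
    hiddens = []
    for l in range(L - 1):                        # hidden layers 1 .. L-1
        h = sigmoid(W[l] @ h + V[l] @ x_tilde + b[l])
        hiddens.append(h)
    s_hat = sigmoid(W[L - 1] @ h + b[L - 1])      # reconstruction of s
    x_hat = sigmoid(V[L - 1] @ h + b_x)           # reconstruction of x
    latent = hiddens[L // 2 - 1]                  # latent factor: the L/2-th layer
    return s_hat, x_hat, latent
```

The reconstruction loss α ||s − ŝ||^2 + (1 − α) ||x − x̂||^2 can then be computed from the returned pair, as in Equation (1).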

Our Hybrid Model

Let S^{(u)} ∈ R^{m×n} and S^{(i)} ∈ R^{n×m} denote the matrices obtained in the above section, and let S̃^{(u)} and S̃^{(i)} denote their corrupted versions respectively. Moreover, X ∈ R^{m×p} and Y ∈ R^{n×q} are the additional side information matrices about users and items respectively, and their corresponding corrupted versions are X̃ and Ỹ. Figure 2 illustrates our hybrid collaborative filtering model. It indicates that the inputs of the hybrid model are S̃^{(u)}, S̃^{(i)}, X̃, Ỹ and R.

Figure 2: The structure of the proposed hybrid model. The model contains three components: the upper and lower components are two aSDAEs which extract latent factor vectors for users and items respectively; the middle component decomposes the rating matrix R into two latent factor matrices.

As shown in Equation (2), the first term is the loss function of matrix factorization, which decomposes the rating matrix R into the user and item latent factor matrices, i.e.,

L_R(R, UV^T) = Σ_{i,j} I_ij (R_ij − u_i v_j^T)^2,

where I is an indicator matrix indicating the non-empty entries in R.

The last two terms are the loss functions of our aSDAE models, which extract latent factors from the hidden layers for users and items respectively. For simplicity, we set β and δ to 1 in Equation (2). Therefore, the objective function of our hybrid model is formulated as follows:

L = Σ_{i,j} I_ij (R_ij − u_i v_j^T)^2 + α_1 Σ_i (s_i^{(u)} − ŝ_i^{(u)})^2 + (1 − α_1) Σ_i (x_i − x̂_i)^2 + α_2 Σ_j (s_j^{(i)} − ŝ_j^{(i)})^2 + (1 − α_2) Σ_j (y_j − ŷ_j)^2 + λ · f_reg,   (3)

where α_1, α_2 are trade-off parameters and f_reg is the regularization term that prevents overfitting, i.e.,

f_reg = Σ_i ||u_i||_F^2 + Σ_j ||v_j||_F^2 + Σ_l (||W_l||_F^2 + ||V_l||_F^2 + ||b_l||_F^2 + ||W'_l||_F^2 + ||V'_l||_F^2 + ||b'_l||_F^2),

where W_l, V_l and W'_l, V'_l are the weight matrices of the two aSDAEs at layer l, and b_l and b'_l are the corresponding bias vectors.

Generally, the middle layers of the two aSDAEs serve as bridges between the ratings and the additional side information. These two middle layers are the key that enables our hybrid model to simultaneously learn effective latent factors and capture the similarity and relationship between users and items.

Optimization

Although the optimization of the objective function is not jointly convex in all the variables, it is convex in each of them when the others are fixed. Therefore, we can alternately optimize each of the variables in the objective function.

For u_i and v_j, we use the stochastic gradient descent (SGD) algorithm to learn these latent factors. For simplicity, we let L(U, V) denote the objective function when the other variables, irrelevant to U and V, are fixed. The update rules are:

u_i = u_i − η ∂L(U, V)/∂u_i,
v_j = v_j − η ∂L(U, V)/∂v_j,

where η is the learning rate, and the detailed gradients are as follows:

∂L(U, V)/∂u_i = α Σ_i (s_i^{(u)} − ŝ_i^{(u)}) ∂ŝ_i^{(u)}/∂u_i + (1 − α) Σ_i (x_i − x̂_i) ∂x̂_i/∂u_i − Σ_{(i,j) ∈ I} (R_ij − u_i v_j^T) v_j + λ u_i,

∂L(U, V)/∂v_j = α Σ_j (s_j^{(i)} − ŝ_j^{(i)}) ∂ŝ_j^{(i)}/∂v_j + (1 − α) Σ_j (y_j − ŷ_j) ∂ŷ_j/∂v_j − Σ_{(i,j) ∈ I} (R_ij − u_i v_j^T) u_i + λ v_j.
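Putting the two alternating steps together, the following sketch (ours) shows how the latent-factor updates above interleave with back-propagation on the aSDAE weights; `reconstruct_grad` and `backprop_step` are hypothetical placeholders standing in for the corresponding gradient computations, not an API from the paper.

```python
def train_hybrid(R, U, V, user_net, item_net, eta=0.004, lam=0.01, epochs=20):
    """Alternating optimization sketch for the hybrid objective in Eq. (3)."""
    rows, cols = R.nonzero()
    for _ in range(epochs):
        # Step 1: fix the aSDAE weights and update u_i, v_j by SGD.
        for i, j in zip(rows, cols):
            err = R[i, j] - U[i] @ V[j]
            # reconstruct_grad returns the gradient of the aSDAE
            # reconstruction terms with respect to the latent factor.
            gu = -err * V[j] + lam * U[i] + user_net.reconstruct_grad(i, U[i])
            gv = -err * U[i] + lam * V[j] + item_net.reconstruct_grad(j, V[j])
            U[i] -= eta * gu
            V[j] -= eta * gv
        # Step 2: fix U and V, update the aSDAE weights by back-propagation.
        user_net.backprop_step(U)
        item_net.backprop_step(V)
    return U, V
```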

Note that we set α_1 equal to α_2 in Equation (3) for simplicity. Moreover, given U and V, we can learn the weight matrices and biases of each layer using the popular back-propagation algorithm. By alternating the updates of the variables, a local optimum of L can be found. Nevertheless, common techniques such as a momentum term can be used to alleviate the local optimum problem.

Prediction

After the latent factors for each user and item are learned, we approximate the predicted rating R̂_ij as R̂_ij ≈ u_i v_j^T, and then a list of ranked items is generated for each user based on these predicted ratings.

Experiments

In this section, we evaluate the performance of our hybrid model on three real-world datasets from different domains, and compare it with four state-of-the-art algorithms.

Datasets

We use three datasets from different real-world domains, two from MovieLens and one from Book-Crossing, for our experiments. The first two, MovieLens-100K and MovieLens-1M, are commonly used for evaluating the performance of recommender systems (Wang, Shi, and Yeung 2015; Li, Kawale, and Fu 2015). The MovieLens-100K dataset contains 100K ratings from 943 users on 1682 movies, and the MovieLens-1M dataset contains more than 1 million ratings from 6040 users on 3706 movies. Each rating is an integer between 1 and 5. We binarize the explicit data by keeping ratings of four or higher and interpreting them as implicit feedback. As a result, MovieLens-1M is much sparser, with only 2.57% of its user-item matrix entries containing ratings, while MovieLens-100K has ratings in 3.49% of its user-item matrix entries. Moreover, we extract the user and item information provided by the datasets to construct the additional matrices X and Y respectively. To summarize, the user side information contains the user's ID, age, gender, occupation and zip code, encoded into a binary-valued vector of length 1943. Identically, the item side information contains the item's title, release date and 18 categories of movie genre, encoded into a binary-valued vector of length 1822.

The last dataset, Book-Crossing, contains 1,149,780 ratings from 278,858 users on 271,379 books. Ratings are expressed on a scale from 0 to 10, with higher values denoting higher appreciation. We binarize the explicit data by keeping ratings of six or higher and interpreting them as implicit feedback. This leads to a user-item matrix with a sparsity of 99.99%. Some attributes of users and books are also provided in this dataset. The user and item additional matrices are generated as for the above datasets, and the lengths of the two binary vectors are 1973 and 3679.

Evaluation Metric

We employ the root mean squared error (RMSE) as one of the evaluation metrics:

RMSE = sqrt( (1/|T|) Σ_{R_ij ∈ T} (R_ij − R̂_ij)^2 ),

where R_ij is the rating of user i on item j, R̂_ij denotes the corresponding predicted rating, T is the test set and |T| is the total number of ratings in the test set.

Similar to (Wang, Wang, and Yeung 2015; Wang and Blei 2011), we use recall as another evaluation metric, since the rating information is in the form of implicit feedback (Hu, Koren, and Volinsky 2008; Rendle et al. 2009). Specifically, another common metric, precision, is not suited to implicit feedback, because a zero rating in the user-item matrix may mean either that the user is not interested in the item or that the user is unaware of it. To evaluate our hybrid model, we sort the predicted ratings of all the items for each user, and then recommend the top K items to each user. The recall@K for each user is defined as:

recall@K = (number of items the user likes in top K) / (total number of items the user likes).

The final metric result is the average recall over all users.
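Both metrics are straightforward to compute from the learned factors. Below is a small sketch (ours) for dense numpy rating matrices, using R̂_ij = u_i · v_j as in the Prediction section; the exclusion of training items from the ranking is a standard convention we assume here.

```python
import numpy as np

def rmse(R_test, U, V):
    """RMSE over the held-out (nonzero) entries of R_test."""
    rows, cols = R_test.nonzero()
    preds = np.sum(U[rows] * V[cols], axis=1)      # u_i . v_j per test pair
    return float(np.sqrt(np.mean((R_test[rows, cols] - preds) ** 2)))

def recall_at_k(R_train, R_test, U, V, K=50):
    """Average recall@K: rank items by predicted rating for each user."""
    recalls = []
    for i in range(R_train.shape[0]):
        liked = set(R_test[i].nonzero()[0])        # test items the user likes
        if not liked:
            continue
        scores = U[i] @ V.T
        scores[R_train[i].nonzero()[0]] = -np.inf  # exclude already-rated items
        top_k = np.argsort(-scores)[:K]
        recalls.append(len(liked.intersection(top_k)) / len(liked))
    return float(np.mean(recalls))
```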
Baselines and Parameter Settings

In order to evaluate the performance of our model, we compare it with the following recommendation algorithms:

• PMF. Probabilistic Matrix Factorization (Salakhutdinov and Mnih 2011) factorizes the user-item matrix into user and item factors, assuming Gaussian observation noise and Gaussian priors on the latent factor vectors.

• CMF. Collective Matrix Factorization (Singh and Gordon 2008) simultaneously factorizes multiple sources, including the user-item matrix and matrices containing the additional side information.

• CDL. Collaborative Deep Learning (Wang, Wang, and Yeung 2015) is a hierarchical deep Bayesian model which performs deep representation learning for the item information and collaborative filtering for the user-item matrix.

• DCF. Deep Collaborative Filtering (Li, Kawale, and Fu 2015) combines PMF with marginalized denoising stacked autoencoders to achieve recommendation.

• Ours. Our approach as described above: a hybrid collaborative filtering model which unifies our aSDAE model with matrix factorization.

For all the compared models, we train each method with different percentages (60%, 80% and 95%) of the ratings. We randomly select the training set from each dataset and use the remaining data as the test set. We repeat the evaluation five times with different randomly selected training sets, and report the average performance. For our hybrid model, we set the parameters α, β and λ to 0.2, 0.8 and 0.01, respectively. The learning rate η used in the SGD algorithm is set to 0.004. Similar to (Wang, Wang, and Yeung 2015), we use masking noise with a noise level of 0.3 to obtain the corrupted inputs from the raw inputs. In terms of the deep network architecture, the number of layers is set to 4 in our experiments, and the dimensionality of the learned latent factors for users and items is set to 64.
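For reference, the settings listed above can be collected in one place; the dictionary layout below is our own convenience, only the values come from the text.

```python
# Hyper-parameters as reported above; the structure is illustrative only.
EXPERIMENT_CONFIG = {
    "train_ratios": [0.60, 0.80, 0.95],  # random splits, averaged over 5 runs
    "alpha": 0.2,                        # trade-off parameter in Eq. (3)
    "beta": 0.8,
    "lambda": 0.01,                      # regularization strength
    "eta": 0.004,                        # SGD learning rate
    "masking_noise": 0.3,                # corruption level for aSDAE inputs
    "num_layers": 4,                     # depth L of each aSDAE
    "latent_dim": 64,                    # dimensionality k of u_i and v_j
}
```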

Summary of Experimental Results

Table 1: Average RMSE of the compared models with different percentages of training data on the three datasets.

            MovieLens-100K          MovieLens-1M            Book-Crossing
Model     60%     80%     95%     60%     80%     95%     60%     80%     95%
PMF     0.7024  0.5941  0.5673  0.6966  0.5715  0.5415  1.7534  1.2453  1.1896
CMF     0.6986  0.5881  0.5454  0.6758  0.5709  0.5382  1.5467  1.0563  0.9715
CDL     0.6601  0.5667  0.5213  0.6546  0.5435  0.5221  1.4465  0.9921  0.9652
DCF     0.6516  0.5516  0.5135  0.6635  0.5467  0.5335  1.3413  0.9784  0.9448
Ours    0.6436  0.5435  0.5079  0.6449  0.5236  0.5023  1.3206  0.9579  0.9244

Table 1 shows the average RMSE of PMF, CMF, CDL, DCF and our hybrid model with different percentages of training data on the three datasets. We can observe from Table 1 that CMF, CDL, DCF and our hybrid model achieve better performance than PMF, which demonstrates the effectiveness of incorporating additional side information. Moreover, CDL, DCF and our hybrid model outperform PMF and CMF; that is, deep structures can extract better-quality features from the side information. Furthermore, our hybrid model obtains lower RMSE than CDL and DCF, which validates the strength of the latent factor vectors learned by our aSDAE models. The RMSE metric therefore demonstrates the effectiveness of our hybrid model.

Figure 3: Performance comparison of PMF, CMF, CDL, DCF and Ours based on recall@K for the three datasets. (a) Recall on MovieLens-100K; (b) Recall on MovieLens-1M; (c) Recall on Book-Crossing.

Figure 3 shows the recall results comparing PMF, CMF, CDL, DCF and our hybrid model on the three datasets. We can see that PMF is the worst model because of its lack of additional side information. Moreover, CMF performs worse than CDL, DCF and our hybrid model. This may be related to the discussion in (Agarwal, Chen, and Long 2011): when the side information is sparse, CMF may not work well. Figure 3 also shows that our hybrid model achieves much better performance than CDL and DCF, as it takes advantage of our aSDAE models. Consequently, by seamlessly combining our aSDAE models for the additional side information with matrix factorization for the user-item rating matrix, our hybrid model can handle both the sparse user-item rating matrix and the sparse side information much better, learn much more effective latent factors for each user and item, and hence provide more accurate recommendations.

Conclusion

In this paper, we presented a hybrid collaborative filtering model which bridges our aSDAE and matrix factorization. Our hybrid model can learn effective latent factors for users and items from both the user-item rating matrix and the side information. Moreover, the proposed deep learning model, aSDAE, is a variant of SDAE and can efficiently integrate side information into the learned latent factors. Our experimental results show that our hybrid model outperforms four other state-of-the-art algorithms. As part of future work, we will investigate other deep learning models to replace aSDAE for boosting performance further, e.g., recurrent neural networks and convolutional neural networks.

References

Agarwal, D.; Chen, B.-C.; and Long, B. 2011. Localized factor models for multi-context recommendation. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 609–617. ACM.

Chen, M.; Xu, Z.; Weinberger, K.; and Sha, F. 2012. Marginalized denoising autoencoders for domain adaptation. arXiv preprint arXiv:1206.4683.

Glorot, X.; Bordes, A.; and Bengio, Y. 2011. Domain adaptation for large-scale sentiment classification: A deep learning approach. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), 513–520.

Hinton, G. E., and Salakhutdinov, R. R. 2006. Reducing the dimensionality of data with neural networks. Science 313(5786):504–507.

Hinton, G. E.; Osindero, S.; and Teh, Y.-W. 2006. A fast learning algorithm for deep belief nets. Neural Computation 18(7):1527–1554.

Hu, Y.; Koren, Y.; and Volinsky, C. 2008. Collaborative filtering for implicit feedback datasets. In 2008 Eighth IEEE International Conference on Data Mining, 263–272. IEEE.

Kavukcuoglu, K.; Fergus, R.; LeCun, Y.; et al. 2009. Learning invariant features through topographic filter maps. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2009), 1605–1612. IEEE.

Koren, Y.; Bell, R.; Volinsky, C.; et al. 2009. Matrix factorization techniques for recommender systems. Computer 42(8):30–37.

Lang, K. 1995. NewsWeeder: Learning to filter netnews. In Proceedings of the 12th International Conference on Machine Learning, 331–339.

Lee, D. D., and Seung, H. S. 2001. Algorithms for non-negative matrix factorization. In Advances in Neural Information Processing Systems, 556–562.

Lee, H.; Pham, P.; Largman, Y.; and Ng, A. Y. 2009. Unsupervised feature learning for audio classification using convolutional deep belief networks. In Advances in Neural Information Processing Systems, 1096–1104.

Li, S.; Kawale, J.; and Fu, Y. 2015. Deep collaborative filtering via marginalized denoising auto-encoder. In Proceedings of the 24th ACM International Conference on Information and Knowledge Management, 811–820. ACM.

Nickel, M.; Tresp, V.; and Kriegel, H.-P. 2011. A three-way model for collective learning on multi-relational data. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), 809–816.

Porteous, I.; Asuncion, A. U.; and Welling, M. 2010. Bayesian matrix factorization with side information and Dirichlet process mixtures. In AAAI.

Rendle, S.; Freudenthaler, C.; Gantner, Z.; and Schmidt-Thieme, L. 2009. BPR: Bayesian personalized ranking from implicit feedback. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, 452–461. AUAI Press.

Rifai, S.; Vincent, P.; Muller, X.; Glorot, X.; and Bengio, Y. 2011. Contractive auto-encoders: Explicit invariance during feature extraction. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), 833–840.

Salakhutdinov, R., and Mnih, A. 2008. Bayesian probabilistic matrix factorization using Markov chain Monte Carlo. In Proceedings of the 25th International Conference on Machine Learning, 880–887. ACM.

Salakhutdinov, R., and Mnih, A. 2011. Probabilistic matrix factorization. In NIPS, volume 20, 1–8.

Salakhutdinov, R.; Mnih, A.; and Hinton, G. 2007. Restricted Boltzmann machines for collaborative filtering. In Proceedings of the 24th International Conference on Machine Learning, 791–798. ACM.

Shen, Y.; He, X.; Gao, J.; Deng, L.; and Mesnil, G. 2014. A latent semantic model with convolutional-pooling structure for information retrieval. In Proceedings of the 23rd ACM International Conference on Information and Knowledge Management, 101–110. ACM.

Shi, Y.; Larson, M.; and Hanjalic, A. 2014. Collaborative filtering beyond the user-item matrix: A survey of the state of the art and future challenges. ACM Computing Surveys (CSUR) 47(1):3.

Singh, A. P., and Gordon, G. J. 2008. Relational learning via collective matrix factorization. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 650–658. ACM.

Srebro, N.; Rennie, J.; and Jaakkola, T. S. 2004. Maximum-margin matrix factorization. In Advances in Neural Information Processing Systems, 1329–1336.

Su, X., and Khoshgoftaar, T. M. 2009. A survey of collaborative filtering techniques. Advances in Artificial Intelligence 2009:4.

Van den Oord, A.; Dieleman, S.; and Schrauwen, B. 2013. Deep content-based music recommendation. In Advances in Neural Information Processing Systems, 2643–2651.

Vincent, P.; Larochelle, H.; Bengio, Y.; and Manzagol, P.-A. 2008. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th International Conference on Machine Learning, 1096–1103. ACM.

Vincent, P.; Larochelle, H.; Lajoie, I.; Bengio, Y.; and Manzagol, P.-A. 2010. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of Machine Learning Research 11(Dec):3371–3408.

Wang, C., and Blei, D. M. 2011. Collaborative topic modeling for recommending scientific articles. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 448–456. ACM.

Wang, X., and Wang, Y. 2014. Improving content-based and hybrid music recommendation using deep learning. In Proceedings of the 22nd ACM International Conference on Multimedia, 627–636. ACM.

Wang, H.; Shi, X.; and Yeung, D.-Y. 2015. Relational stacked denoising autoencoder for tag recommendation. In AAAI, 3052–3058.

Wang, H.; Wang, N.; and Yeung, D.-Y. 2015. Collaborative deep learning for recommender systems. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1235–1244. ACM.
