Arxiv:1606.07659V3 [Cs.LG] 29 Dec 2017 Rate This Item
Total Page:16
File Type:pdf, Size:1020Kb
Hybrid Recommender System based on Autoencoders Florian Strub Jer´ emie´ Mary Romaric Gaudel fl[email protected] [email protected] [email protected] Univ. Lille, CNRS, Centrale Lille, Inria Univ. Lille, CNRS, Centrale Lille, Inria Univ. Lille, CNRS, Centrale Lille, Inria Abstract systems [Wen et al., 2016] to temporal user-item interac- tions [Dai et al., 2016] through heterogeneous classification Proficient Recommender Systems heavily rely on applied to the recommendation setting [Cheng et al., 2016], Matrix Factorization (MF) techniques. MF aims few attempts have been done to use Neural Networks (NN) in at reconstructing a matrix of ratings from an in- CF. Deep Learning key successes have mainly been on fully complete and noisy initial matrix; this prediction is observable data [LeCun et al., 2015] while CF relies on in- then used to build the actual recommendation. Si- put with missing values. This constraint has received less at- multaneously, Neural Networks (NN) met tremen- tention and remains a challenging problem for NN. Handling dous successes in the last decades but few attempts missing values for CF has first been tackled by Salakhutdi- have been made to perform recommendation with nov et al. [Salakhutdinov et al., 2007] for the Netflix chal- autoencoders. In this paper, we gather the best lenge by using Restricted Boltzmann Machines. More recent practice from the literature to achieve this goal. works train autoencoders to perform CF tasks on explicit data We first highlight the link between these autoen- [Sedhain et al., 2015; Strub et al., 2016] and implicit data coder based approaches and MF. Then, we refine [Zheng et al., 2016]. Other approaches rely on recurrent net- the training approach of autoencoders to handle in- works [Wang et al., 2016]. Although those methods report complete data. Second, we design an end-to-end excellent results, they ignore the cold start scenario and pro- system which handles external information. Fi- vide little insights. nally, we empirically evaluate these approaches on The key contribution of this paper is to collect the best the MovieLens and Douban dataset. practices of autoencoder approaches in order to standardize the use of autoencoders for CF tasks. We first highlight autoencoders perform a factorization of the matrix of rat- 1 Introduction ings. Then we contribute to an end-to-end autoencoder based Recommendation systems advise users on which items approach in two points: (i) we introduce a proper training (movies, music, books etc.) they are more likely to be inter- loss/process of autoencoders on incomplete data, (ii) we in- ested in. A good recommendation system may dramatically tegrate the side information to autoencoders to alleviate the increase the number of sales of a firm or retain customers. cold start problem. Compared to previous attempts in that For instance, 80% of movies watched on Netflix come from direction [Salakhutdinov et al., 2007; Sedhain et al., 2015; the recommender system of the company [Gomez-Uribe and Wu et al., 2016], our framework integrates both ratings and Hunt, 2015]. Collaborative Filtering (CF) aims at recom- side information into a unique network. This joint model mending an item to a user by predicting how a user would leads to a scalable and robust approach which beats state- of-the-art results in CF. Reusable source code is provided in arXiv:1606.07659v3 [cs.LG] 29 Dec 2017 rate this item. To do so, the feedback of one user on some items is combined with the feedback of all other users on all Lua/Torch to reproduce the results. items to predict a new rating. For instance, if someone rated The paper is organized as follows. First, Sec. 2 fixes the a few books, CF objective is to estimate the ratings he would setting and highlights the link between autoencoders and ma- have given to thousands of other books by using the ratings of trix factorization in the context of CF. Then, Sec. 3 describes all the other readers. The most successful approach in CF is to our model. Finally, Sec. 4 details several experimental results factorize an incomplete matrix of ratings [Koren et al., 2009; from our approach. Zhou et al., 2008]. This approach simultaneously learns a representation of users and items that encapsulate the taste, 2 State of the art genre, writing styles, etc. A well-known limit of this ap- proach is the cold start setting: how to recommend an item to 2.1 Matrix Factorization based Collaborative a user when few rating exists for either the user or the item? Filtering While Deep Learning methods start to be used for several One of the most successful approach of CF consists of com- scenarios in recommendation system that goes from dialogue pleting the matrix of ratings through Matrix Factorization (MF) [Koren et al., 2009]. Given N users and M items, we vectors. The task with missing data is all the more difficult as th th denote rij the rating given by the i user for the j item. the missing data have an impact on both input and target vec- It entails an incomplete matrix of ratings R 2 RN×M . MF tors. Miranda et al. [Miranda et al., 2012] study the impact N×M of missing values on autoencoders with industrial data. Yet, aims at finding a rank k matrix Rb 2 R which matches known values of R and predicts unknown ones. Typically, they only have 5% of missing values in her dataset, whereas CF tasks usually have more than 95% of missing values. R = UVT with U 2 N×k, V 2 M×k and (U ,V) the b R R [ et al. ] solution of Autorec Sedhain , 2015 handles the missing values by associating one autoencoder per sample, whose input size X T 2 2 2 arg min (rij − ui vj) + λ(kuikF + kvjkF ); matches the number of known values. The corresponding U;V (i;j)2K(R) weight matrices are shared among the autoencoders. Even if this approach is intellectually appealing, it faces techni- where K(R) is the set of indices of known ratings of R,(ui, cal limitations. First, sharing weights among networks pre- vj) are column-vectors of the low rank rows of (U, V) and vent efficient computations (especially on GPU). Secondly, it k:kF is the Frobenius norm. In the following, ri;: and r:;j will prevents gradient optimization methods such as momentum, respectively be the i-th row and j-th column of R. weight decay, etc. from being applied to missing data. In the rest of the paper, we introduce a new method based on a sin- 2.2 Autoencoder based Collaborative Filtering gle autoencoder to fully regain the benefits of NN techniques. Recently, autoencoders have been used to handle CF prob- Although this architecture is mathematically equivalent to a lems [Sedhain et al., 2015; Strub et al., 2016]. Autoencoders mixture of autoencoders [Sedhain et al., 2015], it turns out to are NN popularized by Kramer [Kramer, 1991].They are un- be a more flexible framework. supervised networks where the output of the network aims CF systems also face the cold start problem. The main at reconstructing the input. The network is trained by back- solution is to supplant the lack of ratings by integrating side propagating the squared error loss on the output. More specif- information. Some approaches [Burke, 2002] mix CF with a ically, when the network limits itself to one hidden layer, its second system based only on side information. Recent work output is tends to incorporate the side information into the matrix com- [ def pletion by modifying the training error Adams et al., 2010; nn(x) = σ(W2σ(W1x + b1) + b2); Chen et al., 2012; Rendle, 2010; Porteous et al., 2010]. In this line of research, some papers incorporate NN embedding with x 2 N the input, W 2 k×N and W 2 N×k the R 1 R 2 R on side information. For instance, [Wang et al., 2014] re- weight matrices, b 2 k and b 2 N the bias vectors, and 1 R 2 R spectively auto-encode bag-of-words from movie plots, [Li et σ(:) a non-linear transfer function. al., 2015] auto-encode heterogeneous side information from In the context of CF, the autoencoder is fed with incom- users and items. Finally, [Wang and Wang, 2014] uses con- plete rows r (resp. columns r ) of R [Sedhain et al., 2015; i;: :;j volutional networks on music samples. From the best of our Strub et al., 2016]. It then outputs a vector ^r (resp. ^r ) i;: :;j knowledge, training a NN from end to end on both ratings which predict the missing entries. Note that these approaches and heterogeneous side information has never been done. In perform a non-linear low-rank approximation of R. Using Sec. 3.2, we build such NN by integrating side information MF notations and assuming that the network works on rows into our CF system. ri;: of R, we recover a predicted vector ^ri;: of the form ^r = σ (Vu ): i;: i 3 End-to-End Collaborative Filtering with 0 1 Autoencoders B σ(W1ri;: + b1) C ^ri;: = nn(ri;:) = σ B [W2 IN ] C : Our approach builds upon an autoencoder to predict full vec- B b2 C @ | {z } A tors ^ri;:/^r:;j from incomplete vectors ri;:/r:;j. As in [Salakhut- V | {z } dinov and Mnih, 2008; Sedhain et al., 2015; Strub et al., ui 2016], we define two types of autoencoders: U-CFN is de- The activation of the hidden units of the autoencoder ui itera- fined as ^ri;: = nn(ri;:) and predicts the missing ratings given tively builds the low rank matrix U.