Infusing Collaborative Recommenders with Distributed Representations

Greg Zanotti (1), Miller Horvath (2), Lucas Nunes Barbosa (2)*, Venkata Trinadh Kumar Gupta Immedisetty (1), Jonathan Gemmell (1)

(1) Center for Web Intelligence, School of Computing, DePaul University, Chicago, IL, USA
(2) CAPES Foundation, Ministry of Education of Brazil, Brasília, DF, Brazil
* Corresponding Author

gzanotti, millerhorvath, lnbarbosa, vimmedis, [email protected]

ABSTRACT
Recommender systems assist users in navigating complex information spaces and focus their attention on the content most relevant to their needs. Often these systems rely on user activity or descriptions of the content. Social annotation systems, in which users collaboratively assign tags to items, provide another means to capture information about users and items. Each of these data sources provides unique benefits, capturing different relationships.
In this paper, we propose leveraging multiple sources of data: ratings data as users report their affinity toward an item, tagging data as users assign annotations to items, and item data collected from an online database. Taken together, these datasets provide the opportunity to learn rich distributed representations by exploiting recent advances in neural network architectures. We first produce representations that subjectively capture interesting relationships among the data. We then empirically evaluate the utility of the representations to predict a user's rating on an item and show that they outperform more traditional representations. Finally, we demonstrate that traditional representations can be combined with representations trained through a neural network to achieve even better results.

CCS Concepts
• Information systems → Collaborative filtering; Social recommendation; Social tagging; • Computing methodologies → Neural networks; Ensemble methods;

Keywords
Collaborative filtering; neural networks; recommender systems; distributed representations

1. INTRODUCTION
Modern websites provide a myriad of conveniences. Users can browse news articles, watch movies, download books or post their latest vacation pictures. The sheer volume of available content can be overwhelming, and users often suffer from information overload. Consequently, websites often offer some level of personalization.
By personalizing a user's interaction with the website, the user's attention can be focused on the content most related to his or her interests. Personalization might be achieved by allowing the user to manually select topics of interest. For example, a visitor to a news aggregation site might select "soccer" and "finance" from a list of available topics. Other techniques identify a user's interests by passively observing their activity on the site.
Recommender systems identify relevant content for a user based on interests they have exhibited in the past. For example, users that have rated popular science fiction movies highly in the past might be recommended the newest sci-fi blockbuster.
Recommendation algorithms often work by computing the similarity between users or between items. To compute this similarity, a representation is needed. Users can be represented as a vector over the item space. Likewise, items are often represented as the users that have consumed them. Representing users and items in this way has proven to be useful. The representations are simple to construct and robust when built from large amounts of data. However, their inherent simplicity means they fail to capture richer semantic relationships.
In this work, we construct alternative user and item representations learned through a neural network.
The resulting representations have several benefits. First, they are relatively quick to create when compared to other neural network techniques. Second, they capture meaningful semantic information. Third, by exploiting this semantic information, recommender systems can better predict a user's interests.
Our experiments on a real-world dataset first offer subjective evidence that distributed representations capture meaningful semantic relationships. These representations can even be manipulated to produce new representations. For example, after computing the representations for users, movies, directors, actors and tags in the movie domain, we observe that "Dr. No" minus "Sean Connery" plus "Roger Moore" produces a new representation most closely related to the representations of Roger Moore's Bond films, such as "Octopussy". The representations appear to capture the notion that Moore replaced Connery in the Bond franchise.
Our evaluation then turns to the utility of incorporating these representations into a recommender system. Experiments conducted on the MovieLens 10M Dataset and the Internet Movie Database show that distributed representations provide useful information to the system and improve the root mean square error of predicted ratings.
The rest of this paper is organized as follows. In Section 2 we present related work. Section 3 describes how we learn the distributed representations. We discuss how we incorporate these representations in a recommender engine in Section 4. Our experimental evaluation is presented in Section 5 and our experimental results in Section 6. We conclude the paper with a discussion of our results and directions for future work in Section 7.
2. RELATED WORK
Collaborative filtering algorithms work on the assumption that users that have behaved similarly in the past are likely to exhibit similar behavior in the future. User-based collaborative filtering [15, 30] works by identifying similar users and recommending items these neighbors have liked. Alternatively, item-based collaborative filtering [5, 28] identifies items similar to those the user has liked in the past. Content-based recommender systems [1, 23] model users and items in a content space, such as the terms in a newspaper article or the meta-data of a movie.
In social annotation systems, users describe the content of online resources by collaboratively assigning tags. For example, a user might annotate Dr. No with "Spy". Taken together, the annotations of many users create a rich description of users and items. In such systems, recommenders can promote items [7], tags [6, 13] or even other users [37]. Tags have also been used in models to predict ratings [29, 38] and have been integrated into traditional collaborative filtering [32] and content-based recommenders [4].
An alternative way to learn features about items and users is to extract them from usage data. Latent features extracted via principal component analysis [14] or singular value decomposition [9] have benefited recommender systems by increasing scalability and improving accuracy [16, 22].
Recent progress in natural language processing and neural networks [18] has shown that it is possible to learn distributed representations of words [11] that capture important semantic information. For example, when trained on the Broadcast News dataset, the vector for "Madrid" minus "Spain" plus "France" is closer to "Paris" than to any other learned representation [17, 19]. The notion of capital cities seems to be encoded in the representations.
These techniques have been extended to social networks [24], knowledge graphs [35], sentiment analysis [36] and opinion mining [12].
Hybrid recommendation systems [3] combine two or more different approaches. They often improve the accuracy of the recommendations, such as when recommending television programs [2] or online products [31].
This paper extends upon these previous efforts in recommender systems, distributed representations, and hybrid algorithms. Deep learning and neural networks have been applied to recommender systems before; for example, in [34] the authors propose a deep learning-based framework to fuse content and ratings information, while [33] uses a convolutional neural network to create a content-based music recommender. Our model learns distributed representations for users, movies, directors, actors, and tags: contextual information for which representations are useful in their own right (e.g., for finding actors that a user may like, or for recommending tags). These representations can be used in place of those commonly employed in user-based or item-based recommenders, making it easy to adapt existing recommender systems to our method. We then combine the traditional representations with the new representations in a hybrid recommender system.

3. DISTRIBUTED REPRESENTATIONS
Distributed representations have a long history of use in machine learning due to their generality and flexibility in representing higher-dimensional data [11]. The process of creating a distributed representation may also add meaningful constructive properties to elements in the transformed vector space. Elements from the space can then be used as high-quality lower-dimensional features in other learning problems.
Recently, developments in continuous space language models have produced effective methods to provide meaningful distributed representations of words and related concepts in the domain of natural language processing. In particular, models utilizing a single hidden layer neural network architecture with a specially designed structure have been shown to be efficient and accurate on a variety of word similarity tasks [17, 18]. These architectures avoid nonlinear hidden layers, thereby significantly reducing training and inference complexity. The distributed representations produced by these models are of relatively low dimension, create meaningful neighborhoods of words, and have useful compositional properties that capture semantic features [19]. Two models in particular are used to produce these representations: the Continuous Bag of Words model and the Skip-gram model.

3.1 Continuous Bag of Words
The Continuous Bag of Words model (CBOW) is a neural network composed of an input, projection, and output layer [17]. Input words are represented using one-hot encoding, where the dimensionality of the input vector is the number of words in the vocabulary, V. The projection layer is of dimensionality R, where R is the dimensionality of the target distributed representation. The input to projection layer weight matrix (of dimensionality V × R) is shared across input words, so order does not matter; all words are projected into the same space. The output layer is of dimension V.
The training of a CBOW model is optimized to predict the one-hot representation of a target word w_k given the surrounding context words in a sentence, w_{k−c}, ..., w_{k−1}, w_{k+1}, ..., w_{k+c}, where c is a fixed window size, generally on the order of a small positive integer. Each word in a sentence is sequentially treated as a target word. Multiple sentences comprise the data used to train a CBOW model. The flow of data is described in Figure 1. The one-hot encodings of the context words are averaged and multiplied by the projection matrix to form the hidden layer activations.

[Figure 1: Architecture for the Continuous Bag of Words (left) and Skip-gram (right) models. The first predicts the distributed representation of a word based on its neighbors. The second predicts the distributed representations of neighbors given a word.]

The model is trained by optimizing the likelihood function formed by feeding the output layer activations into the softmax function below, where w_c is the collection of context words and σ(k) is the output activation for the kth word in the vocabulary:

    Pr(w_k | w_c) = exp(σ(k)) / Σ_{n=1}^{V} exp(σ(n))    (1)

The error vector is produced by differencing the one-hot encoding and the vector produced by the softmax function. Training then proceeds by backpropagation, generally using a method like stochastic gradient descent for optimization [26, 17].

3.2 Continuous Skip-gram
The Skip-gram model, also described in Figure 1, is structured in a way similar to the CBOW model, except the inferential task is reversed: given the target word w_k, the Skip-gram model is optimized to predict the surrounding context words. The dimensionalities of the output layer and input layer are switched, and the hidden to output layer weight matrix shares weights for every output word vector representation. The softmax function acts on the output layer activations for each context word w_{c,i} fed into the network. It takes the following form, where w_{c,i} is the ith context word for the target word w_k, and σ(c, i) is the activation corresponding to w_{c,i}:

    Pr(w_{c,i} | w_k) = exp(σ(c, i)) / Σ_{m=1}^{V} exp(σ(m))    (2)

The error vector for a set of output activations is computed by summing the difference vectors for each word activation in the output layer.

3.3 Efficient Update Methods
Various schemes have been used to address the significant computational overhead of calculating the updates for each of the output vector representations in the hidden to output weight matrix every time a word is run through a CBOW or Skip-gram model. Two are commonly used in practice: hierarchical softmax and negative sampling [17]. Each method has a unique effect on the accuracy of the neural network, and some empirical work has been done toward understanding which one performs best in a given predictive task [18].
Hierarchical softmax uses a binary tree to efficiently represent words in a vocabulary [21, 20]. Instead of using a matrix of vectors to represent words, each word is represented as a path from the root to a unique leaf in the tree, and normalization can be computed in O(log V) time. Similar words are clustered using simple word-feature algorithms in order to make updates computationally efficient.
Negative sampling is inspired by noise contrastive estimation [10]. It deals with the update problem by only updating a sample of the output vectors: the updated word, along with a number of negative samples. A noise distribution, P_n(w), is used to provide the samples for the current word vector representation being updated. An appropriate distribution is chosen empirically; for example, a unigram distribution raised to the power of 3/4 has been used in practice [18]. Updates for backpropagation are derived from the gradients of a loss function based on the noise distribution.
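For readers who wish to experiment, both models are available in off-the-shelf toolkits. The sketch below is illustrative rather than the authors' code; it assumes gensim 4.x and hypothetical token IDs, with each "artificial sentence" laid out as described later in Section 5.1 (user ID, movie ID, tags, director, leading actors).

    # Illustrative sketch only (assumes gensim 4.x); not the code used in the paper.
    from gensim.models import Word2Vec

    # Hypothetical artificial sentences: one per tagged rating event.
    sentences = [
        ["u_1", "m_dr_no", "t_spy", "d_terence_young", "a_sean_connery"],
        ["u_2", "m_octopussy", "t_bond", "d_john_glen", "a_roger_moore"],
        # ... one sentence per (user, movie) pair in the merged dataset
    ]

    common = dict(
        vector_size=100,   # R, the representation length (a tuned hyperparameter)
        window=5,          # context size c
        min_count=1,       # keep rare tokens such as one-off tags
        negative=10,       # negative sampling, as in Section 3.3
        ns_exponent=0.75,  # unigram noise distribution raised to the 3/4 power
        epochs=20,         # training passes, tuned on a held-out fold
    )
    cbow = Word2Vec(sentences, sg=0, **common)      # Continuous Bag of Words
    skipgram = Word2Vec(sentences, sg=1, **common)  # Skip-gram

    movie_vector = skipgram.wv["m_dr_no"]           # a learned distributed representation

Setting hs=1 instead of negative=... would switch to hierarchical softmax, the other update method discussed above.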
4. INFUSING RECOMMENDERS WITH DISTRIBUTED REPRESENTATIONS
In this section, we propose integrating distributed representations learned from the Continuous Bag of Words and Skip-gram models into a collaborative recommender system. We begin by presenting the user and item models. We then present the details of user-based and item-based collaborative filtering. We propose extensions to these algorithms to incorporate distributed representations. Finally, we describe how these techniques can be combined using a linear weighted hybrid model.

4.1 User and Item Models
There exist many ways to construct user representations in a web application, including recency, authority, linkage or vector space models. In this work we focus on the vector space representation [27] and describe users as:

    u = ⟨w(i_1), w(i_2), ..., w(i_n)⟩    (3)

where w(i_1) through w(i_n) represent a user's weights in the vector space. Such vector space representations offer the opportunity to compute the similarity between users or between items. In this work, we rely on cosine similarity.
When rated items are used to represent the user, these weights can be taken as the user's ratings for the items. Similarly, items can be represented as a vector over the user space, where the weights are taken as the ratings users have given the item:

    i = ⟨w(u_1), w(u_2), ..., w(u_m)⟩    (4)

In this work, we compute alternative user and item representations using the Continuous Bag of Words and Skip-gram models. A user representation takes the same form:

    u = ⟨w(f_1), w(f_2), ..., w(f_|F|)⟩    (5)

except the weights in the vector space are taken from the discovered representations, where |F| is the number of learned features in the representation. Item representations are drawn in a similar manner.

4.2 Predicting Ratings
User-based collaborative filtering (UBCF) is a commonly used recommendation algorithm. Given a user u and item i, a predicted rating p(u, i) is produced by identifying a neighborhood, N, of the k most similar users and leveraging their ratings, r(n, i), on the item:

    p(u, i) = [ Σ_{n∈N} sim(u, n) · r(n, i) ] / [ Σ_{n∈N} sim(u, n) ]    (6)

The influence of the neighbors in the final prediction is weighted by their similarity to the query user, with nearer neighbors receiving more importance.
Item-based collaborative filtering (IBCF) computes the similarity between items rather than between users. It predicts a user's rating for an item by exploiting ratings that the user gave to similar items:

    p(u, i) = [ Σ_{j∈J} sim(i, j) · r(u, j) ] / [ Σ_{j∈J} sim(i, j) ]    (7)

where J is the set of nearest items to the query item i.
We extend user-based and item-based collaborative filtering. The algorithms work as before, except that the user and item representations are substituted with those learned from the Continuous Bag of Words and Skip-gram models. When computing the similarity between users based on the Continuous Bag of Words representations, we abbreviate the model as UBCB. When representing users with the representations from the Skip-gram model, we abbreviate it as UBSG. Likewise, two extensions of item-based collaborative filtering, IBCB and IBSG, are produced by replacing the item representation with those learned from the Continuous Bag of Words and Skip-gram models.
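A minimal sketch of the user-based rule of Equation 6 follows, written so that the representation lookup is pluggable: passing rating vectors yields UBCF, while passing learned CBOW or Skip-gram vectors yields UBCB or UBSG. The data structures and names are hypothetical, not the authors' implementation.

    import numpy as np

    def cosine(a, b):
        # Cosine similarity between two representation vectors.
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        return float(np.dot(a, b) / denom) if denom else 0.0

    def predict_user_based(u, i, ratings, rep, k=100):
        # Equation 6: similarity-weighted average of the k most similar
        # raters' ratings on item i.  `ratings` maps (user, item) -> rating;
        # `rep` maps a user ID to its vector (rating profile for UBCF,
        # CBOW or Skip-gram representation for UBCB or UBSG).
        raters = [n for (n, j) in ratings if j == i and n != u]
        neighbors = sorted(raters, key=lambda n: cosine(rep[u], rep[n]),
                           reverse=True)[:k]
        num = sum(cosine(rep[u], rep[n]) * ratings[(n, i)] for n in neighbors)
        den = sum(cosine(rep[u], rep[n]) for n in neighbors)
        return num / den if den else 0.0

The item-based rule of Equation 7 is symmetric: neighbors are drawn from the items the user has rated, and sim(i, j) is computed over item representations.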
5. EXPERIMENTAL EVALUATION
In this section, we describe the data used in our experiments. We then discuss our methodology. Finally, we present our evaluation metric and baseline models.

5.1 Data
The MovieLens 10M dataset [25] is made available by the GroupLens Research Group. Released in 2009, it contains 10 million ratings and 100,000 tag applications applied to 10,000 movies by 72,000 users. After identifying movies in the MovieLens data, we extracted the actors, directors, and other metadata from the Internet Movie Database (IMDb, www.imdb.com).
In order to generate distributed representations, we first merged these two datasets to create artificial sentences. Each sentence included a user ID, a movie ID, the set of tags applied by the user to the movie, the director, and the leading actors. These sentences were input into the Continuous Bag of Words and Skip-gram models.

5.2 Methodology
Each user profile was divided equally across five partitions. Four folds were used as training data to build the recommenders. The fifth was used to tune the model hyperparameters, including the size of the user and item neighborhoods, the length of the distributed representations, the number of epochs to train the representations, and the optimal weights for the linear weighted hybrid. The CBOW and Skip-gram models were both trained using negative sampling.
The fifth fold was then discarded, so that the hyperparameters tuned on the fifth fold would not be used to create predictions for it. Four-fold cross-validation was then used with the tuned hyperparameters to create predictions for the remaining folds. The results were averaged across all four folds.

5.3 Evaluation Metric and Baseline
The recommenders were evaluated based on their ability to predict a user's affinity for a movie. Possible metrics include mean absolute error, recall or precision. In this work, we report results using root mean square error (RMSE).
For each example in the testing data T, consisting of a user u, item i and rating r(u, i), we pass the user and item to a recommender, which returns a prediction p(u, i). The RMSE is then computed as:

    RMSE = sqrt( Σ_{(u,i)∈T} (p(u, i) − r(u, i))² / n )    (8)

By squaring the difference between the observation and the prediction, RMSE amplifies and severely punishes large errors.
As baseline models, we use the predictions from the traditional UBCF and IBCF models outlined above.
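In code, Equation 8 is only a few lines. The sketch below uses hypothetical names, where `test` is a list of (user, item, rating) triples and `predict` is any of the recommenders described above.

    import math

    def rmse(test, predict):
        # Equation 8: root mean square error over the n test triples.
        # Squaring the residuals amplifies and severely punishes large errors.
        errors = [(predict(u, i) - r) ** 2 for (u, i, r) in test]
        return math.sqrt(sum(errors) / len(errors))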

Movies                        Directors              Actors                    Tags
Prisoner of Azkaban (0.92)    Alfonso Cuarón (0.71)  Pam Ferris (0.98)         best in franchise (0.90)
Order of the Phoenix (0.82)   David Yates (0.71)     Daniel Radcliffe (0.82)   love this movie so much (0.86)
Chamber of Secrets (0.79)     Kazuya Murata (0.67)   Harry Melling (0.81)      ignores established character (0.85)
Half-Blood Prince (0.75)      Walter Murch (0.66)    Jason Boyd (0.81)         potter (0.83)
Goblet of Fire (0.75)         Rod Hardy (0.65)       Richard Griffiths (0.81)  emma thompson (0.82)

Table 1: The top five most similar movies, directors, actors and tags, along with their similarity to the distributed representation of the movie "Harry Potter and the Philosopher's Stone".

Movies                       Directors                Actors                    Tags
Saving Private Ryan (0.70)   Steven Spielberg (0.79)  Tom Hanks (0.79)          best war cinematography (0.70)
Catch Me If You Can (0.69)   Lee Unkrich (0.54)       Russ Meyer (0.59)         adult diaper commercial (0.70)
The Terminal (0.67)          Robert Zemeckis (0.52)   Rebecca Williams (0.59)   fun but unrealistic (0.70)
Forrest Gump (0.60)          John Lasseter (0.51)     Stephen Ambrose (0.58)    tom hanks (0.69)
Lincoln (0.57)               Steve Purcell (0.51)     Alexander Godunov (0.58)  fellowship (0.68)

Table 2: The top five most similar movies, directors, actors and tags, along with their similarity to the resulting distributed representation of "Steven Spielberg" plus "Tom Hanks".

Movies                           Directors                     Actors                   Tags
Octopussy (0.80)                 John Glen (0.78)              Roger Moore (0.85)       desmond llewelyn (0.80)
For Your Eyes Only (0.78)        Peter Hunt (0.73)             Robert Davi (0.77)       setting circus (0.80)
— (0.77)                         Michael Damian (0.72)         Carey Lowell (0.76)      maud adams (0.80)
The Spy Who Loved Me (0.77)      Jean-Claude Van Damme (0.68)  Tanya Roberts (0.76)     kabir bedi (0.80)
A Princess for Christmas (0.77)  Harvey Hart (0.66)            Michael Lonsdale (0.76)  kristina wayborn (0.80)

Table 3: The top five most similar movies, directors, actors and tags, along with their similarity to the resulting distributed representation after computing "Dr. No" minus "Sean Connery" plus "Roger Moore".

6. EXPERIMENTAL RESULTS
In this section, we present our experimental results. We begin by presenting subjective evidence that the distributed representations are capturing relevant semantic information. We then demonstrate empirically that a recommender system infused with the semantic information provided by these distributed representations can produce superior results.

6.1 Capturing Semantic Information
By training distributed representations on the MovieLens and IMDb dataset, we hope to capture meaningful relationships between users, movies, directors, actors, and tags. Consider two similar movies. They may be similar with regard to the actors and directors of the movie, with regard to the tags users have assigned them, with regard to the users that have watched them, or with regard to all of these dimensions. If two movies are perceived by users to be similar, then their learned representations should also be similar.
For example, in Table 1 we have listed the five most similar movies, directors, actors and tags to "Harry Potter and the Philosopher's Stone" using representations learned from the Skip-gram model. The resulting movies are all from the Harry Potter franchise. Notice that these movies do not occur together in any training example, since each example contains a single movie. Alfonso Cuarón and David Yates directed most of these movies. The list of actors includes Daniel Radcliffe, the lead actor in the series, as well as supporting actors, Pam Ferris and Harry Melling, whose most popular roles occur in these movies. The resulting tags include subjective tags such as "love this movie so much", which are difficult to judge, but also relevant descriptive tags such as "potter" and "emma thompson".
Users are able to annotate movies with any tag they wish, including actors' names.
It is also possible to combine distributed representations to produce a new representation. In this way, representations can be retrieved that share characteristics from both of the originals. The ability to manipulate representations in this way might enable users to search through an information space by selecting multiple movies, directors or actors.
For example, we have added together the representations of the actor Tom Hanks and the director Steven Spielberg. The result is a new vector of the same dimension. The most similar movies, actors, directors, and tags are shown in Table 2. The first three movies were directed by Spielberg and starred Hanks. As the similarity decreases, we find Forrest Gump, starring Hanks, and Lincoln, directed by Spielberg. The nearest director to the addition of "Tom Hanks" and "Steven Spielberg" is, not surprisingly, Spielberg himself. Afterward, there is a sharp drop off in the similarity measure, from 0.79 to 0.54. The remaining directors have all worked with Hanks. Likewise, the most similar actor is Hanks. Again there is a sharp decline to the second most similar actor, 0.79 to 0.59. The remaining actors worked with either Hanks or Spielberg. The nearest tags are again subjective ("best war cinematography") and descriptive ("tom hanks").
The representations can also be combined in some surprising ways. Table 3 presents our final example. "Sean Connery" is subtracted from the popular James Bond movie "Dr. No". Then "Roger Moore", the actor who replaced Sean Connery in the Bond franchise, is added. The top four resulting movies are Bond films starring Roger Moore. John Glen and Peter Hunt have both directed Bond films. Not surprisingly, the representation for Roger Moore is the vector most similar to the new representation. The remaining actors have all appeared in Bond films. The resulting tags include the names of actors, including Desmond Llewelyn, best known for his role as Q in 17 of the James Bond films.
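With word2vec-style representations, queries such as these reduce to a nearest-neighbor search with positive and negative terms. A sketch, assuming the gensim `skipgram` model and the hypothetical token IDs from the Section 3 sketch:

    # Reproduce the Table 3 query against a trained model (illustrative only).
    # gensim combines the normalized vectors with the given signs and returns
    # the nearest tokens by cosine similarity.
    neighbors = skipgram.wv.most_similar(
        positive=["m_dr_no", "a_roger_moore"],
        negative=["a_sean_connery"],
        topn=5,
    )
    # Per Table 3, the nearest movies should be Roger Moore-era Bond
    # films such as "Octopussy".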

This subjective evidence suggests that the distributed representations are capturing interesting relationships between the movies, directors, actors, and tags. It is difficult to quantify this evidence as there is, for example, no definitive similarity between "Harry Potter and the Philosopher's Stone" and "Harry Potter and the Prisoner of Azkaban". However, user ratings offer the opportunity to empirically test the quality of the distributed representations. We now turn our attention to experimental results.

[Figure 2: Performance of the six recommendation algorithms and the linear weighted hybrid measured as Root Mean Square Error (RMSE) for neighborhoods of size 5, 10, 20, 50 and 100. The best performing individual recommender was the item-based recommender using the Continuous Bag of Words model (IBCB) with a neighborhood of 100, achieving an RMSE of 0.905. The linear weighted hybrid was able to reduce the RMSE to 0.858.]

6.2 Evaluating Predicted Ratings
Our experimental results are presented in Figure 2. Among the traditional recommender engines we use as our baselines, item-based collaborative filtering (IBCF) performed the best, achieving an RMSE of 1.01 when the size of the neighborhood is set to 5. In both the user-based and item-based recommender systems, smaller neighborhoods produce better results.
When user and item representations are replaced with representations learned through the Continuous Bag of Words model (UBCB and IBCB), the item-based approach again outperforms the user-based approach. Compared to IBCF, IBCB produces significantly better results, with an RMSE of 0.905 when the neighborhood size is set to 100. In both UBCB and IBCB, it appears a larger neighborhood offers better results.
The Skip-gram representations (UBSG and IBSG) improve upon the traditional approaches, but not as much as the Continuous Bag of Words representations. Once again, the item-based algorithm outperforms the user-based algorithm, achieving an RMSE of 0.956 for a neighborhood of size 100. Larger neighborhoods again appear to be preferred.
A comparison of these results offers several insights. First, the distributed representations appear not only to capture semantic information, as suggested above, but to represent the interests of users and the characteristics of items better than the commonly employed user-based and item-based collaborative filtering techniques.
Second, all item-based approaches outperform their user-based counterparts. This suggests that, in this domain at least, similarity between items is more informative. It may be that users watch several different types of movies and consequently create noisy profiles. In domains where users stick to narrower topics, such as when researchers bookmark journal articles, user-based systems might see an improvement.
Third, traditional approaches prefer small neighborhoods of users while the approaches utilizing the distributed representations prefer larger neighborhoods. This suggests that traditional user-based and item-based collaborative filtering algorithms struggle to find relevant neighbors and end up introducing noise into the prediction when the neighborhood size is increased. On the other hand, models incorporating distributed representations appear to benefit from large neighborhoods. It appears they are able to identify many relevant neighbors.
We now turn our attention to the linear weighted hybrid. It outperforms all other approaches, reducing the RMSE to as little as 0.858. The hybrid appears able to exploit the strengths of multiple approaches, achieving better results than any individual component can achieve alone.
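The linear weighted hybrid itself can be sketched as a weighted sum of the component predictions. The weights below are those reported in Table 4 (tuned on the held-out fold); the function names are hypothetical, and the weight-fitting step is omitted.

    # Linear weighted hybrid: a weighted sum of the six component predictions.
    # The weights are the tuned values of Table 4 (neighborhood size 100).
    WEIGHTS = {"UBCF": 0.032, "IBCF": 0.058, "UBCB": 0.144,
               "IBCB": 0.419, "UBSG": 0.122, "IBSG": 0.225}

    def hybrid_predict(u, i, components):
        # `components` maps a recommender name to its predict(u, i) function.
        return sum(w * components[name](u, i) for name, w in WEIGHTS.items())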
      UBCF   IBCF   UBCB   IBCB   UBSG   IBSG
α     0.032  0.058  0.144  0.419  0.122  0.225

Table 4: The weights assigned to each of the component recommenders, indicating their contribution to the linear weighted hybrid with a neighborhood of size 100.

The weights associated with the component recommenders are displayed in Table 4. It is not surprising that the best performing individual recommender, IBCB, receives the most weight, 0.419, in the hybrid. The second best performing component, IBSG, receives the next highest amount, 0.225. The two user-based recommenders relying on the learned distributed representations, UBCB and UBSG, receive modest weights. Little weight is given to either of the two traditional recommenders, UBCF and IBCF.
Several conclusions can be drawn from these observations. First, since the linear weighted hybrid is trained to optimize the RMSE of the predicted ratings and more weight is given to IBCB than any other component, it seems this model provides the most information. However, other components garner significant weights as well. While IBCB is superior to the others, it seems the others still have something to offer. The contribution of multiple approaches is what allows the hybrid to outperform the individual components.
Second, while it appears the Continuous Bag of Words model is better suited to predict ratings in this domain, the Skip-gram model also outperforms the traditional techniques. Moreover, since the linear weighted hybrid draws upon the Skip-gram models significantly (0.122 and 0.225), it suggests that the Skip-gram model is learning something that the Continuous Bag of Words model is not.
Finally, since so little weight is given to the models using traditional representations of users and items, it seems that they add little additional information to the hybrid. In short, the distributed representations contain most of the information of the traditional ones and more.

7. CONCLUSIONS
In this work, we proposed infusing traditional user-based and item-based collaborative recommender systems with representations learned by applying the Continuous Bag of Words and Skip-gram models to a dataset of users, movies, directors, actors and tags. We first presented collaborative filtering. We then presented two approaches for learning distributed representations and described how these representations could be exploited by collaborative recommenders.
Subjective evidence supports the notion that the discovered distributed representations are capturing interesting semantic relationships. Similar items can be retrieved even though they do not occur together in any training example. For example, the most similar movie to "Harry Potter and the Philosopher's Stone" was "Harry Potter and the Prisoner of Azkaban".
A rigorous experiment on a real-world dataset demonstrates that the Continuous Bag of Words and Skip-gram representations not only produce better rating predictions than traditional representations, but appear to subsume them, capturing all the information they do and more. When including these various techniques in a linear weighted hybrid, even better results can be achieved.
Future work will explore additional datasets. We are interested in whether these results hold for all domains or if different domains, such as music or news articles, are easier targets for traditional or Skip-gram representations. Finally, we plan to explore extensions of the Continuous Bag of Words and Skip-gram models themselves to ascertain whether or not the structure of ratings, tagging or content data can be exploited.

Acknowledgments
This work was supported in part by the Science Without Borders Program / CAPES, Coordination for the Improvement of Higher Education Personnel - Brazil.

8. REFERENCES
[1] M. Balabanović and Y. Shoham. Fab: content-based, collaborative recommendation. Communications of the ACM, 40(3):66–72, 1997.
[2] A. B. Barragáns-Martínez, E. Costa-Montenegro, J. C. Burguillo, M. Rey-López, F. A. Mikic-Fonte, and A. Peleteiro. A hybrid content-based and item-based collaborative filtering approach to recommend TV programs enhanced with singular value decomposition. Information Sciences, 180(22):4290–4311, 2010.
[3] R. Burke. Hybrid recommender systems: Survey and experiments. User Modeling and User-Adapted Interaction, 12(4):331–370, 2002.
[4] M. de Gemmis, P. Lops, G. Semeraro, and P. Basile. Integrating tags in a semantic content-based recommender. In Proceedings of the 2008 ACM Conference on Recommender Systems, RecSys '08, pages 163–170, New York, NY, USA, 2008. ACM.
[5] M. Deshpande and G. Karypis. Item-based top-N recommendation algorithms. ACM Transactions on Information Systems (TOIS), 22(1):143–177, 2004.
[6] J. Gemmell, T. Schimoler, B. Mobasher, and R. Burke. Hybrid tag recommendation for social annotation systems. In Proceedings of the 19th ACM International Conference on Information and Knowledge Management, CIKM '10, pages 829–838, New York, NY, USA, 2010. ACM.
[7] J. Gemmell, T. Schimoler, B. Mobasher, and R. Burke. Resource recommendation in social annotation systems: A linear-weighted hybrid approach. Journal of Computer and System Sciences, 78(4):1160–1174, 2012.
[8] J. Gemmell, T. Schimoler, B. Mobasher, and R. Burke. Resource recommendation in social annotation systems: A linear-weighted hybrid approach. Journal of Computer and System Sciences, 78(4):1160–1174, 2012.
[9] G. H. Golub and C. Reinsch. Singular value decomposition and least squares solutions. Numerische Mathematik, 14(5):403–420, 1970.
[10] M. U. Gutmann and A. Hyvärinen. Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics. Journal of Machine Learning Research, 13(1):307–361, 2012.
[11] G. E. Hinton. Distributed representations. 1984.
[12] O. Irsoy and C. Cardie. Opinion mining with deep recurrent neural networks. In EMNLP, pages 720–728, 2014.
[13] R. Jäschke, L. Marinho, A. Hotho, L. Schmidt-Thieme, and G. Stumme. Tag recommendations in folksonomies. In European Conference on Principles of Data Mining and Knowledge Discovery, pages 506–514. Springer, 2007.
[14] I. Jolliffe. Principal Component Analysis. Wiley Online Library, 2002.
[15] J. A. Konstan, B. N. Miller, D. Maltz, J. L. Herlocker, L. R. Gordon, and J. Riedl. GroupLens: applying collaborative filtering to Usenet news. Communications of the ACM, 40(3):77–87, 1997.
[16] Y. Koren, R. Bell, and C. Volinsky. Matrix factorization techniques for recommender systems. Computer, 42(8):30–37, 2009.
[17] T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
[18] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119, 2013.
[19] T. Mikolov, W.-t. Yih, and G. Zweig. Linguistic regularities in continuous space word representations. In HLT-NAACL, volume 13, pages 746–751, 2013.
[20] A. Mnih and G. Hinton. A scalable hierarchical distributed language model. In NIPS, 2008.
[21] F. Morin and Y. Bengio. Hierarchical probabilistic neural network language model. In AISTATS '05, pages 246–252, 2005.
[22] A. Paterek. Improving regularized singular value decomposition for collaborative filtering. In Proceedings of KDD Cup and Workshop, volume 2007, pages 5–8, 2007.
[23] M. J. Pazzani and D. Billsus. Content-based recommendation systems. In The Adaptive Web, pages 325–341. Springer, 2007.
[24] B. Perozzi, R. Al-Rfou, and S. Skiena. DeepWalk: online learning of social representations. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '14, pages 701–710, New York, NY, USA, 2014. ACM.
[25] J. Riedl and J. Konstan. MovieLens dataset, 1998.
[26] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning internal representations by back-propagating errors. Nature, 323:533–536, 1986.
[27] G. Salton, A. Wong, and C.-S. Yang. A vector space model for automatic indexing. Communications of the ACM, 18(11):613–620, 1975.
[28] B. Sarwar, G. Karypis, J. Konstan, and J. Riedl. Item-based collaborative filtering recommendation algorithms. In Proceedings of the 10th International Conference on World Wide Web, pages 285–295. ACM, 2001.
[29] S. Sen, J. Vig, and J. Riedl. Tagommenders: connecting users to items through tags. In Proceedings of the 18th International Conference on World Wide Web, pages 671–680. ACM, 2009.
[30] U. Shardanand and P. Maes. Social information filtering: algorithms for automating "word of mouth". In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 210–217. ACM Press/Addison-Wesley Publishing Co., 1995.
[31] T. Tran and R. Cohen. Hybrid recommender systems for electronic commerce. In Proc. Knowledge-Based Electronic Markets, Papers from the AAAI Workshop, Technical Report WS-00-04. AAAI Press, 2000.
[32] K. H. L. Tso-Sutter, L. B. Marinho, and L. Schmidt-Thieme. Tag-aware recommender systems by fusion of collaborative filtering algorithms. In Proceedings of the 2008 ACM Symposium on Applied Computing, SAC '08, pages 1995–1999, New York, NY, USA, 2008. ACM.
[33] A. van den Oord, S. Dieleman, and B. Schrauwen. Deep content-based music recommendation. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 2643–2651. Curran Associates, Inc., 2013.
[34] H. Wang, N. Wang, and D.-Y. Yeung. Collaborative deep learning for recommender systems. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '15, pages 1235–1244, New York, NY, USA, 2015. ACM.
[35] Z. Wang, J. Zhang, J. Feng, and Z. Chen. Knowledge graph and text jointly embedding. In EMNLP, pages 1591–1601. Citeseer, 2014.
[36] X. Zhang and Y. LeCun. Text understanding from scratch. arXiv preprint arXiv:1502.01710, 2015.
[37] Z.-D. Zhao and M.-S. Shang. User-based collaborative-filtering recommendation algorithms on Hadoop. In Third International Conference on Knowledge Discovery and Data Mining (WKDD '10), pages 478–481. IEEE, 2010.
[38] N. Zheng and Q. Li. A recommender system based on tag and time information for social tagging systems. Expert Systems with Applications, 38(4):4575–4587, 2011.