
MERGING CONSUMPTION DATA FROM MULTIPLE SOURCES TO QUANTIFY USER LIKING AND WATCHING BEHAVIOURS

Sashikumar Venkataraman, Craig Carmichael
Rovi Corporation

Abstract

We describe an integrated platform that aggregates consumption data from multiple sources towards building a unified model for collaborative filtering for more accurate user and content representations. The goal is to provide a framework that combines various signals spanning explicit ratings, implicit information of watching behaviors and meta-content information in a single model that potentially goes beyond the usual goal of maximizing consumption and incorporates metrics that capture "likeness" and "discovery". We also feed the usage data back into meta-content to determine more accurate content representations that aid in targeting content-based recommendations more effectively.

INTRODUCTION

Recommendation systems are becoming an increasingly key component in many media and entertainment content retrieval systems by providing a powerful, efficient method for users to easily sift through large catalogs of media content and find valuable programming [1-3]. Within the last couple of decades, several algorithms have been developed that leverage metadata, such as information on programs, casts and genres, and user consumption data to provide users with more targeted content recommendations through these recommendation systems [4-5]. However, in most systems, much of the viewership data is typically implicit in nature, and models that are based on over-simplified estimations of viewing behavior are often not wholly comprehensive and intuitive. A user is further predisposed to watch certain shows and channels, often based on pure habit, content they are already aware of, or the sheer popularity of a show, and these aspects are usually unaccounted for in the recall score of a recommender system.

In this paper we discuss a unified framework that aggregates consumption data from multiple sources and fuses it with meta-content to create an efficient recommendations framework capable of both significant discovery and high precision. First, we begin with the cold-start problem where no consumption data is available, and create a baseline recommendation model that applies word-to-vector [6] factors solely from metadata content. An aggregation process is used to accumulate these word-level vectors into content-level factors for each show and channel potentially consumed. Next, we derive the usage factors for each media asset considering both explicit and implicit ratings from various sources. While the explicit ratings provide a more direct notion of "likeness", the implicit signals only provide a measure of watching behavior. We discuss methods to correlate these notions of likeness and watching attributes in a more formal way.

A central part of the framework for merging consumption data from multiple sources is the Rovi Knowledge Graph, which incorporates factual information about all 'known' or 'named' things. This includes all movies and TV shows, music albums and songs, as well as all known people such as actors, musicians, celebrities, music bands, known companies and businesses, places, sports teams, tournaments and players, etc. All the facts pertaining to these entities are synthesized from multiple publicly available

2016 Spring Technical Forum Proceedings

sources such as Wikipedia, Freebase, and many others, and correlated so as to create a unique smart tag (with a unique identifier) to represent each entity in the Rovi Knowledge Graph. The main utility of the knowledge graph for our problem is in aiding the merging of information across different data sources. We have a merge system that can take any data source and does a best effort to merge the entities in that data source with the underlying knowledge graph. Such a system makes the aggregation easier since different data sources have variations in referring to entities and carry different levels of meta-information.

One often encounters rich meta-content associated with media assets, such as genre, keywords, and description. However, the relevance or weight of each individual piece of meta-content (for finding similar movies or recommendations) is often lacking, missing or wrong due to multiple sources, algorithms, and/or manual entry. For example, a show is a comedy, but exactly how funny it is and how it impacts other comedies is more of a viewing sentiment. Usage data, on the other hand, provides a different kind of information in conveying what programs co-occur in watching behavior across users. Analysis of usage data involves very different techniques and algorithms from those for analyzing meta-content. There has been some effort in the context of recommendations wherein one uses meta-content to filter in the post-process of the collaborative filtering algorithm, or mixes results coming from collaborative filtering and meta-content algorithms. However, no prior efforts make use of usage data to better understand metadata relevance and enrich meta-content directly. It would be desirable if usage data along with the implicit/explicit ratings of users could be leveraged to determine the relevant weights of different pieces of meta-content.

OVERALL ARCHITECTURE

We begin by describing the key steps involved in merging deep and dynamic metadata with usage data across various data sources. These steps are described in brief below and will be discussed in further detail in the following sections.

Step 1: We first merge the assets from the data source into the central knowledge graph, to enable infusing of usage and meta-information to and from the knowledge graph. This step serves a dual purpose. Firstly, it helps in augmenting the meta-content of the assets in terms of keywords, genres and deep descriptors from the knowledge graph, and hence aids in getting more accurate factors derived from meta-content representations. Secondly, it aids in merging the usage factors from this data source with usage factors from other data sources that are also merged with the central knowledge graph.

Step 2: Using word representations, we next determine the meta-content factors corresponding to the media assets for a cold-start baseline with no or minimal usage data. The word2vec model developed in [6] is used to determine the word representations corresponding to each piece of meta-content information, and these are aggregated to form a vector for each asset in a K-dimensional vector space.

Step 3: Next, we build a model that merges implicit and explicit information from multiple data sources and fuses them into the meta-content factors to create a more accurate asset representation. This naturally results in a model to estimate "likeness" from user watching behavior even in the absence of explicit information.

Step 4: Finally, we feed the usage data back into meta-content in the form of coefficient weights of each individual meta-content factor involved in each asset. This enables the

creation of more accurate representations of the individual meta-content factors that further increase the precision in content-based recommendations.

MERGING INTO KNOWLEDGE GRAPH

The Rovi Knowledge Graph is a dynamic system that has been created by synthesizing multiple metadata sources and is continuously evolved and refreshed over time. Each smart tag in the Knowledge Graph has rich meta-content built by a combination of manual tagging and automatic aggregation from multiple sources, along with several machine-learning algorithms.

[Figure: data sources feeding the central Knowledge Graph — SOCIAL (Twitter, Facebook, Google Trends, etc.), REVIEWS (RottenTomatoes, Nielsens, etc.), SPORTS (ESPN, NFL, NBA), NEWS (Reuters, CBS, NBC, BBC), OTT/VOD (YouTube, Amazon, Hulu, Vudu), WIKIS (Wiki, Freebase, TMDB, etc.)]

Fig 1: Merging of assets from multiple data sources into a central knowledge graph

Fig 1 shows the merging of information from various sources into the central knowledge graph. While certain meta-content fields, such as cast members, roles, and release year, are unambiguous, other fields such as genres and keywords are more subjective. These fields are assigned using a combination of manual tagging and automatic aggregation from multiple sites like Wikipedia and Freebase. Genres can also be assigned by tagging directly from keywords gleaned from descriptions or plots. This is done using machine-learning algorithms by first determining the set of keywords associated with a genre, using movies known to have that genre, and then predicting them on other movies that have a strong overlap in keywords with the genre keywords.

The core part of building the Knowledge Graph is a "merge" function that takes in any source and tries to map the entities in that source to the entities in the existing graph. Whenever a new entity, such as a movie or personality, is merged with an existing smart tag, the metadata corresponding to the smart tag gets augmented from the entity. If the new entity does not get merged with any smart tag, then a new smart tag is created for the entity in the KG. An important aspect of the merging is the allowance of slight variations in the meta-content fields between the two assets. For example, the titles can differ slightly due to several reasons. One reason for the difference in titles is the fact that some sources put the season and episode numbers in the title, while other sources only put the episode title in the title. Sometimes they may differ due to some lexical error or the absence of common words like articles. The release year could also vary by one or two units, and sometimes cast members could be missing. All such variations are considered during the merge process by considering all the fields simultaneously and coming up with a combined match score over all fields. Following are two examples of approximate matching with inexact titles which got matched due to matching of other fields:

BUILDING WORD-REPRESENTATIONS FROM META-DATA

We now describe the step of determining the baseline factors for the media assets from meta-content. These factors are especially useful in cold-start situations with no or minimal usage information. Over time the usage information from multiple data sources plays a key role in how the asset factors evolve; nonetheless the meta-content factors remain valuable to bias recommendations

2016 Spring Technical Forum Proceedings towards more meta-content oriented discovery optimally determine those weights in a later without losing high precision. In our section. collaborative filtering algorithm, we represent meta-content information of each unique MULTI DATA-SOURCE media-asset as a weighted combination of COLLABORATIVE FILTERING MODEL individual meta-content pieces, such as genre, category, and keywords. Each individual As mentioned earlier, one of the main piece of meta-content can be represented as a challenges in doing collaborative filtering is vector in a K-dimensional vector space (K is the lack of explicit ratings in dealing with usually 100-300). Each asset vector is then a usage data from many data sources. Most of weighted sum of individual vectors and hence this usage data results from linear TV also a vector in this space. The individual watching behavior where the users are limited vectors could have been determined in content choice and no explicit feedback is independently by other known algorithms provided that indicates how much the user based on co-occurrences of terms in large liked a particular show or movie. We also get corpus (such as word2vec [6]). (to a lesser extent) usage data that involves VOD (video on-demand). Though the signals While a naïve algorithm in determining the from VOD (and DVR) do provide a greater asset vector could be to just average over all correlation to user liking behavior, they too the vectors of keywords and genres, there are are not as accurate as explicit ratings due to several problems in this approach. Firstly, not the limitation in the number of media assets in all keywords are of same importance and typical VOD catalogs. At best, they capture some keywords may be noisy and unrelated to the kinds of genres and categories that the the theme of the movie. 
Secondly, the genre user typically watches, but fails to capture vectors are not readily available in the more fine-grained likeness or dis-likeness word2vec model and needs to be explicitly within a category. So the most valuable computed from the movies. To address the information that we get is the usage data from first problem, we process the movie keywords a couple of sources that involve explicit to form keyword clusters based on the ratings. However, the amount of this kind of closeness of the keyword-vectors and then usage data is a lot less compared to the usage filter those keywords that don’t fall in a data relating to linear TV and VOD. The significant weighted cluster. For the second challenge then is to build a model that can problem, we treat a genre as a collection of unify all these forms of usage data and movies and hence cluster all the keywords provide a model to carry over information across all the movies belonging to that from data source to augment the usage factors category. The top keyword clusters found by in another source. this clustering are then used to find the final genre vectors. After we determine the There are several models existing that keyword and genre vectors, we can then find convert the ratings (explicit or implicit) to the final asset vectors as a linear combination similarities between media assets. These of the corresponding keyword and genre include Pearson correlation coefficient, cosine vectors. While these asset vectors are a good similarity, log-likelihood and jaccard starting point in our collaborative filtering coefficient. Yet another interesting coefficient model, it may not be the most accurate is a notion of probabilistic similarity as representation due to the lack of knowledge of discussed in [7] referred as ProbSim in the genre and keyword cluster weighting sections below. During merging ratings from coefficients. 
We will discuss techniques to multiple sources, the data sources explicit ratings are given a higher weight than data

sources with implicit ratings. We also treat VOD/DVR watching as a more explicit intent and give higher weight to those data points. The final result of merging all these forms of usage data is an NxN similarity matrix where each element (i, j) gives the similarity between the items i and j. The similarity matrix can either be created with only explicit ratings (referred to as XSim(i, j)), only implicit ratings (referred to as ISim(i, j)), or by combining both (referred to just as Sim(i, j)).

Our next step is to determine the factors for the media assets that most closely match these similarities derived from the usage data. The starting point for the asset vectors are the factors determined in the previous section from the word representations corresponding to the meta-content. The asset vectors are then allowed to vary so as to match the usage similarities as much as possible in an iterative fashion. For every item-item pair, the corresponding item factors are made to come closer to or move away from each other based on how much the cosine distance between the asset vectors matches the computed item-item similarity. These final asset vectors represent a more accurate representation that reflects usage-based similarity while simultaneously remaining as close to the meta-content representation as possible. Hence, these form an ideal representation for hybrid recommendations.

We next use these similarities to create a model to determine explicit user likeness from implicit watching behavior. While explicit ratings were unambiguously absorbed in the similarity computation, implicit watching events were accompanied by several signals that needed to be translated to some form of implicit rating in a systematic fashion. Some of these signals include the duration of watching, the number of times/episodes watched, the number of similar items the user has watched around that asset, the price the user paid for watching, the average rating users have given that particular item in other data sources, the popularity of the asset, etc. The goal is then to build a model that uses these signals to create an optimal expression for "implied rating" or "likeness". For this purpose, we create a model that translates the implicit ratings to similarities and then matches them with the similarities obtained from explicit ratings. Figure 2 shows the model that is used for this purpose.

[Figure: a layered model mapping explicit (X) and implicit (I) user-item inputs through rating, similarity and error layers]

Fig 2: Merging usage from multiple data sources with explicit and implicit ratings

In the above figure, we aggregate the raw user event data in the first or bottommost layer, L1. Examples of explicit indicators include specific data that contain explicit information such as ratings on a scale of 1-5 in increments of a half-star, or a like/not (0/1) binary indicator. This would comprise an explicit (input) vector X_ui(d). Examples of implicit indicators for user-item interaction would include implicit watching signals such as the ones described earlier. This would comprise an implicit vector I_ui(d). The next layer

L2 contains a mapping from the consumption details provided by L1 (both $I_{ui}(d)$ and $X_{ui}(d)$) to a form of preference or likeness of a user $u$ for item $i$, $r_{ui}(d)$, in the data source $d$. If the data set $d$ involved is an explicit one, the mapping is often straightforward. However, for implicit data sets, the mapping is more challenging and would be in the form of a (preference) model taking various forms that ultimately depend on a set of trainable weights $W$. Different functions can be used for various embodiments of $f_I$ which represent the trainable $W$ in different ways. For example, we can use a linear estimator with a sigmoid function at the output node, a general regression neural net, a random forest, or various others.

The predicted ratings are then passed to a similarity layer L3, which takes the preference estimates for each $(u, i, d)$ across all the common media assets and users, and produces a similarity estimate between media assets $i$ and $j$. While different kinds of similarities (discussed earlier) can be used, we use similarities based on a weighted Pearson coefficient as given below:

$$\mathrm{Sim}(i,j,d) = \frac{\sum_{u \in U_{(i,j)}(d)} \bigl(X_{r_{ui}(d)} - \bar{X}_{r_i(d)}\bigr)\bigl(X_{r_{uj}(d)} - \bar{X}_{r_j(d)}\bigr)}{\sqrt{\sum_{u \in U_{(i,j)}(d)} \bigl(X_{r_{ui}(d)} - \bar{X}_{r_i(d)}\bigr)^2}\,\sqrt{\sum_{u \in U_{(i,j)}(d)} \bigl(X_{r_{uj}(d)} - \bar{X}_{r_j(d)}\bigr)^2}}$$

It is assumed that whether similarities are determined using implicit or explicit data, they should roughly equate to the same values per $(i, j)$ pair across multiple data sets. So first we obtain an estimate XSim(i, j) based on the usage data among data sources with explicit ratings. This is then compared with the data sources with implicit ratings, and the difference is used in the final layer L4 to compute the error between the two observed similarities as shown below:

$$\mathrm{Error} = \sum_{d=1}^{D_{\mathrm{implicit}}} \sum_{i=1}^{N} \sum_{j=1}^{N} \bigl(\mathrm{XSim}(i,j) - \mathrm{ISim}(i,j,d)\bigr)^2$$

Then we can take the error and propagate it backwards, layer by layer, until the derivatives are estimated across the trainable weights $W$. The iterations are performed until the error is minimized to below a certain threshold. We then use those weights as the optimal coefficients to combine the implicit signals and create an implicit rating.

FUSING USAGE INFORMATION INTO META-CONTENT

The collaborative filtering model used here is referred to as Weighted Vector Collaborative Filtering (WVCF), where the meta-content information of each media asset is represented as a weighted combination of individual meta-content pieces, such as genre, category, and keywords, and each individual meta-content piece is treated as a vector in a K-dimensional vector space. The meta-content similarity of two assets is then modeled as a function of these individual meta-content pieces of information (such as a dot product). However, the weight coefficients for each individual piece of information are usually not known a priori, and our goal is to use usage information to best predict these weights. Once these weights are determined, we can create more accurate item-item similarities and thereby more accurate recommendations.

Importantly, the WVCF baseline model consisting of a single trained vector per asset can be broken up into two finely tuned predictors in the modeling pipeline, one that targets aspects of watching (WVCF-watch), and one that targets liking characteristics of the user (WVCF-like). Later in the results section is an example of recall performance using Gradient Boosted Trees as a local corrector to target both sides of precision (liking) and recall (watching).
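As a concrete illustration of the L2 mapping, the following is a minimal sketch of the linear-estimator-with-sigmoid form mentioned above. The signal names, weights and bias here are hypothetical placeholders, not trained values from the paper; in the actual model the weights are learned by back-propagating the L4 error.

```python
import math

# Hypothetical signal weights standing in for the trainable weights W of
# layer L2; in the paper these are learned, not hand-set.
W = {"frac_watched": 2.5, "episodes_watched": 0.3, "paid_to_watch": 1.0}
BIAS = -2.0

def implied_rating(signals):
    """Linear estimator with a sigmoid output node: maps implicit watching
    signals to a pseudo-rating r_ui(d) in (0, 1)."""
    z = BIAS + sum(W[name] * value for name, value in signals.items())
    return 1.0 / (1.0 + math.exp(-z))

# A user who binge-watched a series vs. one who sampled a single episode.
binge = {"frac_watched": 0.95, "episodes_watched": 8, "paid_to_watch": 1}
sampled = {"frac_watched": 0.10, "episodes_watched": 1, "paid_to_watch": 0}

assert 0.0 < implied_rating(sampled) < implied_rating(binge) < 1.0
```

Any monotone model (regression net, random forest) could replace the sigmoid estimator; the point is only that heterogeneous watching signals collapse to a single bounded preference score.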

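The L3 similarity and L4 error described above can be sketched as follows, using a plain (unweighted) Pearson coefficient over co-rating users for simplicity — the model itself uses a weighted variant — and made-up rating dictionaries:

```python
import math

def pearson_sim(ratings_i, ratings_j):
    """Pearson similarity between two items over the users who rated both
    (one data source d); the paper uses a weighted variant of this."""
    users = sorted(set(ratings_i) & set(ratings_j))
    xi = [ratings_i[u] for u in users]
    xj = [ratings_j[u] for u in users]
    mi, mj = sum(xi) / len(xi), sum(xj) / len(xj)
    num = sum((a - mi) * (b - mj) for a, b in zip(xi, xj))
    den = (math.sqrt(sum((a - mi) ** 2 for a in xi))
           * math.sqrt(sum((b - mj) ** 2 for b in xj)))
    return num / den if den else 0.0

def l4_error(xsim, isim_per_source):
    """Layer L4: squared difference between the explicit-source similarity
    XSim(i, j) and each implicit-source similarity ISim(i, j, d)."""
    return sum((xsim[pair] - isim[pair]) ** 2
               for isim in isim_per_source for pair in xsim)

# Perfectly correlated toy ratings give similarity 1.0.
assert abs(pearson_sim({"u1": 5, "u2": 3, "u3": 1},
                       {"u1": 4, "u2": 3, "u3": 2}) - 1.0) < 1e-9

# Made-up similarities for two item pairs and one implicit data source.
xsim = {("i", "j"): 0.8, ("i", "k"): -0.1}
isim_sources = [{("i", "j"): 0.6, ("i", "k"): 0.0}]
assert abs(l4_error(xsim, isim_sources) - 0.05) < 1e-12
```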
As described in the previous section, the usage information is separately modeled to produce item-item similarity, wherein items watched together and similarly evaluated/rated (common sentiment) across multiple users have better usage similarity. This similarity also takes into account the user's sentiment along with viewing information (i.e. both watched and similarly liked/reviewed as opposed to only watched). All similarities are in some sense based in part on co-occurrence; usage-based similarity is the co-occurrence of users watching the same program, while meta-content similarity is the co-occurrence of some meta-content (crew or genre or keyword). The strength of co-occurrence in the meta-content case is based on the weights of the individual meta-content information within that asset, while the strength of co-occurrence in usage-based similarity is governed by the user sentiment/rating.

The system then tries to align the media asset vectors as closely as possible to the usage-based similarities. An error function is constructed that compares the modeled meta-content similarity to the usage-based similarity (based on co-occurrence combined with sentiment factors):

$$E = \sum_{ij} \bigl(s_{ij} - m_{ij}\bigr)^2$$

where $s_{ij}$ denotes the observed "sentimental" similarity between items $i$ and $j$ (as determined from the usage data discussed in the previous section) and $m_{ij}$ denotes the modeled similarity based on the asset meta-content vectors. The objective in this error minimization is to explain the observed sentimental similarities between many asset pairs with a model that also has deeply embedded metadata. The modeled similarity $m_{ij}$ is defined by the dot product of the asset vectors $av_i$ and $av_j$ over all the latent factors:

$$m_{ij} = \sum_{f} a_{if}\, a_{jf}$$

We next break down the asset vectors into individual metadata components consisting of keywords and genres:

$$a_{if} = v_i^{\mathrm{genres}} w_{if}^{\mathrm{genres}} + v_i^{\mathrm{key}} w_{if}^{\mathrm{key}}$$

$$w_{if}^{\mathrm{genres}} = \sum_{g \in i} v_g^{\mathrm{genre}} w_{gf}^{\mathrm{genre}} \quad \text{(over genres in } i\text{)}$$

$$w_{if}^{\mathrm{keywords}} = \sum_{k \in i} v_k^{\mathrm{keyword}} w_{kf}^{\mathrm{keyword}} \quad \text{(over keywords in } i\text{)}$$

where $w_{if}$ denotes the total genre and keyword factors, and $v_i$ denotes the corresponding weights of those fields. The model continues to extend so far as metadata is available. Unknown components can also be added to address cases where metadata is lacking or unavailable. We then break down the genre and keyword vectors further into factors corresponding to individual genres and keywords. Next we minimize the error function using stochastic gradient descent, which changes the weights of the individual meta-content components so that the net error between the meta-content similarities and the usage-based similarities is minimized. After iterating over all the usage data, the individual meta-content weights are stored as the best predictors of the corresponding metadata relevance for the media asset.

RESULTS

Table 1 below shows 10 examples of the most similar shows based on the cosine similarity between sentiment vectors prior to training the model, using only a small subset of the terms in each $a_{if}$. This uses only the raw source combined with weighted word vectors, without the use of metadata filtering based on advanced tags, moods or deep descriptors. So the starting point $a_{if}$ used for training each media asset vector is already capable of overcoming cold-start issues.
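A toy sketch of this alignment step, assuming tiny two-factor asset vectors and a single observed usage similarity. The real system updates the per-meta-content-component weights, but the gradient of $(s_{ij} - m_{ij})^2$ with respect to the factors follows the same idea:

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def align(assets, usage_sims, lr=0.05, epochs=200):
    """Stochastic gradient descent on E = sum (s_ij - m_ij)^2 with
    m_ij = a_i . a_j; each pass nudges the two asset vectors so the
    modeled similarity moves toward the usage-based one."""
    for _ in range(epochs):
        for (i, j), s_ij in usage_sims.items():
            ai, aj = assets[i], assets[j]
            err = s_ij - dot(ai, aj)
            for f in range(len(ai)):
                gi, gj = ai[f], aj[f]
                ai[f] += lr * err * gj  # d m_ij / d a_if = a_jf
                aj[f] += lr * err * gi
    return assets

# Meta-content starting points (K=2 toy factors, made-up values).
assets = {"i": [0.5, 0.5], "j": [0.5, -0.5]}
usage_sims = {("i", "j"): 0.9}  # usage data says i and j are very similar

before = dot(assets["i"], assets["j"])
align(assets, usage_sims)
after = dot(assets["i"], assets["j"])
assert abs(0.9 - after) < abs(0.9 - before)
```

Because the starting vectors come from meta-content, a pair with little usage evidence simply stays near its cold-start position, which is what makes the representation hybrid.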

Title 1                                      | Title 2
Top Chef                                     | Top Chef: Texas
Scooby-Doo! Pirates Ahoy!                    | Scooby-Doo and the Alien Invaders
28 Weeks Later                               | Resident Evil: Apocalypse
Pirates of the Caribbean: On Stranger Tides  | Pirates of the Caribbean: The Curse of the Black Pearl
Kansas City SWAT                             | Detroit SWAT
Stuart Little 2                              | Stuart Little
Modern Vampires                              | Vampire in Brooklyn
Mad Max                                      | Death Race 2000
Godzilla vs. Mechagodzilla II                | Terror of Mechagodzilla
Attack of the Killer Tomatoes                | Return of the Killer Tomatoes

Table 1: Example most similar shows using meta-content tags before training usage data

Table 2 below and Figure 3 demonstrate the trained accuracy of our model for various numbers of meta-content factors. Recall@K for various values of K was computed over a VOD data set. Note the objective here was to show that precision and recall can be contained within a vector rather than a similarity matrix without significant loss, while holding metadata seamlessly across usage spaces as vectored sub-components. Further improvements are seen using WVCF-watch and WVCF-like (beyond the vector). Training of the vector was based on 15K assets and 1 million users over the span of a year. It appears little is lost in performance when comparing the current method to a purely usage-based approach such as ProbSim, which has proven to be a top method in the recommender systems space in terms of precision/recall.

Model        | K=5   | K=10  | K=25  | K=50  | K=100
Probsim      | 0.192 | 0.286 | 0.445 | 0.551 | 0.681
WVCF, F=20   | 0.163 | 0.249 | 0.365 | 0.452 | 0.575
WVCF, F=50   | 0.163 | 0.266 | 0.392 | 0.495 | 0.626
WVCF, F=100  | 0.186 | 0.289 | 0.425 | 0.528 | 0.664
WVCF, F=300  | 0.206 | 0.296 | 0.432 | 0.551 | 0.678

Table 2: Recall@K performance of WVCF vs Probsim

[Figure: Recall@K curves for Probsim and WVCF with F=20, 50, 100 and 300]

Fig 3: Recall performance of WVCF baseline vs Probsim for various factors, prior to local effects

Embedded into our model is the potential to also normalize to a space that can be consistent across multiple usage data sets. So when recommending an asset that is either completely new or not yet popular, this model will tend to significantly outperform purely usage-based approaches.

The difference between WVCF and ProbSim becomes more prominent when we consider the "discovery" factor along with recall/precision. This factor is based on how many overall programs of the corpus get recommended in recall to a set of users. It has been noted that models like ProbSim have an extremely low diversity, since most of the recommendations are drawn from the top 1% most popular programs.

Table 3 shows the top 15 programs from the bottom quartile of the MovieLens (20M, shows having less than 100 ratings discarded) dataset, using the total number of positive and negative ratings and sorting by the ratio of positive ratings to the total number of ratings. While these "hidden gems" are virtually never recommended by the ProbSim model, they do have a high positive rating (though the overall number of ratings is small), and are therefore

appropriate for more targeted user recommendations. This is versus at least 4% hidden gems recommended for a modified approach, which is thus capable of a better spread in recommendations w.r.t. the discovery of "long-tail" programs.

Figure 4 shows the variation among users in MovieLens [9] in their preference for watching different grades of popular shows. Recommender systems targeted to high recall and high precision tend to overshoot for the most popular shows (Q1) to score well on recall (and precision automatically, since there are not many "2" or lower ratings for the most popular shows), while not discovering quality shows at lower popularity.

[Figure: probability of watching by popularity quartile (Q1-Q4) for the average user and three sample users (userId=30236, userId=53971, userId=156)]

Fig 4: Popularity effect by quartile in MovieLens

To address the above issue, we introduce a penalty for not matching the user's overall popularity preferences over multiple bins, using a loss such as:

$$\mathrm{LOSS}(u) = 1 - \alpha \sum_{b=1}^{N_{\mathrm{bins}}} \bigl| P(u,b) - p(u,b) \bigr|$$

Averaged over all users $u$, with $\alpha = 1$ and $N_{\mathrm{bins}} = 10$ popularity bins, for example, the new recall score is normalized to better match the user distribution $P(u)$ with the recommended distribution $p(u)$. This penalty reduces Probsim's recall score over MovieLens from roughly 0.3 to 0.1, with Q4 hidden gems still at nearly 0%. Furthermore, different shows and movies have quite different popularity characteristics across data sets, both explicit and implicit. To this extent, recall scores should be penalized even further to remove arbitrary popularity artifacts associated with each data set. Although the choice of $\alpha$ was arbitrary in this case, without having a "budget" to place recommendations into popularity bins which fits known user preferences, the model otherwise appears lacking in diversity and drastically under-recommends hidden gems, for the short-sighted purpose of scoring high recall.

Title                       | N(i) ratings | Avg(i) rating | +ve ratings G(i) | -ve ratings B(i) | G(i)/N(i)
Intimate Strangers          | 126          | 3.655         | 117              | 2                | 0.9832
Follow the Fleet            | 118          | 3.712         | 111              | 2                | 0.9823
Ax, The (Couperet, Le)      | 114          | 3.807         | 107              | 2                | 0.9817
One Man Band                | 105          | 3.895         | 98               | 2                | 0.9800
Angels' Share               | 105          | 3.710         | 96               | 2                | 0.9796
Queen of Versailles, The    | 144          | 3.628         | 137              | 3                | 0.9786
Yes Men Fix the World, The  | 194          | 3.843         | 179              | 4                | 0.9781
Big Clock, The              | 162          | 3.756         | 153              | 4                | 0.9745
Hundred-Foot Journey, The   | 156          | 3.750         | 143              | 4                | 0.9728
Short Film About Love, A    | 224          | 4.025         | 214              | 6                | 0.9727
Our Man in Havana           | 153          | 3.748         | 142              | 4                | 0.9726
Pixar Story, The            | 186          | 3.747         | 177              | 5                | 0.9725
Criss Cross                 | 115          | 3.639         | 103              | 3                | 0.9717
Advise and Consent          | 183          | 3.811         | 170              | 5                | 0.9714
Something the Lord Made     | 214          | 3.895         | 199              | 6                | 0.9707

Table 3: 15 Hidden Gems in the fourth quartile of popularity in MovieLens

If the model were tweaked for purely high recall, where watching is predicted rather than liking (to find hidden gems), Figure 5 below shows example performance of WVCF-watch along with liking and the combination of

2016 Spring Technical Forum Proceedings like/watch. Figure 6 shows the precision 1 equivalent. Here Gradient Boosted trees take 0.9 the baseline WVCF (vector only), one movie 0.8 at a time, and apply the baseline predictor as 0.7 an attribute for that movie. It then expands 0.6 0.5 the input deck to also include many attributes SVD(F=100) 0.4 Recall@K Probsim in the surrounding neighborhood of ratings. 0.3 WVCF-watch This produces a very strong estimator targeted 0.2 WVCF-like 0.1 to watching. In this case, Figure 5 shows WVCF-like/watch SVD (F=100) versus Probsim versus WVCF- 0 watch for the (budgeted) popularity bin of the 0246810 K top 25 most popular shows. Recall/Precision points were tallied based on each user/item Fig 5: MovieLens Recall@K for top 25 Most sample in the cross validation being Popular bin during “budgeted” recall somewhere in that bin (with the predictor answering which movie was watched of the 0.99 25). The important thing here is that with SVD(F=100) both WVCF-watch and WVCF-like, the 0.985 Probsim 0.98 WVCF-watch underlying recommender engine can control 0.975 WVCF-like effects such as the dial for liking and 0.97 WVCF-like/watch watching, compared to other models with high 0.965 0.96 precision/recall that only provides a single Precision answer (both like and watch simultaneously). 0.955 0.95 0.945 Table 4 below shows the results of 0.94 applying the 208 word emotions vocabulary 0246810 from [10] using pre-trained word vectors to K MovieLens usage within a modified SVD framework. As these moods serve as Fig 6: MovieLens Precision for top 25 Most Popular additional metadata, the experiment here bin during “budgeted” recall shows the power of Word Vectors fused into the model. The training process tuned the relevance of each vocabulary word vector to Rank Overall Popular Gems each movie, within SVD latent movie factors, 1 contrary lazy honest 2 angry inconsiderate contrary to best model explicit likes. 
For top word 3 brilliant pity cordial ranking, interesting patterns emerge between 4 wonder obnoxious powerful overall, most popular, and gems. For example 5 bored annoyed lucky “contrary” and “bored” is universally 6 insightful woeful wise important, “obnoxious” matters for most 7 rough haughty aggressive popular, and “powerful” really applies to 8 affectionate uncertain sad gems. 9 weird alive nasty 10 folksy silly folksy Table 4: Relevance of emotions/feeling/mood word vectors trained over MovieLens usage
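The fusion behind Table 4 can be illustrated with a minimal sketch: rank a mood vocabulary against a movie's latent factor by similarity. This assumes the pre-trained word vectors have already been projected into the SVD latent space; the vectors below are random toy values and the helper name `rank_words` is ours, not from the paper.

```python
import numpy as np

# Toy stand-ins: in the experiment, pre-trained word vectors for the 208-word
# emotions vocabulary [10] would be projected into the SVD latent movie space
# and their per-movie relevance tuned against explicit likes. Here we only
# illustrate the ranking step, with made-up 4-dimensional factors.
rng = np.random.default_rng(0)
vocab = ["contrary", "bored", "obnoxious", "powerful", "folksy"]
word_factors = {w: rng.normal(size=4) for w in vocab}   # hypothetical projections
movie_factor = rng.normal(size=4)                       # one movie's SVD factor

def rank_words(movie_factor, word_factors, top_n=3):
    """Rank vocabulary words by cosine relevance to a movie's latent factor."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    ranked = sorted(word_factors,
                    key=lambda w: cos(movie_factor, word_factors[w]),
                    reverse=True)
    return ranked[:top_n]

print(rank_words(movie_factor, word_factors))
```

Repeating this per movie, and aggregating over all movies, the most popular bin, or the hidden-gems bin, yields top-word lists of the kind shown in Table 4.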

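For readers implementing the evaluation, the "budgeted" recall protocol behind Figures 5 and 6 can be sketched roughly as follows. The function names and the toy score table are illustrative assumptions, not the paper's code: recall is tallied only over held-out samples whose item falls in the target popularity bin, and ranking is restricted to that bin.

```python
def recall_at_k(ranked_items, watched_item, k):
    """1 if the held-out watched item appears in the top-k ranking, else 0."""
    return int(watched_item in ranked_items[:k])

def budgeted_recall_at_k(samples, scores, bin_of, k, target_bin):
    """Recall@K restricted to one popularity bin: for each held-out
    (user, item) sample whose item falls in `target_bin`, rank only the
    items of that bin by predicted score and check the top-k."""
    bin_items = [i for i, b in bin_of.items() if b == target_bin]
    hits, total = 0, 0
    for user, item in samples:
        if bin_of[item] != target_bin:
            continue  # sample outside the budgeted bin is not counted
        ranked = sorted(bin_items,
                        key=lambda i: scores[(user, i)], reverse=True)
        hits += recall_at_k(ranked, item, k)
        total += 1
    return hits / total if total else 0.0

# Toy illustration: three items in a "top 25" bin, two held-out samples.
bin_of = {"A": "top25", "B": "top25", "C": "top25", "D": "gems"}
scores = {("u1", "A"): 0.9, ("u1", "B"): 0.2, ("u1", "C"): 0.5,
          ("u2", "A"): 0.1, ("u2", "B"): 0.8, ("u2", "C"): 0.3}
samples = [("u1", "A"), ("u2", "C")]
print(budgeted_recall_at_k(samples, scores, bin_of, k=1, target_bin="top25"))  # 0.5
```

Because each bin competes only against itself, a model can no longer score well by always recommending the globally popular shows, which is what makes hidden gems visible in this evaluation.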
CONCLUSION

In this paper, we experiment with the idea of providing powerful recommendations, ranging from discovery to high precision, across an arbitrary space of multimedia assets. We discuss a unified framework that aggregates consumption data from multiple sources and fuses it with meta-content to obtain more accurate content and user representations. The explicit ratings are used to convert implicit watching behavior into a notion of "likeness" based on ground truth. The usage information is then fed back into the meta-content to determine more accurate weights for the individual meta-content factors, thereby enabling richer content-content recommendations.

Methods within the unified framework were applied to MovieLens data to frame some common issues in the tradeoff between discovery (of hidden gems) and recall/precision. Both the watching (WVCF-watch) and liking (WVCF-like) models provide further flexibility to personalize recommendations. By budgeting recall to within the same popularity bins, hidden gems are less likely to be ignored due to a prediction of not watching caused by a lack of awareness of the (unpopular) show's existence. Modifications to scoring functions were discussed to better personalize recommendations to each user's watching preferences for maximal viewing enjoyment. An example fusion of word vectors with explicit-ratings usage was provided based on a specified vocabulary, with patterns reported overall, for the most popular shows, and for gems. This is complementary to tag fusion, since in this case each vocabulary word was universally applied to every show.

There are several interesting directions in which our research can continue in the future. Care was taken to seamlessly convert all available data spaces, including metadata, into factored components, but integrated local effects such as [8], as well as temporal effects, could be better incorporated. Even though the choice of gradient boosted trees works well for WVCF-watch and WVCF-like, other local correctors and stacking may be further beneficial. Another key issue is that extensive explicit-rating data is usually not available for linear content. Additional approaches may exist, such as latent factor and nearest neighbor combinations, to close the gap between an arbitrary notion of "liking" in implicit space and explicit and/or ground truth. Another area of research is to further formalize the notion of "discovery" and relate it to likeness and precision. Further modifications of recall scores based on popularity biases between data spaces may also help. It will be interesting to see how this can then be used to create a parameter knob that can be controlled to increase either discovery or precision in recommendations.

REFERENCES

[1] G. Adomavicius and A. Tuzhilin, "Toward the Next Generation of Recommender Systems: A Survey of the State-of-the-Art and Possible Extensions," IEEE Transactions on Knowledge and Data Engineering, 2005, 17(6), pp. 734-749.

[2] P. Melville and V. Sindhwani, "Recommender Systems," Encyclopedia of Machine Learning, 2010.

[3] B. N. Miller et al., "MovieLens Unplugged: Experiences with an Occasionally Connected Recommender System," in Proceedings of the 8th International Conference on Intelligent User Interfaces, 2003, New York, NY.

[4] X. Su and T. M. Khoshgoftaar, "A Survey of Collaborative Filtering Techniques," Advances in Artificial Intelligence, 2009.

[5] C. Musto, "Enhanced Vector Space Models for Content-Based Recommender Systems," in Proceedings of the Fourth ACM Conference on Recommender Systems, 2010, Bari, Italy.

[6] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, "Distributed Representations of Words and Phrases and Their Compositionality," in Advances in Neural Information Processing Systems 26, 2013, pp. 3111-3119.

[7] O. Jojic, M. Shukla, and N. Bhosarekar, "A Probabilistic Definition of Item Similarity," in Proceedings of the 2011 ACM Conference on Recommender Systems (RecSys 2011), Chicago, IL, USA, October 23-27, 2011.

[8] Y. Koren, "Factorization Meets the Neighborhood: A Multifaceted Collaborative Filtering Model," in Proc. 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2008.

[9] GroupLens Research. Available from: http://grouplens.org/datasets/movielens

[10] Vocabulary word list. Available from: https://myvocabulary.com/word-list/emotions-feelings-mood-vocabulary
