An Analysis of Memory Based Collaborative Filtering Recommender Systems with Improvement Proposals

Total Page:16

File Type:pdf, Size:1020Kb

Master in Artificial Intelligence (UPC-URV-UB)
Master of Science Thesis

An Analysis of Memory Based Collaborative Filtering Recommender Systems with Improvement Proposals

Claudio Adrian Levinas
Advisor: María Salamó Llorente
September 2014

Abstract

Memory Based Collaborative Filtering Recommender Systems have been around for the best part of the last twenty years. It is a mature technology, implemented in numerous commercial applications. However, a departure from Memory Based systems in favour of Model Based systems happened during the last years. The Netflix.com competition of 2006 brought the Model Based paradigm to the spotlight, with plenty of research following it. Still, these matrix factorization based algorithms are hard to compute and cumbersome to update. Memory Based approaches, on the other hand, are simple, fast, and self-explanatory. We posit that there are still uncomplicated approaches that can be applied to improve this family of Recommender Systems further.

Four strategies aimed at improving the Accuracy of Memory Based Collaborative Filtering Recommender Systems have been proposed and extensively tested. The strategies put forward include: an Average Item Voting approach to infer missing ratings; an Indirect Estimation algorithm which pre-estimates the missing ratings before computing the overall recommendation; a Class Type Grouping strategy to filter out items of a class different from the target one; and a Weighted Ensemble consisting of an average of an estimation computed with all samples with one obtained via the Class Type Grouping approach.
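To make the first of these strategies concrete, here is a minimal sketch of a user-based prediction with an average-item-voting fallback for missing ratings. This is an illustrative reconstruction, not the thesis's actual code: `ratings` is assumed to map each user to a dict of item ratings, and `item_avg` to hold per-item average votes used when a neighbour has not rated the target item.

```python
import math

def pearson(u, v, ratings):
    """Pearson correlation between users u and v over their co-rated items."""
    common = set(ratings[u]) & set(ratings[v])
    if len(common) < 2:
        return 0.0
    mu_u = sum(ratings[u][i] for i in common) / len(common)
    mu_v = sum(ratings[v][i] for i in common) / len(common)
    num = sum((ratings[u][i] - mu_u) * (ratings[v][i] - mu_v) for i in common)
    den = math.sqrt(sum((ratings[u][i] - mu_u) ** 2 for i in common)) * \
          math.sqrt(sum((ratings[v][i] - mu_v) ** 2 for i in common))
    return num / den if den else 0.0

def predict(u, item, ratings, item_avg):
    """Similarity-weighted average over positively correlated neighbours.
    A neighbour without a rating for the item contributes the item's
    average vote instead (the average-item-voting idea)."""
    num = den = 0.0
    for v in ratings:
        if v == u:
            continue
        w = pearson(u, v, ratings)
        if w <= 0:
            continue  # keep only positively correlated neighbours
        r = ratings[v].get(item, item_avg.get(item, 0.0))
        num += w * r
        den += abs(w)
    return num / den if den else item_avg.get(item, 0.0)
```

When no usable neighbour exists, the prediction degrades gracefully to the item's average vote rather than failing, which is what lets such default-voting schemes raise Coverage without a separate model.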
This work will show that there is still ample space to improve Memory Based Systems and raise their Accuracy to the point where they can compete with state-of-the-art Model Based approaches such as Matrix Factorization or Singular Value Decomposition techniques, which require considerable processing power and generate models that become obsolete as soon as users add new ratings into the system.

Acknowledgements

Artificial Intelligence is a fascinating topic, which certainly will touch our lives in the years to come. But out of the many branches of this rich discipline, Recommender Systems attracted me particularly. This I owe to the teachings of María Salamó Llorente, who introduced me to the topic, patiently answered all my numerous questions, and, after I completed the course, was kind enough to agree to supervise my Thesis. I admire her patience and her focus. Without her, this work would be half as interesting and half as useful.

No man is an island. And I would never have made it this far, this advanced in life, without the support of my wife and my son. They are my pillars. They are my ground. They are my light. Little makes sense without them. Thank you both. Thank you for pushing me further and higher.

In memory of my mother.

Dreams, like faith, might well move mountains. But reality is very keen on bringing them down. So, know your limits better than your dreams. Then, perhaps, you might achieve them one day.

Contents

1 Introduction
  1.1 Definition of the Problem
  1.2 Objectives of this Work
  1.3 Summary
  1.4 Reader's Guide
2 State of the Art
  2.1 Historical Overview
  2.2 Recommender Systems
    2.2.1 User vs Item Based
    2.2.2 Memory vs Model Based
  2.3 User Ratings
    2.3.1 Explicit vs Implicit
    2.3.2 Rating Scales
    2.3.3 Normalizations
  2.4 Similarity Metrics
    2.4.1 Pearson Correlation
    2.4.2 Cosine Distance
    2.4.3 Adjusted Cosine Distance
    2.4.4 Mean Squared Distance
    2.4.5 Euclidean Distance
    2.4.6 Spearman Correlation
  2.5 Neighbourhood
    2.5.1 Top N-Neighbours
    2.5.2 Threshold Filtering
  2.6 Rating Prediction
    2.6.1 Recommender Algorithm
  2.7 Assessment Metrics
    2.7.1 Coverage
    2.7.2 Accuracy
  2.8 Improvement Strategies
    2.8.1 Significance Weighting
    2.8.2 Default Voting
    2.8.3 Context Aware
  2.9 Typical Problems
    2.9.1 Sparsity
    2.9.2 Cold Start
  2.10 Summary
3 Proposals
  3.1 Description of Proposals
    3.1.1 Default Item Voting
    3.1.2 Indirect Estimation
    3.1.3 Class Type Grouping
    3.1.4 Weighted Ensemble
  3.2 Item Based Formulations
  3.3 Summary
4 Experiments
  4.1 Data
  4.2 Methodology
  4.3 Description
  4.4 Results
  4.5 Analysis
  4.6 Discussion
  4.7 Summary
5 Conclusions and Future Work
  5.1 Conclusion
  5.2 Future Work
Bibliography
Appendices
A Fast Algorithm Alternatives
B Summary of Results

List of Figures

4.1 User Based Coverage vs Accuracy progression
4.2 User and Item with Euclidean Distance similarity Coverage vs Accuracy
4.3 User and Item with Pearson Correlation similarity Coverage vs Accuracy
4.4 User and Item with Cosine Distance similarity Coverage vs Accuracy
4.5 Friedman Test of User and Item Based Approaches
4.6 ANOVA Test of Similarity Functions
4.7 ANOVA Test of Significance Weighting Algorithm
4.8 ANOVA Test of User vs Item Based Approaches
4.9 ANOVA Test of Plain, User, Item and Indirect Algorithms
4.10 ANOVA Test of Grouping and Weighting Algorithms

List of Tables

4.1 Neighbourhood size per Similarity Function and Algorithm
4.2 User Based with Euclidean Distance similarity, Default Estimation
4.3 Item Based with Euclidean Distance similarity, Default Estimation
4.4 User Based with Pearson Correlation similarity, Default Estimation
4.5 Item Based with Pearson Correlation similarity, Default Estimation
4.6 User Based with Cosine Distance similarity, Default Estimation
4.7 Item Based with Cosine Distance similarity, Default Estimation
4.8 Prepare Indirect offline algorithm, Default Estimation
4.9 User Based with Euclidean Distance similarity, Class Type Grouping
4.10 Item Based with Euclidean Distance similarity, Class Type Grouping
4.11 User Based with Pearson Correlation similarity, Class Type Grouping
4.12 Item Based with Pearson Correlation similarity, Class Type Grouping
4.13 User Based with Cosine Distance similarity, Class Type Grouping
4.14 Item Based with Cosine Distance similarity, Class Type Grouping
4.15 Prepare Indirect offline algorithm, Class Type Grouping
4.16 User Based with Euclidean Distance similarity, Weighted Ensemble
4.17 Item Based with Euclidean Distance similarity, Weighted Ensemble
4.18 User Based with Pearson Correlation similarity, Weighted Ensemble
4.19 Item Based with Pearson Correlation similarity, Weighted Ensemble
4.20 User Based with Cosine Distance similarity, Weighted Ensemble
4.21 Item Based with Cosine Distance similarity, Weighted Ensemble
4.22 User Based MAE Accuracy results for all Algorithms
4.23 User Based results of Friedman Test
4.24 Item Based MAE Accuracy results for all Algorithms
4.25 Item Based results of Friedman Test
4.26 ANOVA Test of Similarity Functions
4.27 ANOVA Test of Significance Weighting Algorithm
4.28 ANOVA Test of User vs Item Based Approaches
4.29 ANOVA Test of Plain, User, Item and Indirect Algorithms
4.30 ANOVA Test of Grouping and Weighting Algorithms
4.31 Comparative of Thesis Accuracy results against others published
A.1 Fast Algorithms, Euclidean Distance similarity, Default Estimation
A.2 Fast Algorithms, Pearson Correlation similarity, Default Estimation
A.3 Fast Algorithms, Cosine Distance similarity, Default Estimation
A.4 Fast Algorithms, Euclidean Distance similarity, Class Type Grouping
A.5 Fast Algorithms, Pearson Correlation similarity, Class Type Grouping
A.6 Fast Algorithms, Cosine Distance similarity, Class Type Grouping
A.7 Fast Algorithms, Euclidean Distance similarity, Weighted Ensemble
A.8 Fast Algorithms, Pearson Correlation similarity, Weighted Ensemble
A.9 Fast Algorithms, Cosine Distance similarity, Weighted Ensemble
B.1 Summary of Euclidean Distance similarity results
B.2 Summary of Pearson Correlation similarity results
B.3 Summary of Cosine Distance similarity results

Chapter 1: Introduction

1.1 Definition of the Problem

Up until the advent of the internet, shoppers browsing for potential merchandise to purchase would either follow their innate likes and tastes, or follow those of a person they especially trusted. Arguably, we all have a natural trait that compels us to classify things: this we like; this we don't. And in principle it is possible, albeit difficult, to browse all the merchandise in a shop or shopping centre and come out with the one piece that is really worth buying. The "selection by inspection" paradigm works wonders, as it places a high bar between the things we would like to consider further and those we can immediately discard. However, there is a fundamental assumption behind the success of this behavioural pattern that bears remembering: it presupposes that the number of items we will sort through is manageable. But if the internet has taught us anything in the past twenty years, it is that information grows exponentially, and we cannot read it all. The limit on the number of potential items one can look at on the web is virtually non-existent. We have all been presented with hundreds of millions of hits from a Google search, or pages upon pages of images when looking for a pictorial keyword.
Information overflow is commonplace in our age.
Recommended publications
  • Efficient Retrieval of Matrix Factorization-Based Top-k Recommendations
    Journal of Artificial Intelligence Research 70 (2021) 1441-1479. Submitted 09/2020; published 04/2021. Efficient Retrieval of Matrix Factorization-Based Top-k Recommendations: A Survey of Recent Approaches. Dung D. Le and Hady W. Lauw, School of Computing and Information Systems, Singapore Management University, 80 Stamford Road, Singapore 178902.

    Abstract: Top-k recommendation seeks to deliver a personalized list of k items to each individual user. An established methodology in the literature based on matrix factorization (MF), which usually represents users and items as vectors in low-dimensional space, is an effective approach to recommender systems, thanks to its superior performance in terms of recommendation quality and scalability. A typical matrix factorization recommender system has two main phases: preference elicitation and recommendation retrieval. The former analyzes user-generated data to learn user preferences and item characteristics in the form of latent feature vectors, whereas the latter ranks the candidate items based on the learnt vectors and returns the top-k items from the ranked list. For preference elicitation, there have been numerous works to build accurate MF-based recommendation algorithms that can learn from large datasets. However, for the recommendation retrieval phase, naively scanning a large number of items to identify the few most relevant ones may inhibit truly real-time applications. In this work, we survey recent advances and state-of-the-art approaches in the literature that enable fast and accurate retrieval for MF-based personalized recommendations. Also, we include analytical discussions of approaches along different dimensions to provide the readers with a more comprehensive understanding of the surveyed works.
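The retrieval bottleneck this survey addresses is easy to see in the naive baseline: scoring every item by inner product with the user's latent vector and sorting. A toy sketch of that baseline (names and the plain-list vector layout are illustrative, not from the survey):

```python
def top_k(user_vec, item_vecs, k):
    """Naive recommendation retrieval: score every item by its inner
    product with the user's latent factor vector, return the k best ids.
    This full scan is O(n*d + n log n) per user, which is the cost that
    fast MF retrieval methods try to avoid."""
    scores = {
        item_id: sum(u * p for u, p in zip(user_vec, vec))
        for item_id, vec in item_vecs.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

Pruning, indexing, and quantization approaches surveyed in the paper all aim to return (approximately) the same list without touching every item.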
  • Scalable Similarity-Based Neighborhood Methods with Mapreduce
    Scalable Similarity-Based Neighborhood Methods with MapReduce. Sebastian Schelter, Christoph Boden, Volker Markl. Technische Universität Berlin, Germany.

    Abstract: Similarity-based neighborhood methods, a simple and popular approach to collaborative filtering, infer their predictions by finding users with similar taste or items that have been similarly rated. If the number of users grows to millions, the standard approach of sequentially examining each item and looking at all interacting users does not scale. To solve this problem, we develop a MapReduce algorithm for the pairwise item comparison and top-N recommendation problem that scales linearly with respect to a growing number of users. This parallel algorithm is able to work on partitioned data and is general in that it supports a wide range of similarity measures. We evaluate our algorithm on a large dataset consisting of 700 million song ratings from Yahoo! Music.

    …it is often undesirable to execute these offline computations on a single machine: this machine might fail, and with growing data sizes constant hardware upgrades might be necessary to improve the machine's performance to meet the time constraints. Due to these disadvantages, a single-machine solution can quickly become expensive and hard to operate. In order to solve this problem, recent advances in large-scale data processing propose to run data-intensive, analytical computations in a parallel and fault-tolerant manner on a large number of commodity machines. Doing so will make the execution independent of single-machine failures and will furthermore allow the increase of computational performance by simply adding more machines to the cluster.
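The per-user partitioning at the heart of such MapReduce algorithms can be mimicked sequentially: each user record independently emits its item pairs (the map step), and the pair counts are summed (the reduce step). A toy single-machine sketch of the idea, not the paper's implementation:

```python
from collections import defaultdict
from itertools import combinations

def cooccurrence_counts(user_items):
    """Count how often each item pair was interacted with by the same
    user. Each user's record is processed independently (the 'map'),
    so the work partitions cleanly across machines; the summation is
    the 'reduce'. Co-occurrence counts feed many similarity measures."""
    counts = defaultdict(int)
    for items in user_items.values():            # map: one record per user
        for a, b in combinations(sorted(set(items)), 2):
            counts[(a, b)] += 1                  # reduce: sum pair counts
    return dict(counts)
```

Because no step ever needs the full user-item matrix on one node, the count table can be sharded by item pair, which is what lets the real algorithm scale linearly with the number of users.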
  • Music Similarity: Learning Algorithms and Applications
    UNIVERSITY OF CALIFORNIA, SAN DIEGO. More like this: machine learning approaches to music similarity. A dissertation submitted in partial satisfaction of the requirements for the degree Doctor of Philosophy in Computer Science by Brian McFee. Committee in charge: Professor Sanjoy Dasgupta, Co-Chair; Professor Gert Lanckriet, Co-Chair; Professor Serge Belongie; Professor Lawrence Saul; Professor Nuno Vasconcelos. 2012. Copyright Brian McFee, 2012. All rights reserved. The dissertation of Brian McFee is approved, and it is acceptable in quality and form for publication on microfilm and electronically.

    Dedication: To my parents. Thanks for the genes, and everything since. Epigraph: "I'm gonna hear my favorite song, if it takes all night." Frank Black, "If It Takes All Night." (Clearly, the author is lamenting the inefficiencies of broadcast radio programming.) Chapter 1, Introduction: 1.1 Music information retrieval; 1.2 Summary of contributions.
  • Learning Binary Codes for Efficient Large-Scale Music Similarity Search
    LEARNING BINARY CODES FOR EFFICIENT LARGE-SCALE MUSIC SIMILARITY SEARCH. Jan Schlüter, Austrian Research Institute for Artificial Intelligence, Vienna.

    Abstract: Content-based music similarity estimation provides a way to find songs in the unpopular "long tail" of commercial catalogs. However, state-of-the-art music similarity measures are too slow to apply to large databases, as they are based on finding nearest neighbors among very high-dimensional or non-vector song representations that are difficult to index. In this work, we adopt recent machine learning methods to map such song representations to binary codes. A linear scan over the codes quickly finds a small set of likely neighbors for a query to be refined with the original expensive similarity measure.

    Compared to existing work on approximate k-NN search, what makes this quest special is the nature of state-of-the-art music similarity measures, and a low upper bound on database sizes: the largest online music store only offers 26 million songs as of February 2013, while web-scale image or document retrieval needs to handle billions of items. Among the first approaches to fast k-NN search were space partitioning trees [1]. McFee et al. [12] use an extension of k-d trees on 890,000 songs, reporting a 120-fold speedup over a full scan when missing 80% of true neighbors. No comparison to other methods is given. Hash-based methods promise cheap lookup costs. Cai et al. [2] apply Locality-Sensitive Hashing (LSH) [6] to…
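The two-stage search this abstract describes — a cheap linear scan over binary codes to shortlist candidates, followed by exact refinement with the expensive measure — can be sketched with integers standing in for the codes (XOR plus a popcount gives the Hamming distance). Names and data layout are illustrative assumptions:

```python
def hamming(a, b):
    """Hamming distance between two binary codes stored as integers."""
    return bin(a ^ b).count("1")

def shortlist(query_code, codes, n):
    """Stage 1 of a two-stage similarity search: a linear scan over
    compact binary codes keeps the n closest candidates. Stage 2 (not
    shown) would re-rank this small set with the original, expensive
    similarity measure."""
    return sorted(codes, key=lambda i: hamming(query_code, codes[i]))[:n]
```

The scan is still linear, but each comparison is a single XOR over a few machine words instead of a distance computation in a high-dimensional feature space, which is where the speedup comes from.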
  • Collaborative Filtering with Binary, Positive-Only Data
    Collaborative Filtering with Binary, Positive-only Data. Thesis submitted to obtain the degree of Doctor of Science: Computer Science, at the University of Antwerp, to be defended by Koen Verstrepen. Promotor: prof. dr. Bart Goethals. Antwerpen, 2015. Dutch title: Collaborative Filtering met Binaire, Positieve Data. Copyright © 2015 by Koen Verstrepen.

    Acknowledgements: Arriving at this point, defending my doctoral thesis, was not obvious. Not for myself, but even more so for the other people involved. First of all Kim, my wife. She must have thought that I had lost my mind, giving up my old career to start a PhD in computer science. Nevertheless, she trusted me and supported me fully. Furthermore, she did not complain once about my multiple trips to conferences, which always implied that she had to take care of our son on her own for a week. I am infinitely grateful to her. Also Bart, my advisor, took a leap of faith when he hired me. I was not one of his summa cum laude master students destined to become an academic star. Instead, I was an unknown engineer who had left university four years before. Fortunately, I somehow managed to convince him that we would arrive at this point sooner rather than later. It was no obvious decision for him, and I am grateful for the opportunity he gave me. Starting a PhD was one of the best decisions of my life. I enjoyed every minute of it, not least because of the wonderful colleagues I had throughout the years.
  • Hybrid Recommender Vae Tensorflow Implementation of Movielens
  • Collaborative Hashing
    Collaborative Hashing. Xianglong Liu, Junfeng He, Cheng Deng, Bo Lang. State Key Lab of Software Development Environment, Beihang University, Beijing, China; Facebook, 1601 Willow Rd, Menlo Park, CA, USA; Xidian University, Xi'an, China.

    Abstract: Hashing techniques have become a promising approach for fast similarity search. Most existing hashing research pursues binary codes for the same type of entities by preserving their similarities. In practice, there are many scenarios involving nearest neighbor search on data given in matrix form, where two different types of, yet naturally associated, entities respectively correspond to its two dimensions or views. To fully explore the duality between the two views, we propose a collaborative hashing scheme for data in matrix form to enable fast search in various applications such as image search using bag of words and recommendation using user-item ratings.

    …binary codes by exploiting the data correlations among the data entities (e.g., the cosine similarities between feature vectors). In practice, there are many scenarios involving nearest neighbor search on a data matrix with two dimensions corresponding to two different coupled views or entities. For instance, classic textual retrieval usually works on the term-document matrix, with each element representing the correlation between two views: words and documents. Recently, such a bag-of-words (BoW) model has also been widely used in computer vision and multimedia retrieval, where it mainly captures the correlations between local features (even the dictionaries using sparse coding) and visual objects [16, 17].
  • Recommendation Systems
    Recommendation Systems. CS 534: Machine Learning. Slides adapted from Alex Smola, Jure Leskovec, Anand Rajaraman, Jeff Ullman, Lester Mackey, Dietmar Jannach, and Gerhard Friedrich. Recommender Systems (RecSys): search and recommendations over items (products, web sites, blogs, news items, …). RecSys is everywhere: systems that provide or suggest items to end users. Long Tail Phenomenon (source: Chris Anderson, 2004); physical vs online presence. RecSys tasks: Task 1, predict user rating; Task 2, top-N recommendation. RecSys paradigms: personalized recommendation; collaborative; content-based; knowledge-based. RecSys evolution — item hierarchy: you bought a printer, you will also need ink; collaborative filtering with user-user similarity: people like you who bought beer also bought diapers; social + interest graph based: your friends like Lady Gaga, so you will like Lady Gaga; attribute based: you like action movies starring Clint Eastwood, so you will also like The Good, the Bad and the Ugly; collaborative filtering with item-item similarity: you like Godfather, so you will like Scarface; model based: training SVM, LDA, SVD for implicit features. Basic techniques, pros and cons — collaborative: no knowledge-engineering effort, serendipity of results, learns market segments, but requires some form of rating feedback and suffers cold start for new users and new items; content-based: no community required, content descriptions comparison…
  • Understanding Similarity Metrics in Neighbour-Based Recommender Systems
    Understanding Similarity Metrics in Neighbour-based Recommender Systems. Alejandro Bellogín and Arjen P. de Vries. Information Access, Centrum Wiskunde & Informatica, Science Park 123, 1098 XG Amsterdam, The Netherlands. {A.Bellogin, Arjen.de.Vries}@cwi.nl

    Abstract: Neighbour-based collaborative filtering is a recommendation technique that provides meaningful and, usually, accurate recommendations. The method's success, however, depends critically upon the similarity metric used to find the most similar users (neighbours), the basis of the predictions made. In this paper, we explore twelve features that aim to explain why some user similarity metrics perform better than others. Specifically, we define two sets of features: a first one based on statistics computed over the distance distribution in the neighbourhood, and a second one based on the nearest neighbour graph. Our experiments with a public dataset show that some of these features are able to correlate with…

    …items) are the primary source of evidence upon which this similarity is established. As CF algorithms exploit the active user's ratings to make predictions, no item descriptions are needed to provide recommendations. In this paper, we focus our attention on the memory-based class of CF algorithms that are user-based. These algorithms compute user similarities from the users' item ratings, typically based on distance and correlation metrics [9]; items not yet seen by the active user but rated by users highly similar to the active user (in terms of their item ratings) are then used to produce the recommendations. The "similar" people found (whose preferences are used to predict ratings for the active user) are usually referred to as the active user's neighbours.
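One of the metrics whose behaviour such features try to explain is plain cosine similarity over co-rated items. A minimal sketch, assuming the same dict-of-dicts rating layout used throughout neighbour-based CF (user → {item: rating}); names are illustrative:

```python
import math

def cosine(u, v, ratings):
    """Cosine similarity between users u and v over the items both
    rated. One of the standard distance/correlation metrics used to
    pick a neighbourhood in user-based collaborative filtering."""
    common = set(ratings[u]) & set(ratings[v])
    if not common:
        return 0.0
    num = sum(ratings[u][i] * ratings[v][i] for i in common)
    den = math.sqrt(sum(ratings[u][i] ** 2 for i in common)) * \
          math.sqrt(sum(ratings[v][i] ** 2 for i in common))
    return num / den if den else 0.0
```

Unlike Pearson correlation, cosine does not centre each user's ratings around their mean, which is one reason two metrics can rank the same candidate neighbours quite differently.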
  • Multi-Feature Discrete Collaborative Filtering for Fast Cold-Start Recommendation
    The Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI-20). Multi-Feature Discrete Collaborative Filtering for Fast Cold-Start Recommendation. Yang Xu, Lei Zhu, Zhiyong Cheng, Jingjing Li, Jiande Sun. Shandong Normal University; Shandong Computer Science Center (National Supercomputer Center in Jinan), Qilu University of Technology (Shandong Academy of Sciences); University of Electronic Science and Technology of China.

    Abstract: Hashing is an effective technique to address the large-scale recommendation problem, due to its high computation and storage efficiency in calculating user preferences on items. However, existing hashing-based recommendation methods still suffer from two important problems: 1) their recommendation process mainly relies on the user-item interactions and a single specific content feature; when the interaction history or the content feature is unavailable (the cold-start problem), their performance is seriously deteriorated; 2) existing methods learn the hash codes with relaxed optimization or adopt discrete coordinate descent to directly solve binary hash codes, which results in significant…

    …between their latent features. However, the time complexity for generating top-k item recommendations for all users is O(nmr + nm log k) (Zhang, Lian, and Yang 2017). Therefore, MF-based methods are often computationally expensive and inefficient when handling large-scale recommendation applications (Cheng et al. 2018; 2019). Recent studies show that hashing-based recommendation algorithms, which encode both users and items into binary codes in Hamming space, are promising to tackle the efficiency challenge (Zhang et al. 2016; 2014). In these methods, the preference score can be efficiently computed by Hamming distance.
  • FEXIPRO: Fast and Exact Inner Product Retrieval in Recommender Systems∗
    FEXIPRO: Fast and Exact Inner Product Retrieval in Recommender Systems. Hui Li, Tsz Nam Chan, Man Lung Yiu, Nikos Mamoulis. The University of Hong Kong; Hong Kong Polytechnic University.

    Abstract: Recommender systems have many successful applications in e-commerce and social media, including Amazon, Netflix, and Yelp. Matrix Factorization (MF) is one of the most popular recommendation approaches; the original user-product rating matrix R, with millions of rows and columns, is decomposed into a user matrix Q and an item matrix P, such that the product QᵀP approximates R. Each column q (p) of Q (P) holds the latent factors of the corresponding user (item), and qᵀp is a prediction of the rating of item p by user q. Recommender systems based on MF suggest to a user in q the items with the top-k scores in qᵀP.

    …learning, approximation theory, and various heuristics [2]. Example techniques include collaborative filtering [33], user-item graph models [4], regression-based models [37] and matrix factorization (MF) [26]. In this paper, we focus on MF, due to its prevalence in dealing with large user-item rating matrices. Let R be an m × n matrix with the ratings of m users on n items. MF approximately factorizes R and computes the mapping of each user and item to a d-dimensional factor vector, where d ≪ min{m, n}. The output of the learning phase based on MF is a user matrix Q ∈ R^(d×m), where the i-th column is the factor vector of the i-th user, and an…
  • Fishing in the Stream: Similarity Search Over Endless Data
    Fishing in the Stream: Similarity Search over Endless Data. Naama Kraus (Viterbi EE Department, Technion, Haifa, Israel), David Carmel (Yahoo Research), Idit Keidar (Viterbi EE Department, Technion, Haifa, Israel, and Yahoo Research).

    Abstract: Similarity search is the task of retrieving data items that are similar to a given query. In this paper, we introduce the time-sensitive notion of similarity search over endless data-streams (SSDS), which takes into account data quality and temporal characteristics in addition to similarity. SSDS is challenging as it needs to process unbounded data, while computation resources are bounded. We propose Stream-LSH, a randomized SSDS algorithm that bounds the index size by retaining items according to their freshness, quality, and dynamic popularity attributes. We analytically show that Stream-LSH increases the probability of finding similar items compared to alternative approaches using the same space capacity.

    …applications ought to take into account temporal metrics in addition to similarity [19, 30, 29, 25, 31, 40, 23, 36, 39, 34]. Nevertheless, the similarity search primitive has not been extended to handle endless data-streams. To this end, we introduce in Section 2 the problem of similarity search over data streams (SSDS). In order to efficiently retrieve such content at runtime, an SSDS algorithm needs to maintain an index of streamed data. The challenge, however, is that the stream is unbounded, whereas physical space capacity cannot grow without bound; this limitation is particularly acute when the index resides in RAM for fast retrieval [39, 33]. A key aspect of an SSDS algorithm is therefore its retention policy, which…