An Analysis of Memory Based Collaborative Filtering Recommender Systems with Improvement Proposals
Master in Artificial Intelligence (UPC-URV-UB)
Master of Science Thesis

An Analysis of Memory Based Collaborative Filtering Recommender Systems with Improvement Proposals

Claudio Adrian Levinas
Advisor: María Salamó Llorente
September 2014

Abstract

Memory Based Collaborative Filtering Recommender Systems have been around for the best part of the last twenty years. It is a mature technology, implemented in numerous commercial applications. In recent years, however, there has been a departure from Memory Based systems in favour of Model Based systems. The Netflix.com competition of 2006 brought the Model Based paradigm into the spotlight, and plenty of research followed. Still, these matrix factorization based algorithms are computationally expensive and cumbersome to update. Memory Based approaches, on the other hand, are simple, fast, and self-explanatory. We posit that there are still uncomplicated approaches that can improve this family of Recommender Systems further.

Four strategies aimed at improving the Accuracy of Memory Based Collaborative Filtering Recommender Systems have been proposed and extensively tested. The strategies put forward include an Average Item Voting approach to infer missing ratings, an Indirect Estimation algorithm which pre-estimates the missing ratings before computing the overall recommendation, a Class Type Grouping strategy to filter out items of a class different from the target one, and a Weighted Ensemble consisting of the average of an estimation computed with all samples and one obtained via the Class Type Grouping approach.
This work will show that there is still ample space to improve Memory Based systems, and to raise their Accuracy to the point where they can compete with state-of-the-art Model Based approaches such as Matrix Factorization or Singular Value Decomposition techniques, which require considerable processing power and generate models that become obsolete as soon as users add new ratings into the system.

Acknowledgements

Artificial Intelligence is a fascinating topic, which will certainly touch our lives in the years to come. But out of the many branches of this rich discipline, Recommender Systems attracted me particularly. This I owe to the teachings of María Salamó Llorente, who introduced me to the topic, patiently answered all my numerous questions, and, after the course was completed, was kind enough to agree to supervise my Thesis. I admire her patience and her focus. Without her, this work would be half as interesting and half as useful.

No man is an island. And I would never have made it this far, this advanced in life, without the support of my wife and my son. They are my pillars. They are my ground. They are my light. Little makes sense without them. Thank you both. Thank you for pushing me further and higher.

In memory of my mother.

Dreams, like faith, might well move mountains. But reality is very keen on bringing them down. So, know your limits better than your dreams. Then, perhaps, you might achieve them one day.

Contents

1 Introduction
1.1 Definition of the Problem
1.2 Objectives of this Work
1.3 Summary
1.4 Reader's Guide
2 State of the Art
2.1 Historical Overview
2.2 Recommender Systems
2.2.1 User vs Item Based
2.2.2 Memory vs Model Based
2.3 User Ratings
2.3.1 Explicit vs Implicit
2.3.2 Rating Scales
2.3.3 Normalizations
2.4 Similarity Metrics
2.4.1 Pearson Correlation
2.4.2 Cosine Distance
2.4.3 Adjusted Cosine Distance
2.4.4 Mean Squared Distance
2.4.5 Euclidean Distance
2.4.6 Spearman Correlation
2.5 Neighbourhood
2.5.1 Top N-Neighbours
2.5.2 Threshold Filtering
2.6 Rating Prediction
2.6.1 Recommender Algorithm
2.7 Assessment Metrics
2.7.1 Coverage
2.7.2 Accuracy
2.8 Improvement Strategies
2.8.1 Significance Weighting
2.8.2 Default Voting
2.8.3 Context Aware
2.9 Typical Problems
2.9.1 Sparsity
2.9.2 Cold Start
2.10 Summary
3 Proposals
3.1 Description of Proposals
3.1.1 Default Item Voting
3.1.2 Indirect Estimation
3.1.3 Class Type Grouping
3.1.4 Weighted Ensemble
3.2 Item Based Formulations
3.3 Summary
4 Experiments
4.1 Data
4.2 Methodology
4.3 Description
4.4 Results
4.5 Analysis
4.6 Discussion
4.7 Summary
5 Conclusions and Future Work
5.1 Conclusion
5.2 Future Work
Bibliography
Appendices
A Fast Algorithm Alternatives
B Summary of Results

List of Figures

4.1 User Based Coverage vs Accuracy progression
4.2 User and Item with Euclidean Distance similarity Coverage vs Accuracy
4.3 User and Item with Pearson Correlation similarity Coverage vs Accuracy
4.4 User and Item with Cosine Distance similarity Coverage vs Accuracy
4.5 Friedman Test of User and Item Based Approaches
4.6 ANOVA Test of Similarity Functions
4.7 ANOVA Test of Significance Weighting Algorithm
4.8 ANOVA Test of User vs Item Based Approaches
4.9 ANOVA Test of Plain, User, Item and Indirect Algorithms
4.10 ANOVA Test of Grouping and Weighting Algorithms

List of Tables

4.1 Neighbourhood size per Similarity Function and Algorithm
4.2 User Based with Euclidean Distance similarity, Default Estimation
4.3 Item Based with Euclidean Distance similarity, Default Estimation
4.4 User Based with Pearson Correlation similarity, Default Estimation
4.5 Item Based with Pearson Correlation similarity, Default Estimation
4.6 User Based with Cosine Distance similarity, Default Estimation
4.7 Item Based with Cosine Distance similarity, Default Estimation
4.8 Prepare Indirect offline algorithm, Default Estimation
4.9 User Based with Euclidean Distance similarity, Class Type Grouping
4.10 Item Based with Euclidean Distance similarity, Class Type Grouping
4.11 User Based with Pearson Correlation similarity, Class Type Grouping
4.12 Item Based with Pearson Correlation similarity, Class Type Grouping
4.13 User Based with Cosine Distance similarity, Class Type Grouping
4.14 Item Based with Cosine Distance similarity, Class Type Grouping
4.15 Prepare Indirect offline algorithm, Class Type Grouping
4.16 User Based with Euclidean Distance similarity, Weighted Ensemble
4.17 Item Based with Euclidean Distance similarity, Weighted Ensemble
4.18 User Based with Pearson Correlation similarity, Weighted Ensemble
4.19 Item Based with Pearson Correlation similarity, Weighted Ensemble
4.20 User Based with Cosine Distance similarity, Weighted Ensemble
4.21 Item Based with Cosine Distance similarity, Weighted Ensemble
4.22 User Based MAE Accuracy results for all Algorithms
4.23 User Based results of Friedman Test
4.24 Item Based MAE Accuracy results for all Algorithms
4.25 Item Based results of Friedman Test
4.26 ANOVA Test of Similarity Functions
4.27 ANOVA Test of Significance Weighting Algorithm
4.28 ANOVA Test of User vs Item Based Approaches
4.29 ANOVA Test of Plain, User, Item and Indirect Algorithms
4.30 ANOVA Test of Grouping and Weighting Algorithms
4.31 Comparative of Thesis Accuracy results against others published
A.1 Fast Algorithms, Euclidean Distance similarity, Default Estimation
A.2 Fast Algorithms, Pearson Correlation similarity, Default Estimation
A.3 Fast Algorithms, Cosine Distance similarity, Default Estimation
A.4 Fast Algorithms, Euclidean Distance similarity, Class Type Grouping
A.5 Fast Algorithms, Pearson Correlation similarity, Class Type Grouping
A.6 Fast Algorithms, Cosine Distance similarity, Class Type Grouping
A.7 Fast Algorithms, Euclidean Distance similarity, Weighted Ensemble
A.8 Fast Algorithms, Pearson Correlation similarity, Weighted Ensemble
A.9 Fast Algorithms, Cosine Distance similarity, Weighted Ensemble
B.1 Summary of Euclidean Distance similarity results
B.2 Summary of Pearson Correlation similarity results
B.3 Summary of Cosine Distance similarity results

Chapter 1

Introduction

1.1 Definition of the Problem

Until the advent of the internet, shoppers browsing for potential merchandise to purchase would either follow their innate likes and tastes, or follow those of a person they especially trusted. Arguably, we all have a natural trait that compels us to classify things: this we like; this we don't. And in principle it is possible, albeit difficult, to browse all the merchandise in a shop or shopping centre and come out with the one piece that is really worth buying. The "selection by inspection" paradigm works wonders, as it places a high bar between the things we would like to consider further and those we can immediately discard. However, there is a fundamental assumption behind the success of this behavioural pattern that bears remembering: it presupposes that the number of items we will sort through is manageable. But if the internet has taught us anything in the past twenty years, it is that information grows exponentially, and we cannot read it all. The limit on the number of potential items one can look at on the web is virtually non-existent. We have all been presented with hundreds of millions of hits from a Google search, or pages upon pages of images when looking for a pictorial keyword.
Information overflow is commonplace in our age.